Welcome to DeepLog’s documentation!
DeepLog provides a pytorch implementation of Deeplog: Anomaly detection and diagnosis from system logs through deep learning. This code was implemented as part of the IEEE S&P 2022 paper DeepCASE: Semi-Supervised Contextual Analysis of Security Events. We ask that you cite both works when using this software for academic research; see Citing for more information.
Installation
The most straightforward way of installing DeepLog is via pip:
pip install deeplog
From source
If you wish to stay up to date with the latest development version, you can instead download the source code. In this case, make sure that you have all the required dependencies installed.
Once the dependencies have been installed, run:
pip install -e <path/to/directory/containing/deeplog/setup.py>
Dependencies
DeepLog requires the following python packages to be installed:
argformat: https://github.com/Thijsvanede/argformat
numpy: https://numpy.org/
scikit-learn: https://scikit-learn.org/
pytorch: https://pytorch.org/
All dependencies should be automatically installed if you install DeepLog via pip. However, should you want to install these libraries manually, you can install the dependencies using the requirements.txt file:
pip install -r requirements.txt
Or you can install the libraries yourself:
pip install -U argformat numpy scikit-learn torch
Usage
This section gives a high-level overview of the modules implemented by DeepLog. Furthermore, it provides insight into the use of the command line tool. We also include several working examples to guide users through the code. For detailed documentation of individual methods, we refer to the Reference guide.
Overview
This section explains the design of DeepLog at a high level.
DeepLog is a network implemented as a torch-train Module, an extension of torch.nn.Module that adds automatic fit() and predict() methods. This means it can be trained and used like any other neural network module in the pytorch library. In addition, through the torch-train library we provide automatic methods to train on and predict events given previous event sequences. This follows a scikit-learn approach with fit(), predict() and fit_predict() methods; we refer to the torch-train documentation for a detailed description.
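A minimal sketch of this workflow (the parameter values are illustrative, and X and y are tensors of event sequences and next events as produced by the Preprocessor in the Working example below):
# Minimal sketch of the scikit-learn-style workflow.
from deeplog import DeepLog

deeplog = DeepLog(input_size=300, hidden_size=64, output_size=300)

# X: event sequences, y: next events (see the Working example below)
deeplog.fit(X, y, epochs=10, batch_size=128)    # train the network
y_pred, confidence = deeplog.predict(X, k=3)    # k most likely next events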
Command line tool
When DeepLog is installed, it can be used from the command line. The __main__.py file in the deeplog module implements this command line tool. The command line tool provides a quick and easy interface to predict sequences from .csv files. The full command line usage is given in its help page:
usage: deeplog.py [-h] [--csv CSV] [--txt TXT] [--length LENGTH] [--timeout TIMEOUT] [--hidden HIDDEN]
[-i INPUT] [-l LAYERS] [-k TOP] [--save SAVE] [--load LOAD] [-b BATCH_SIZE]
[-d DEVICE] [-e EPOCHS]
{train,predict}
Deeplog: Anomaly detection and diagnosis from system logs through deep learning
positional arguments:
{train,predict} mode in which to run DeepLog
optional arguments:
-h, --help show this help message and exit
Input parameters:
--csv CSV CSV events file to process
--txt TXT TXT events file to process
--length LENGTH sequence LENGTH (default = 20)
--timeout TIMEOUT sequence TIMEOUT (seconds) (default = inf)
DeepLog parameters:
--hidden HIDDEN hidden dimension (default = 64)
-i, --input INPUT input dimension (default = 300)
-l, --layers LAYERS number of lstm layers to use (default = 2)
-k, --top TOP accept any of the TOP predictions (default = 1)
--save SAVE save DeepLog to specified file
--load LOAD load DeepLog from specified file
Training parameters:
-b, --batch-size BATCH_SIZE batch size (default = 128)
-d, --device DEVICE train using given device (cpu|cuda|auto) (default = auto)
-e, --epochs EPOCHS number of epochs to train with (default = 10)
Examples
Use the first half of <data.csv> to train DeepLog and the second half of <data.csv> to predict and test the prediction.
python3 -m deeplog train --csv <data.csv> --save deeplog.save # Training
python3 -m deeplog predict --csv <data.csv> --load deeplog.save # Predicting
Code
To use DeepLog in your own project, you can use it as a standalone module. Here we show some simple examples of how to use the DeepLog package in your own python code. For complete documentation we refer to the Reference guide.
Import
To import components from DeepLog, simply use the following format:
from deeplog import <Object>
from deeplog.<module> import <Object>
For example, the following code imports the DeepLog neural network as found in the Reference.
# Imports
from deeplog import DeepLog
Working example
In this example, we load data from either a .csv or .txt file and use that data to train and predict with DeepLog.
# import torch (for the optional cuda check below), DeepLog and Preprocessor
import torch

from deeplog import DeepLog
from deeplog.preprocessor import Preprocessor
##############################################################################
# Load data #
##############################################################################
# Create preprocessor for loading data
preprocessor = Preprocessor(
length = 20, # Extract sequences of 20 items
timeout = float('inf'), # Do not include a maximum allowed time between events
)
# Load data from csv file
X, y, label, mapping = preprocessor.csv("<path/to/file.csv>")
# Load data from txt file
X, y, label, mapping = preprocessor.txt("<path/to/file.txt>")
##############################################################################
# DeepLog #
##############################################################################
# Create DeepLog object
deeplog = DeepLog(
input_size = 300, # Number of different events to expect
hidden_size = 64 , # Hidden dimension, we suggest 64
output_size = 300, # Number of different events to expect
)
# Optionally cast data and DeepLog to cuda, if available
if torch.cuda.is_available():
    deeplog = deeplog.to("cuda")
    X       = X.to("cuda")
    y       = y.to("cuda")
# Train deeplog
deeplog.fit(
X = X,
y = y,
epochs = 10,
batch_size = 128,
)
# Predict using deeplog
y_pred, confidence = deeplog.predict(
X = X,
y = y,
k = 3,
)
Reference
This is the reference documentation for the classes and methods provided by the DeepLog module.
Preprocessor
The Preprocessor class provides methods to automatically extract event sequences from various common data formats. To start sequencing, first create the Preprocessor object.
- class preprocessor.Preprocessor(length, timeout, NO_EVENT=-1337)[source]
Preprocessor for loading data from standard data formats.
- Preprocessor.__init__(length, timeout, NO_EVENT=-1337)[source]
Preprocessor for loading data from standard data formats.
- Parameters
length (int) – Number of events in context.
timeout (float) – Maximum time between context event and the actual event in seconds.
NO_EVENT (int, default=-1337) – ID of NO_EVENT event, i.e., event returned for context when no event was present. This happens in case of timeout or if an event simply does not have enough preceding context events.
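For example, a Preprocessor that extracts sequences of 20 events with no timeout can be created as follows (the values are illustrative; NO_EVENT is shown with its default):
from deeplog.preprocessor import Preprocessor

# Extract length-20 sequences, allow any time between events,
# and keep the default NO_EVENT value for missing context.
preprocessor = Preprocessor(
    length   = 20,
    timeout  = float('inf'),
    NO_EVENT = -1337,
)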
Formats
We currently support the following formats (illustrated below):
- .csv files containing a header row that specifies the columns ‘timestamp’, ‘event’ and ‘machine’.
- .txt files containing a line for each machine and a sequence of events (integers) separated by spaces.
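For illustration, a .csv input could look as follows (the values, the column order and the optional label column are made up for this example):
timestamp,machine,event,label
1611486368,machine_a,17,0
1611486370,machine_a,3,0
1611486371,machine_b,42,1
and a .txt input contains one line of space-separated event IDs per machine:
17 3 3 8 42 17
42 42 5 17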
Transforming .csv files into sequences is the quickest method and is done by the following method call:
- Preprocessor.csv(path, nrows=None, labels=None, verbose=False)[source]
Preprocess data from csv file.
Note
Format: The assumed format of a .csv file is that the first line of the file contains the headers, which should include timestamp, machine, event (and optionally label). The remaining lines of the .csv file will be interpreted as data.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels will be None if no labels parameter is given and the data does not contain any ‘labels’ column.
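As in the Working example above, a call could look as follows (the path is a placeholder and the unpacking mirrors that example):
X, y, label, mapping = preprocessor.csv(
    "<path/to/file.csv>",   # placeholder path to a csv log
    nrows   = None,         # read the entire file
    verbose = True,         # print progress while sequencing
)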
Transforming .txt files into sequences is slower, but still possible using the following method call:
- Preprocessor.text(path, nrows=None, labels=None, verbose=False)[source]
Preprocess data from text file.
Note
Format: The assumed format of a text file is that each line in the text file contains a space-separated sequence of event IDs for a machine. I.e. for n machines, there will be n lines in the file.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels will be None if no labels parameter is given and the data does not contain any ‘labels’ column.
Future supported formats
Note
These formats already have an API entrance, but are currently NOT supported.
- .json files containing values for ‘timestamp’, ‘event’ and ‘machine’.
- .ndjson files where each line contains a json object with keys ‘timestamp’, ‘event’ and ‘machine’.
- Preprocessor.json(path, labels=None, verbose=False)[source]
Preprocess data from json file.
Note
json preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels will be None if no labels parameter is given and the data does not contain any ‘labels’ column.
- Preprocessor.ndjson(path, labels=None, verbose=False)[source]
Preprocess data from ndjson file.
Note
ndjson preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels will be None if no labels parameter is given and the data does not contain any ‘labels’ column.
DeepLog
The DeepLog class uses the torch-train library for training and prediction. This class implements the neural network as described in the paper Deeplog: Anomaly detection and diagnosis from system logs through deep learning.
Initialization
- DeepLog.__init__(input_size, hidden_size, output_size, num_layers=2)[source]
DeepLog model used for training and predicting logs.
- Parameters
input_size (int) – Dimension of input layer.
hidden_size (int) – Dimension of hidden layer.
output_size (int) – Dimension of output layer.
num_layers (int, default=2) – Number of hidden layers, i.e. stacked LSTM modules.
Forward
As DeepLog is a Neural Network, it implements the forward()
method which passes input through the entire network.
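A minimal sketch of a forward pass (this assumes, as the predict() documentation below suggests, that X is a tensor of event-index sequences and that one-hot encoding is handled internally):
from deeplog import DeepLog

deeplog = DeepLog(input_size=300, hidden_size=64, output_size=300)

# X is assumed to be a torch.Tensor of shape=(n_samples, seq_len)
# containing event indices, as produced by the Preprocessor.
output = deeplog(X)   # equivalent to deeplog.forward(X)
# output: for each sequence, a probability distribution over the
# output_size possible next events (see Predict below).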
Fit
DeepLog inherits its fit method from the torch-train module. See the documentation for a complete reference.
- DeepLog.fit(X, y, epochs=10, batch_size=32, learning_rate=0.01, criterion=torch.nn.NLLLoss, optimizer=torch.optim.SGD, variable=False, verbose=True, **kwargs)
Train the module with given parameters
- Parameters
X (torch.Tensor) – Tensor to train with
y (torch.Tensor) – Target tensor
epochs (int, default=10) – Number of epochs to train with
batch_size (int, default=32) – Default batch size to use for training
learning_rate (float, default=0.01) – Learning rate to use for optimizer
criterion (nn.Loss, default=nn.NLLLoss) – Loss function to use
optimizer (optim.Optimizer, default=optim.SGD) – Optimizer to use for training
variable (boolean, default=False) – If True, accept inputs of variable length
verbose (boolean, default=True) – If True, prints training progress
- Returns
result – Returns self
- Return type
self
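For example, training with a non-default learning rate and optimizer could look as follows (a sketch based on the signature above; the choice of Adam and the values shown are illustrative, not recommendations from the paper):
import torch

# Fit DeepLog using the parameters documented above.
deeplog.fit(
    X             = X,                  # training sequences
    y             = y,                  # next-event targets
    epochs        = 10,
    batch_size    = 128,
    learning_rate = 0.001,
    criterion     = torch.nn.NLLLoss,   # default loss
    optimizer     = torch.optim.Adam,   # swapped in for the default SGD
)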
Predict
The regular network gives a probability distribution over all possible output values. However, DeepLog outputs the k most likely outputs; it therefore overrides the predict() method of the Module class from torch-train.
- DeepLog.predict(X, y=None, k=1, variable=False, verbose=True)[source]
Predict the k most likely output values
- Parameters
X (torch.Tensor of shape=(n_samples, seq_len)) – Input of sequences, these will be one-hot encoded to an array of shape=(n_samples, seq_len, input_size)
y (Ignored) – Ignored
k (int, default=1) – Number of output items to generate
variable (boolean, default=False) – If True, predict inputs of different sequence lengths
verbose (boolean, default=True) – If True, print output
- Returns
result (torch.Tensor of shape=(n_samples, k)) – k most likely outputs
confidence (torch.Tensor of shape=(n_samples, k)) – Confidence levels for each output
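As a usage sketch, an event can be flagged as anomalous when the observed next event is not among the k predicted candidates (this mirrors the -k/--top option of the command line tool; the thresholding itself is not part of predict()):
# Predict the k=3 most likely next events for each sequence.
y_pred, confidence = deeplog.predict(X, k=3)

# Flag an event as anomalous if the observed next event y is not among
# the k predicted candidates (y is assumed to have shape=(n_samples,)).
anomalous = ~(y_pred == y.unsqueeze(1)).any(dim=1)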
Contributors
This page lists all the contributors to this project. If you want to be involved in maintaining code or adding new features, please email t(dot)s(dot)vanede(at)utwente(dot)nl.
Code
Thijs van Ede
Academic Contributors
Thijs van Ede
Hojjat Aghakhani
Noah Spahn
Riccardo Bortolameotti
Marco Cova
Andrea Continella
Maarten van Steen
Andreas Peter
Christopher Kruegel
Giovanni Vigna
License
MIT License
Copyright (c) 2021 Thijs van Ede
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Citing
To cite DeepLog please use the following publications:
van Ede, T., Aghakhani, H., Spahn, N., Bortolameotti, R., Cova, M., Continella, A., van Steen, M., Peter, A., Kruegel, C. & Vigna, G. (2022, May). DeepCASE: Semi-Supervised Contextual Analysis of Security Events. In 2022 Proceedings of the IEEE Symposium on Security and Privacy (S&P). IEEE. [PDF DeepCASE]
Du, M., Li, F., Zheng, G., & Srikumar, V. (2017). Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS) (pp. 1285-1298). [PDF DeepLog]
Bibtex
DeepCASE
@inproceedings{vanede2020deepcase,
title={{DeepCASE: Semi-Supervised Contextual Analysis of Security Events}},
author={van Ede, Thijs and Aghakhani, Hojjat and Spahn, Noah and Bortolameotti, Riccardo and Cova, Marco and Continella, Andrea and van Steen, Maarten and Peter, Andreas and Kruegel, Christopher and Vigna, Giovanni},
booktitle={Proceedings of the IEEE Symposium on Security and Privacy (S&P)},
year={2022},
organization={IEEE}
}
DeepLog
@inproceedings{du2017deeplog,
title={Deeplog: Anomaly detection and diagnosis from system logs through deep learning},
author={Du, Min and Li, Feifei and Zheng, Guineng and Srikumar, Vivek},
booktitle={Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security},
pages={1285--1298},
year={2017}
}