Preprocessor
The Preprocessor class provides methods to automatically extract event sequences from various common data formats. To start sequencing, first create the Preprocessor object.
- class preprocessor.Preprocessor(length, timeout, NO_EVENT=-1337)
Preprocessor for loading data from standard data formats.
- Preprocessor.__init__(length, timeout, NO_EVENT=-1337)
Preprocessor for loading data from standard data formats.
- Parameters
length (int) – Number of events in context.
timeout (float) – Maximum time between context event and the actual event in seconds.
NO_EVENT (int, default=-1337) – ID of NO_EVENT event, i.e., event returned for context when no event was present. This happens in case of timeout or if an event simply does not have enough preceding context events.
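To make the roles of length, timeout and NO_EVENT concrete, here is a minimal pure-Python sketch of how a context window could be built for one machine. It is not the library's implementation: the helper name build_context, the left-padding, and the per-event masking of stale context entries are all assumptions made for illustration.

```python
NO_EVENT = -1337  # default ID documented above


def build_context(timestamps, events, length, timeout, no_event=NO_EVENT):
    """Return one context list per event for a single machine.

    Hypothetical helper, for illustration only. Assumes `timestamps`
    and `events` are equally long and sorted by time.
    """
    contexts = []
    for i, t in enumerate(timestamps):
        # Take at most `length` events that precede event i
        start = max(0, i - length)
        window = [
            # Assumption: a preceding event older than `timeout` seconds
            # is replaced by NO_EVENT
            e if t - ts <= timeout else no_event
            for ts, e in zip(timestamps[start:i], events[start:i])
        ]
        # Assumption: pad on the left when too few preceding events exist
        contexts.append([no_event] * (length - len(window)) + window)
    return contexts


timestamps = [0.0, 10.0, 20.0, 500.0]
events = [3, 5, 7, 9]
contexts = build_context(timestamps, events, length=2, timeout=60.0)
print(contexts)
```

In this sketch the first event has no predecessors and the last event arrives long after the others, so both end up with a context made entirely of NO_EVENT.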
Formats
We currently support the following formats:
- .csv files containing a header row that specifies the columns ‘timestamp’, ‘event’ and ‘machine’.
- .txt files containing a line for each machine and a sequence of events (integers) separated by spaces.
Transforming .csv files into sequences is the quickest method and is done by the following method call:
- Preprocessor.csv(path, nrows=None, labels=None, verbose=False)
Preprocess data from csv file.
Note
Format: The assumed format of a .csv file is that the first line of the file contains the headers, which should include timestamp, machine, event (and optionally label). The remaining lines of the .csv file will be interpreted as data.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in file. Note: will overwrite any ‘label’ data in input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for each event. None if no labels parameter is given and the data does not contain a ‘label’ column.
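As an illustration of the expected .csv layout, the following sketch writes a small file with the required header row using only the standard library, then reads it back and groups the events per machine in time order. The machine names, event IDs and the grouping step are made up for the example and are not taken from the library.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Illustrative rows; values are invented for this example
rows = [
    {"timestamp": 1.0, "event": 4, "machine": "host-a"},
    {"timestamp": 2.0, "event": 7, "machine": "host-b"},
    {"timestamp": 3.0, "event": 5, "machine": "host-a"},
]

# Write a file whose first line is the header row described in the note
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", newline="", delete=False
) as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "event", "machine"])
    writer.writeheader()
    writer.writerows(rows)
    path = f.name

# Read the file back and collect (timestamp, event) pairs per machine
per_machine = defaultdict(list)
with open(path, newline="") as f:
    for row in csv.DictReader(f):
        per_machine[row["machine"]].append(
            (float(row["timestamp"]), int(row["event"]))
        )
os.remove(path)

# Order each machine's events by timestamp and keep only the event IDs
sequences = {m: [e for _, e in sorted(pairs)] for m, pairs in per_machine.items()}
print(sequences)
```

A file laid out like this is what the path argument of Preprocessor.csv is expected to point to.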
Transforming .txt files into sequences is slower, but still possible using the following method call:
- Preprocessor.text(path, nrows=None, labels=None, verbose=False)
Preprocess data from text file.
Note
Format: The assumed format of a text file is that each line in the text file contains a space-separated sequence of event IDs for a machine. That is, for n machines, there will be n lines in the file.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in file. Note: will overwrite any ‘label’ data in input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for each event. None if no labels parameter is given and the data does not contain a ‘label’ column.
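The text format can be illustrated with a short parser written against the description above. This is a sketch, not the library's code: the helper name read_event_lines is hypothetical, and it assumes that nrows counts lines (one machine per line).

```python
from itertools import islice


def read_event_lines(lines, nrows=None):
    """Parse space-separated integer event IDs, one machine per line.

    Hypothetical helper, for illustration only. `nrows=None` reads
    every line; otherwise only the first `nrows` lines are read.
    """
    sequences = []
    for line in islice(lines, nrows):
        line = line.strip()
        if line:  # skip blank lines
            sequences.append([int(token) for token in line.split()])
    return sequences


# Three machines, so three lines
sample = ["1 2 3 4", "5 6", "7 8 9"]
all_sequences = read_event_lines(sample)
first_two = read_event_lines(sample, nrows=2)
print(all_sequences)
print(first_two)
```

The same function works on an open file handle, since iterating over a file yields one line at a time.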
Future supported formats
Note
These formats already have an API entry point, but are currently NOT supported.
- .json files containing values for ‘timestamp’, ‘event’ and ‘machine’.
- .ndjson files where each line contains a JSON object with keys ‘timestamp’, ‘event’ and ‘machine’.
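Since these formats are not yet supported, the following only sketches what an .ndjson input with the listed keys might look like and how it can be checked with the standard json module. The sample values are invented, and nothing here reflects how the library will eventually parse such files.

```python
import json

# Two illustrative records, one JSON object per line
ndjson_text = (
    '{"timestamp": 1.0, "event": 4, "machine": "host-a"}\n'
    '{"timestamp": 2.0, "event": 7, "machine": "host-b"}\n'
)

# Decode each non-empty line independently
records = [
    json.loads(line) for line in ndjson_text.splitlines() if line.strip()
]

# Check that every record carries the three keys named above
required = {"timestamp", "event", "machine"}
valid = all(required.issubset(record) for record in records)
print(len(records), valid)
```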
- Preprocessor.json(path, labels=None, verbose=False)
Preprocess data from json file.
Note
json preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in file. Note: will overwrite any ‘label’ data in input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for each event. None if no labels parameter is given and the data does not contain a ‘label’ column.
- Preprocessor.ndjson(path, labels=None, verbose=False)
Preprocess data from ndjson file.
Note
ndjson preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in file. Note: will overwrite any ‘label’ data in input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for each event. None if no labels parameter is given and the data does not contain a ‘label’ column.