Preprocessor
The Preprocessor class provides methods to automatically extract event sequences from various common data formats. To start sequencing, first create the Preprocessor object.
- class preprocessor.Preprocessor(length, timeout, NO_EVENT=-1337)
Preprocessor for loading data from standard data formats.
- Preprocessor.__init__(length, timeout, NO_EVENT=-1337)
Preprocessor for loading data from standard data formats.
- Parameters
length (int) – Number of events in context.
timeout (float) – Maximum time, in seconds, between a context event and the actual event.
NO_EVENT (int, default=-1337) – ID of NO_EVENT event, i.e., event returned for context when no event was present. This happens in case of timeout or if an event simply does not have enough preceding context events.
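To make the interplay of length, timeout and NO_EVENT concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name build_context is hypothetical, and it assumes a single machine with events sorted by time and NO_EVENT placed in the position of each missing context event.

```python
NO_EVENT = -1337  # default placeholder ID from the signature above


def build_context(timestamps, events, length, timeout, no_event=NO_EVENT):
    """Return one context window of `length` events for each event.

    Hypothetical helper for a single machine whose events are sorted by time.
    """
    contexts = []
    for i, now in enumerate(timestamps):
        window = []
        # Walk over the `length` positions directly preceding event i.
        for j in range(i - length, i):
            if j < 0 or now - timestamps[j] > timeout:
                # Not enough preceding events, or the event is too old.
                window.append(no_event)
            else:
                window.append(events[j])
        contexts.append(window)
    return contexts


timestamps = [0.0, 1.0, 2.0, 100.0]
events = [10, 11, 12, 13]
contexts = build_context(timestamps, events, length=2, timeout=30.0)
for event, context in zip(events, contexts):
    print(event, context)
```

In this sketch the first event has no predecessors and the last event arrives long after the others, so both end up with a context made entirely of NO_EVENT.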
Formats
We currently support the following formats:
- .csv files containing a header row that specifies the columns ‘timestamp’, ‘event’ and ‘machine’.
- .txt files containing a line for each machine and a sequence of events (integers) separated by spaces.
Transforming .csv files into sequences is the quickest method and is done by the following method call:
- Preprocessor.csv(path, nrows=None, labels=None, verbose=False)
Preprocess data from csv file.
Note
Format: The assumed format of a .csv file is that the first line of the file contains the headers, which should include timestamp, machine, event (and optionally label). The remaining lines of the .csv file will be interpreted as data.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels. Will be None if no labels parameter is given and the data does not contain a ‘label’ column.
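The .csv layout described above can be illustrated with the standard library alone. The following is a rough sketch, not the code behind Preprocessor.csv: the sample rows and host names are made up, and it assumes that events are integer IDs and that sequences are built per machine in timestamp order.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data in the layout described above.
sample = io.StringIO(
    "timestamp,machine,event\n"
    "3.0,host-a,7\n"
    "1.0,host-a,5\n"
    "2.0,host-b,9\n"
)

# Group (timestamp, event) pairs per machine.
per_machine = defaultdict(list)
for row in csv.DictReader(sample):
    per_machine[row["machine"]].append((float(row["timestamp"]), int(row["event"])))

# Sort each machine's events chronologically and keep only the event IDs.
sequences = {
    machine: [event for _, event in sorted(pairs)]
    for machine, pairs in per_machine.items()
}
print(sequences)
```

The sketch stops at plain Python lists, whereas Preprocessor.csv is documented to return the events and their contexts as tensors.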
Transforming .txt files into sequences is slower, but still possible using the following method call:
- Preprocessor.text(path, nrows=None, labels=None, verbose=False)
Preprocess data from text file.
Note
Format: The assumed format of a text file is that each line in the text file contains a space-separated sequence of event IDs for a machine. I.e. for n machines, there will be n lines in the file.
- Parameters
path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels. Will be None if no labels parameter is given and the data does not contain a ‘label’ column.
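As an illustration of the text format, the sketch below parses two made-up lines into one integer sequence per machine. It only shows the file layout and makes no claim about how Preprocessor.text processes the file internally; skipping blank lines is an assumption.

```python
import io

# Hypothetical sample: two machines, one line each.
sample = io.StringIO("1 2 3 4\n5 6\n")

sequences = [
    [int(token) for token in line.split()]
    for line in sample
    if line.strip()  # skip blank lines
]
print(sequences)
```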
Future supported formats
Note
These formats already have an API entry point, but are currently NOT supported.
- .json files containing values for ‘timestamp’, ‘event’ and ‘machine’.
- .ndjson files where each line contains a JSON object with keys ‘timestamp’, ‘event’ and ‘machine’.
- Preprocessor.json(path, labels=None, verbose=False)
Preprocess data from json file.
Note
json preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels. Will be None if no labels parameter is given and the data does not contain a ‘label’ column.
- Preprocessor.ndjson(path, labels=None, verbose=False)
Preprocess data from ndjson file.
Note
ndjson preprocessing will become available in a future version.
- Parameters
path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.
- Returns
events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels. Will be None if no labels parameter is given and the data does not contain a ‘label’ column.
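Since ndjson input is not yet supported, one possible workaround is to convert such data into the supported .csv layout first. The sketch below shows this with the standard library only; the sample records are made up, and it assumes each line holds a JSON object with exactly the keys ‘timestamp’, ‘event’ and ‘machine’.

```python
import csv
import io
import json

# Hypothetical ndjson input: one JSON object per line.
ndjson_lines = [
    '{"timestamp": 1.0, "event": 5, "machine": "host-a"}',
    '{"timestamp": 2.0, "event": 9, "machine": "host-b"}',
]

columns = ["timestamp", "machine", "event"]
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=columns, lineterminator="\n")
writer.writeheader()
for line in ndjson_lines:
    record = json.loads(line)
    # Keep only the three documented keys, in header order.
    writer.writerow({key: record[key] for key in columns})

csv_text = output.getvalue()
print(csv_text)
```

Writing csv_text to a file would give an input in the header layout that the .csv format above expects.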