Preprocessor

The Preprocessor class provides methods to automatically extract event sequences from various common data formats. To start sequencing, first create the Preprocessor object.

class preprocessor.Preprocessor(length, timeout, NO_EVENT=-1337): Preprocessor for loading data from standard data formats.

Preprocessor.__init__(length, timeout, NO_EVENT=-1337)

Preprocessor for loading data from standard data formats.

Parameters

length (int) – Number of events in context.
timeout (float) – Maximum time between context event and the actual event in seconds.
NO_EVENT (int, default=-1337) – ID of the NO_EVENT placeholder, i.e., the event returned in a context position when no event is available. This happens when a preceding event falls outside the timeout, or when an event does not have enough preceding events to fill the context.
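
To make the interaction between length, timeout and NO_EVENT concrete, the following is a minimal sketch of context construction for a single machine. It is not the library's implementation: the helper name build_context is hypothetical, and it assumes that timestamps are numeric seconds sorted in ascending order and that each out-of-range context position is replaced individually by NO_EVENT.

```python
NO_EVENT = -1337  # default placeholder ID from the parameter list above


def build_context(events, timestamps, length, timeout, no_event=NO_EVENT):
    """Return one context window of `length` events for each event.

    Hypothetical helper for illustration only. Assumes `events` and
    `timestamps` belong to one machine and are sorted by time.
    """
    contexts = []
    for i, now in enumerate(timestamps):
        window = []
        # Visit the `length` positions preceding event i, oldest first.
        for j in range(i - length, i):
            if j < 0:
                # Not enough preceding events to fill this position.
                window.append(no_event)
            elif now - timestamps[j] > timeout:
                # The preceding event is older than the timeout allows.
                window.append(no_event)
            else:
                window.append(events[j])
        contexts.append(window)
    return contexts


events = [10, 11, 12, 13]
timestamps = [0.0, 5.0, 100.0, 101.0]
contexts = build_context(events, timestamps, length=2, timeout=60.0)
```

In this sketch the first event has no predecessors, so its whole context is NO_EVENT, and the third event loses both predecessors to the 60-second timeout.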

Formats

We currently support the following formats:

.csv files containing a header row that specifies the columns ‘timestamp’, ‘event’ and ‘machine’.
.txt files containing a line for each machine and a sequence of events (integers) separated by spaces.

Transforming .csv files into sequences is the quickest method and is done by the following method call:

Preprocessor.csv(path, nrows=None, labels=None, verbose=False)

Preprocess data from csv file.

Note

Format: The first line of the .csv file is assumed to contain the headers, which should include timestamp, machine and event (and optionally label). The remaining lines of the file are interpreted as data.
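
As an illustration of this layout, the sketch below writes a small file with the required headers and checks that none of them is missing before the file would be passed to Preprocessor.csv. The sample values are invented, and the use of Unix timestamps in seconds and of strings as machine identifiers are assumptions; the documentation above does not specify the value types.

```python
import csv
import os
import tempfile

# Invented sample rows: one event per row, with an optional label column.
rows = [
    {"timestamp": 1600000000.0, "machine": "host-a", "event": 3, "label": 0},
    {"timestamp": 1600000004.5, "machine": "host-a", "event": 7, "label": 0},
    {"timestamp": 1600000010.0, "machine": "host-b", "event": 3, "label": 1},
]
fieldnames = ["timestamp", "machine", "event", "label"]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "events.csv")

    # The header row comes first, followed by one line per event.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read the header back and check the three required columns.
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))

missing = {"timestamp", "machine", "event"} - set(header)
```

An empty missing set means the header row satisfies the format described in the note.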

Parameters

path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.

Returns

events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for the events. None if no labels parameter is given and the data does not contain a ‘label’ column.

Transforming .txt files into sequences is slower, but still possible using the following method call:

Preprocessor.text(path, nrows=None, labels=None, verbose=False)

Preprocess data from text file.

Note

Format: Each line in the text file is assumed to contain a space-separated sequence of event IDs for one machine. That is, for n machines, the file contains n lines.
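
The sketch below writes a file in this layout and parses it back, as a quick way to check that a file is well formed before passing it to Preprocessor.text. The sequences are invented sample data, and the parsing shown here is only an illustration, not the library's own reader.

```python
import os
import tempfile

# Invented sample data: one list of integer event IDs per machine.
sequences = [
    [1, 2, 3, 2],
    [4, 4, 1],
]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "events.txt")

    # One line per machine, event IDs separated by single spaces.
    with open(path, "w") as handle:
        for sequence in sequences:
            handle.write(" ".join(str(event) for event in sequence) + "\n")

    # Parse the file back, skipping blank lines.
    with open(path) as handle:
        parsed = [
            [int(token) for token in line.split()]
            for line in handle
            if line.strip()
        ]
```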

Parameters

path (string) – Path to input file from which to read data.
nrows (int, default=None) – If given, limit the number of rows to read to nrows.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.

Returns

events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for the events. None if no labels parameter is given and the data does not contain a ‘label’ column.

Future supported formats

Note

These formats already have an API entry point, but are currently NOT supported.

.json files containing values for ‘timestamp’, ‘event’ and ‘machine’.
.ndjson files where each line contains a JSON object with the keys ‘timestamp’, ‘event’ and ‘machine’.
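
Until these formats are supported, one possible workaround is to convert the data to the supported .csv layout first. The sketch below does this for newline-delimited JSON using only the standard library. The helper name ndjson_to_csv and the sample records are hypothetical, and the snippet assumes each line is a flat JSON object holding the three keys listed above.

```python
import csv
import io
import json

# Invented sample input: one JSON object per line.
ndjson_text = (
    '{"timestamp": 1.0, "machine": "host-a", "event": 3}\n'
    '{"timestamp": 2.5, "machine": "host-a", "event": 7}\n'
    '{"timestamp": 4.0, "machine": "host-b", "event": 3}\n'
)


def ndjson_to_csv(source, target):
    """Copy records from an ndjson stream to a csv stream with a header row.

    Hypothetical helper for illustration only. Returns the number of
    records written; keys other than the three columns are ignored.
    """
    fields = ["timestamp", "machine", "event"]
    writer = csv.DictWriter(target, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in source:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        writer.writerow(json.loads(line))
        count += 1
    return count


output = io.StringIO()
n_rows = ndjson_to_csv(io.StringIO(ndjson_text), output)
csv_lines = output.getvalue().splitlines()
```

Writing the result to a file instead of an in-memory buffer gives a .csv file with the headers described in the Formats section.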

Preprocessor.json(path, labels=None, verbose=False)

Preprocess data from json file.

Note

json preprocessing will become available in a future version.

Parameters

path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.

Returns

events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for the events. None if no labels parameter is given and the data does not contain a ‘label’ value.

Preprocessor.ndjson(path, labels=None, verbose=False)

Preprocess data from ndjson file.

Note

ndjson preprocessing will become available in a future version.

Parameters

path (string) – Path to input file from which to read data.
labels (int or array-like of shape=(n_samples,), optional) – If an int is given, label all sequences with the given int. If an array-like is given, use the given labels for the data in the file. Note: this will overwrite any ‘label’ data in the input file.
verbose (boolean, default=False) – If True, prints progress in transforming input to sequences.

Returns

events (torch.Tensor of shape=(n_samples,)) – Events in data.
context (torch.Tensor of shape=(n_samples, context_length)) – Context events for each event in events.
labels (torch.Tensor of shape=(n_samples,)) – Labels for the events. None if no labels parameter is given and the data does not contain a ‘label’ value.