class model.vocab.EHRAuditLogitsProcessor(vocab)
Parameters:

vocab (EHRVocab) – model.vocab.EHRVocab instance to use for processing.

class model.vocab.EHRAuditTokenizer(vocab, timestamp_spaces_cal=None, user_col='PAT_ID', user_max=128, timestamp_col='ACCESS_TIME', timestamp_sort_cols=['ACCESS_TIME', 'ACCESS_INSTANT'], event_type_cols=['METRIC_NAME'], max_length=1024, pat_ids_cat=False)

Tokenizer for an EHR audit log sequence using a vocab.

Parameters:
  • vocab (EHRVocab) – model.vocab.EHRVocab instance to use for tokenization.

  • timestamp_spaces_cal (List[float]) – List of timestamp spaces to use for quantization.

  • user_col (str) – Name of the user ID column.

  • user_max (int) – Maximum number of user IDs.

  • timestamp_col (str) – Name of the timestamp column.

  • timestamp_sort_cols (List[str]) – Columns to sort by before tokenization.

  • event_type_cols (List[str]) – Columns that describe the event type.

  • max_length (int) – Maximum length of the input sequence (i.e. the context length of the model).

  • pat_ids_cat (bool) – Whether patient IDs should be treated categorically/in the order they appear.
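
The quantization mentioned for timestamp_spaces_cal is not specified further here. As a rough illustration only, one common way to quantize a continuous time value is to bucket it against a sorted list of boundaries. The function name, the bin edges, and the idea that the values are time deltas in seconds are all assumptions made for this sketch, not the tokenizer's documented behavior.

```python
from bisect import bisect_right


def quantize_delta(delta_seconds, bin_edges):
    """Return the index of the bin that delta_seconds falls into.

    bin_edges is a sorted list of boundaries (a hypothetical stand-in
    for timestamp_spaces_cal). Values below the first edge map to bin 0;
    values at or above the last edge map to len(bin_edges).
    """
    return bisect_right(bin_edges, delta_seconds)


# Hypothetical boundaries: 1 s, 10 s, 1 min, 10 min.
edges = [1.0, 10.0, 60.0, 600.0]
bins = [quantize_delta(d, edges) for d in [0.2, 5.0, 60.0, 5000.0]]
```

Each bin index can then be treated like any other categorical value when building a token sequence.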

batch_decode(token_ids, output_type=None)

Decodes a list of lists of token IDs into a list of lists of field-value pairs and converts the result to the desired output type.

Parameters:
  • token_ids (List[List[int]]) – List of lists of token IDs to decode.

  • output_type (type) – Output type to convert to. If None, defaults to a pandas DataFrame.

Returns:

A list of lists of field-value pairs.

decode(token_ids, output_type=None)

Decodes a list of token IDs into a list of field-value pairs and converts the result to the desired output type.

Parameters:
  • token_ids (List[int]) – List of token IDs to decode.

  • output_type (type) – Output type to convert to. If None, defaults to a pandas DataFrame.

Returns:

A list of field-value pairs.

encode(df)

Encodes an audit log DataFrame into a tokenized sequence.

Parameters:

df (DataFrame)

Returns:

class model.vocab.EHRVocab(categorical_column_opts=None, vocab_path=None)

Vocabulary for the EHR audit log dataset, styled after Padhi et al. (2021).

The vocab has three components:
  • Field tokens: Mapping from field, to value, to token.

  • Field IDs: Mapping from field to list of token IDs.

  • Global tokens: Mapping from all token IDs to field, value, and field ID.
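
The three mappings can be pictured with plain dictionaries. This is a minimal sketch, not the class's actual internals: the column names and values are invented for illustration, and treating the "field ID" as a token's position within its own field is an assumption.

```python
# Hypothetical columns and values, for illustration only.
columns = {
    "METRIC_NAME": ["login", "view_chart", "logout"],
    "PAT_ID": ["0", "1"],
}

field_tokens = {}   # field -> value -> global token ID
field_ids = {}      # field -> list of global token IDs
global_tokens = {}  # global token ID -> (field, value, field-local ID)

next_id = 0
for field, values in columns.items():
    field_tokens[field] = {}
    field_ids[field] = []
    for local_id, value in enumerate(values):
        # Each field gets a contiguous block of global IDs.
        field_tokens[field][value] = next_id
        field_ids[field].append(next_id)
        global_tokens[next_id] = (field, value, local_id)
        next_id += 1
```

With this layout, looking up a field-value pair and mapping a global ID back to its field are both single dictionary reads.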

Parameters:
  • categorical_column_opts (Dict[str, List[str]]) – Mapping of categorical column names to the list of possible values.

  • max_len – Maximum length of the input sequence.

  • vocab_path – Where to save/load the vocab.

field_names(include_special=False)

Returns the field names.

Parameters:

include_special – Whether to include special tokens.

field_to_token(field, value)

Converts a field-value pair to a token.

Parameters:
  • field – The field name.

  • value

Returns:

global_to_token(global_id)

Converts a global ID to a field, value, and field ID.

Parameters:

global_id – Global ID to convert.

Returns:

Returns a tuple of field, value, and field ID.

globals_to_locals(global_ids)

Slow version of globals_to_locals_torch.

Parameters:

global_ids (Tensor) – Global IDs to convert.

Returns:

Local IDs.

globals_to_locals_torch(global_ids, field_start)

Converts global IDs to local IDs, clamping below only.

Each sub-vocab occupies a fixed location in the global vocabulary, so a local ID is the global ID minus the field's offset. The conversion breaks if that assumption does not hold.

Parameters:
  • global_ids (Tensor) – Global IDs to convert.

  • field_start (int) – Field start.

Returns:

Local IDs.
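
The offset arithmetic can be illustrated without tensors. The sketch below uses plain Python lists and a hypothetical helper name, and it assumes that "clamping below only" means negative results are raised to zero while large values are left untouched.

```python
def to_local_ids(global_ids, field_start):
    """Subtract a fixed field offset and clamp the result at zero from below."""
    return [max(g - field_start, 0) for g in global_ids]


# Hypothetical field whose sub-vocab starts at global ID 3.
local_ids = to_local_ids([3, 4, 7, 1], field_start=3)
```

Because there is no upper clamp, a global ID that belongs to a later field yields a local ID beyond this field's range, so callers would need to handle that case themselves.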

save()

Saves the vocab to the vocab path.

Parameters:

vocab_path – Path to save the vocab.