audit-log-lm

class model.vocab.EHRAuditLogitsProcessor(vocab)

Parameters: vocab (EHRVocab) – EHRVocab instance to use for processing.

class model.vocab.EHRAuditTokenizer(vocab, timestamp_spaces_cal=None, user_col='PAT_ID', user_max=128, timestamp_col='ACCESS_TIME', timestamp_sort_cols=['ACCESS_TIME', 'ACCESS_INSTANT'], event_type_cols=['METRIC_NAME'], max_length=1024, pat_ids_cat=False)

Tokenizer for an EHR audit log sequence using a vocab.

Parameters:

vocab (EHRVocab) – EHRVocab instance to use for tokenization.
timestamp_spaces_cal (List[float]) – List of timestamp spaces to use for quantization.
user_col (str) – Name of the user ID column.
user_max (int) – Maximum number of user IDs.
timestamp_col (str) – Name of the timestamp column.
timestamp_sort_cols (List[str]) – Columns to sort by before tokenization.
event_type_cols (List[str]) – Columns that describe the event type.
max_length (int) – Maximum length of the input sequence (i.e. the context length of the model).
pat_ids_cat (bool) – Whether patient IDs should be treated categorically/in the order they appear.
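
The quantization that timestamp_spaces_cal refers to can be pictured with a minimal sketch. It assumes, without confirmation from this page, that the list holds sorted bin edges in seconds and that each time delta between events maps to the index of the bin it falls into. The names quantize_delta and bin_edges are hypothetical and are not part of model.vocab.

```python
from bisect import bisect_right
from typing import List


def quantize_delta(delta_seconds: float, bin_edges: List[float]) -> int:
    """Return the index of the bin that contains delta_seconds.

    bin_edges is a hypothetical stand-in for timestamp_spaces_cal: a sorted
    list of bin edges in seconds. Values at or past the last edge fall into
    a final overflow bin.
    """
    return bisect_right(bin_edges, delta_seconds)


# Hypothetical edges: 1 s, 10 s, 1 min, 1 h.
edges = [1.0, 10.0, 60.0, 3600.0]
deltas = [0.2, 5.0, 45.0, 7200.0]
bins = [quantize_delta(d, edges) for d in deltas]
print(bins)
```

With four edges there are five possible bins (0 through 4), so a quantized time delta can be represented by a small, fixed set of tokens.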

batch_decode(token_ids, output_type=None)

Decodes a list of lists of token IDs into a list of lists of field-value pairs and converts the result to the desired output type.

Parameters:

token_ids (List[List[int]]) – List of lists of token IDs to decode.
output_type (type) – Output type to convert to. If None, defaults to a pandas DataFrame.

Returns: A list of lists of field-value pairs.

decode(token_ids, output_type=None)

Decodes a list of token IDs into a list of field-value pairs and converts the result to the desired output type.

Parameters:

token_ids (List[int]) – List of token IDs to decode.
output_type (type) – Output type to convert to. If None, defaults to a pandas DataFrame.

Returns: A list of field-value pairs.

encode(df)

Encodes an audit log DataFrame into a tokenized sequence.

Parameters: df (DataFrame) – Audit log DataFrame to encode.
Returns:

class model.vocab.EHRVocab(categorical_column_opts=None, vocab_path=None)

Vocabulary for the EHR audit log dataset, styled after Padhi et al. (2021).

There are a few components to the vocab:

* Field tokens: Mapping from field, to value, to token.
* Field IDs: Mapping from field to list of token IDs.
* Global tokens: Mapping from all token IDs to field, value, and field ID.

Parameters:

categorical_column_opts (Dict[str, List[str]]) – Mapping of categorical column names to the list of possible values.
max_len – Maximum length of the input sequence.
vocab_path – Where to save/load the vocab.
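
The three components described above can be pictured with plain dictionaries. This is a minimal sketch and not the EHRVocab implementation: build_vocab is a hypothetical helper, the column values are made up, and it assumes that the "field ID" stored in the global mapping is the token's position within its own field.

```python
from typing import Dict, List, Tuple


def build_vocab(columns: Dict[str, List[str]]):
    """Build the three mappings from a column -> possible-values dict."""
    field_tokens: Dict[str, Dict[str, int]] = {}          # field -> value -> token ID
    field_ids: Dict[str, List[int]] = {}                  # field -> list of token IDs
    global_tokens: Dict[int, Tuple[str, str, int]] = {}   # token ID -> (field, value, field ID)
    next_id = 0
    for field, values in columns.items():
        field_tokens[field] = {}
        field_ids[field] = []
        for local_id, value in enumerate(values):
            field_tokens[field][value] = next_id
            field_ids[field].append(next_id)
            # Assumption: the field ID is the index of the value within its field.
            global_tokens[next_id] = (field, value, local_id)
            next_id += 1
    return field_tokens, field_ids, global_tokens


# Hypothetical categorical options, shaped like categorical_column_opts.
columns = {
    "METRIC_NAME": ["Login", "Report viewed"],
    "PAT_ID": ["0", "1"],
}
field_tokens, field_ids, global_tokens = build_vocab(columns)
print(field_tokens["PAT_ID"]["1"])
print(global_tokens[3])
```

Because each field's tokens occupy a contiguous block of IDs, a lookup in either direction is a single dictionary access.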

field_names(include_special=False)

Returns the field names.

Parameters: include_special – Whether to include special tokens.

field_to_token(field, value)

Converts a field-value pair to a token.

Parameters:

field – The field name.
value – The value of the field.

Returns:

global_to_token(global_id)

Converts a global ID to a field, value, and field ID.

Parameters: global_id – Global ID to convert.
Returns: A tuple of field, value, and field ID.

globals_to_locals(global_ids)

Slow version of globals_to_locals_torch.

Parameters: global_ids (Tensor) – Global IDs to convert.
Returns: Local IDs.

globals_to_locals_torch(global_ids, field_start)

Converts global IDs to local IDs, clamping below only.

The basic idea is that each sub-vocab sits at a fixed location in the global vocab, so local ID = global ID - offset. If this assumption does not hold, the conversion breaks.

Parameters:

global_ids (Tensor) – Global IDs to convert.
field_start (int) – Field start.

Returns: Local IDs.
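
The offset arithmetic can be sketched in plain Python. This is an illustration under stated assumptions and not the library's code: it uses lists instead of tensors, and it assumes the lower clamp is 0. On tensors, the same idea could be written as torch.clamp(global_ids - field_start, min=0).

```python
from typing import List


def globals_to_locals_sketch(global_ids: List[int], field_start: int) -> List[int]:
    """Subtract the field's starting offset and clamp from below at 0.

    There is no upper clamp, so IDs past the end of the field pass through
    unchanged apart from the offset.
    """
    return [max(g - field_start, 0) for g in global_ids]


# Hypothetical field whose tokens start at global ID 10.
local_ids = globals_to_locals_sketch([3, 12, 15], field_start=10)
print(local_ids)
```

The first ID lies before the field's start, so the subtraction goes negative and is clamped to 0. The other two are shifted by the offset only.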

save()

Saves the vocab to the vocab path.

Parameters: vocab_path – Path to save the vocab.
