- class model.vocab.EHRAuditLogitsProcessor(vocab)
- Parameters:
vocab (EHRVocab) – EHRVocab instance to use for processing.
- class model.vocab.EHRAuditTokenizer(vocab, timestamp_spaces_cal=None, user_col='PAT_ID', user_max=128, timestamp_col='ACCESS_TIME', timestamp_sort_cols=['ACCESS_TIME', 'ACCESS_INSTANT'], event_type_cols=['METRIC_NAME'], max_length=1024, pat_ids_cat=False)
Tokenizer for an EHR audit log sequence using a vocab.
- Parameters:
vocab (EHRVocab) – EHRVocab instance to use for tokenization.
timestamp_spaces_cal (List[float]) – List of timestamp spaces to use for quantization.
user_col (str) – Name of the user ID column.
user_max (int) – Maximum number of user IDs.
timestamp_col (str) – Name of the timestamp column.
timestamp_sort_cols (List[str]) – Columns to sort by before tokenization.
event_type_cols (List[str]) – Columns that describe the event type.
max_length (int) – Maximum length of the input sequence (i.e. the context length of the model).
pat_ids_cat (bool) – Whether patient IDs should be treated categorically/in the order they appear.
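To illustrate how a list such as timestamp_spaces_cal could drive quantization, the sketch below maps a time delta in seconds to a bin index. The function name quantize_delta, the example edges, and the reading of the list as ascending bin edges are all assumptions made for illustration; this is not the tokenizer's actual implementation.

```python
from bisect import bisect_right
from typing import List


def quantize_delta(delta_seconds: float, bin_edges: List[float]) -> int:
    """Return the index of the bin that delta_seconds falls into.

    bin_edges is assumed to be sorted ascending. Values below the first
    edge map to bin 0; values at or above the last edge map to
    len(bin_edges).
    """
    return bisect_right(bin_edges, delta_seconds)


# Hypothetical edges in seconds (not taken from the library).
edges = [1.0, 10.0, 60.0, 600.0]
bins = [quantize_delta(d, edges) for d in (0.2, 5.0, 60.0, 4000.0)]
print(bins)  # [0, 1, 3, 4]
```

Each bin index could then be looked up as a value in a timestamp field of the vocab, which keeps the number of timestamp tokens fixed at len(bin_edges) + 1.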
- batch_decode(token_ids, output_type=None)
Decodes a list of lists of token IDs into a list of lists of field-value pairs and converts them to the desired output type.
- Parameters:
token_ids (List[List[int]]) – List of lists of token IDs to decode.
output_type (type) – Output type to convert to. If None, defaults to a Pandas DataFrame.
- Returns:
A list of lists of field-value pairs.
- decode(token_ids, output_type=None)
Decodes a list of token IDs into a list of field-value pairs and converts them to the desired output type.
- Parameters:
token_ids (List[int]) – List of token IDs to decode.
output_type (type) – Output type to convert to. If None, defaults to a Pandas DataFrame.
- Returns:
A list of field-value pairs.
- encode(df)
Encodes an audit log DataFrame into a tokenized sequence.
- Parameters:
df (DataFrame)
- Returns:
- class model.vocab.EHRVocab(categorical_column_opts=None, vocab_path=None)
Vocabulary for the EHR audit log dataset, styled after Padhi et al. (2021).
There are a few components to the vocab:
* Field tokens: Mapping from field, to value, to token.
* Field IDs: Mapping from field to list of token IDs.
* Global tokens: Mapping from all token IDs to field, value, and field ID.
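The three mappings can be pictured with a small sketch. It assumes that global token IDs are assigned consecutively, field by field, and that the "field ID" stored in the global mapping is the token's position within its field; the helper build_toy_vocab and the example column values are hypothetical and do not reproduce EHRVocab's internals.

```python
from typing import Dict, List, Tuple


def build_toy_vocab(column_opts: Dict[str, List[str]]):
    """Build toy versions of the three mappings for a set of categorical columns."""
    field_tokens: Dict[str, Dict[str, int]] = {}          # field -> value -> global token ID
    field_ids: Dict[str, List[int]] = {}                  # field -> list of global token IDs
    global_tokens: Dict[int, Tuple[str, str, int]] = {}   # global ID -> (field, value, field ID)

    next_id = 0
    for field, values in column_opts.items():
        field_tokens[field] = {}
        field_ids[field] = []
        for local_id, value in enumerate(values):
            field_tokens[field][value] = next_id
            field_ids[field].append(next_id)
            # Assumption: the field ID is the position within the field.
            global_tokens[next_id] = (field, value, local_id)
            next_id += 1
    return field_tokens, field_ids, global_tokens


# Hypothetical column options, in the shape of categorical_column_opts.
opts = {"METRIC_NAME": ["login", "open_chart"], "PAT_ID": ["0", "1", "2"]}
field_tokens, field_ids, global_tokens = build_toy_vocab(opts)
print(field_tokens["PAT_ID"]["1"])  # 3
print(field_ids["PAT_ID"])          # [2, 3, 4]
print(global_tokens[3])             # ('PAT_ID', '1', 1)
```

Under this layout each field occupies one contiguous block of global IDs, so the forward lookup (field and value to token) and the reverse lookup (global ID to field and value) are both plain dictionary accesses.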
- Parameters:
categorical_column_opts (Dict[str, List[str]]) – Mapping of categorical column names to the list of possible values.
max_len – Maximum length of the input sequence.
vocab_path – Where to save/load the vocab.
- field_names(include_special=False)
Returns the field names.
- Parameters:
include_special – Whether to include special tokens.
- field_to_token(field, value)
Converts a field-value pair to a token.
- Parameters:
field – The field name.
value
- Returns:
- global_to_token(global_id)
Converts a global ID to a field, value, and field ID.
- Parameters:
global_id – Global ID to convert.
- Returns:
Returns a tuple of field, value, and field ID.
- globals_to_locals(global_ids)
Slow version of globals_to_locals_torch.
- Parameters:
global_ids (Tensor) – Global IDs to convert.
- Returns:
Local IDs.
- globals_to_locals_torch(global_ids, field_start)
Converts global IDs to local IDs, clamping below only.
The basic idea is that each sub-vocab always occupies a fixed location, so local IDs = global IDs - offset. If this assumption does not hold, the conversion breaks.
- Parameters:
global_ids (Tensor) – Global IDs to convert.
field_start (int) – Field start.
- Returns:
Local IDs.
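The offset arithmetic can be sketched in plain Python. The helper to_local_ids is hypothetical, and the clamp floor of 0 is an assumption about what "clamping below only" means; the library method operates on tensors rather than lists.

```python
from typing import List


def to_local_ids(global_ids: List[int], field_start: int) -> List[int]:
    """Subtract the field's starting offset and clamp negative results to 0.

    There is no upper clamp, so IDs past the end of the field pass through.
    """
    return [max(g - field_start, 0) for g in global_ids]


# A field whose sub-vocab starts at global ID 2 (hypothetical layout).
local_ids = to_local_ids([2, 3, 4, 0], field_start=2)
print(local_ids)  # [0, 1, 2, 0]
```

With PyTorch tensors the same idea could be written as torch.clamp(global_ids - field_start, min=0), which applies the subtraction and the lower clamp elementwise.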
- save()
Saves the vocab to the vocab path.
- Parameters:
vocab_path – Path to save the vocab.