- class model.data.EHRAuditDataset(root_dir, session_sep_min=4, shift_sep_min=300, user_col='PAT_ID', user_max=128, timestamp_col='ACCESS_TIME', timestamp_sort_cols=['ACCESS_TIME', 'ACCESS_INSTANT'], event_type_cols=['METRIC_NAME'], log_name=None, vocab=None, timestamp_spaces=None, should_tokenize=True, cache=None, max_length=1024)
Dataset for Epic EHR audit log data.
Assumes that the data is associated with a single physician. User IDs are replaced by the order in which they are first encountered within a given shift. Shifts are separated by a gap in time set by a hyperparameter (shift_sep_min), and each shift starts with 0 entropy. Shifts and sessions are delineated by the same gap-based process, each with its own threshold. Time deltas are calculated relative to the preceding event.
- Parameters:
root_dir (str) – Directory where the log file is located for a given provider.
session_sep_min (int) – Minimum separation in minutes between sessions.
shift_sep_min (int) – Minimum separation in minutes between shifts.
user_col (str) – Column name for the user ID.
user_max (int) – Maximum number of users for tokenization.
timestamp_col (str) – Column name for the timestamp.
timestamp_sort_cols (List[str]) – Columns to sort by for timestamps, ordered most to least important.
event_type_cols (List[str]) – Columns that represent the event type in the audit logs.
log_name (str) – Name of the log file.
timestamp_spaces (List[float]) – List of parameters for the timestamp space calculation.
should_tokenize (bool) – Whether to tokenize the data immediately upon loading. Should be separated shifts if not.
cache (str) – Name of the cache directory to save tokenized sequences.
max_length (int) – Maximum length of the tokenized sequences. Should generally be same as model context size.
- load()
Load the dataset from either a log file or a cache.
- load_from_cache(length=True, seqs=False)
Load the dataset from a cached file. Deliberately only load parts as needed.
- load_from_log()
Load the dataset from a log file.
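The preprocessing described for EHRAuditDataset (gap-based shift and session splitting, time deltas relative to the preceding event, and re-indexing of user IDs by order of appearance) can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the helper names split_on_gaps, time_deltas and reindex_users are hypothetical, and treating a gap equal to the threshold as a split is an assumption.

```python
from datetime import datetime, timedelta


def split_on_gaps(timestamps, gap_min):
    """Group sorted timestamps into runs.

    Assumption: a new run starts when the gap from the preceding
    event is at least `gap_min` minutes.
    """
    groups = []
    for ts in timestamps:
        if groups and ts - groups[-1][-1] < timedelta(minutes=gap_min):
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


def time_deltas(timestamps):
    """Seconds since the preceding event; the first event gets 0."""
    if not timestamps:
        return []
    pairs = zip(timestamps, timestamps[1:])
    return [0.0] + [(b - a).total_seconds() for a, b in pairs]


def reindex_users(user_ids):
    """Replace each ID with the order in which it first appears."""
    first_seen = {}
    return [first_seen.setdefault(u, len(first_seen)) for u in user_ids]


# Hypothetical events, given as minute offsets from 08:00.
start = datetime(2024, 1, 1, 8, 0)
events = [start + timedelta(minutes=m) for m in (0, 1, 3, 10, 11, 400)]

shifts = split_on_gaps(events, gap_min=300)      # mirrors shift_sep_min=300
sessions = split_on_gaps(shifts[0], gap_min=4)   # mirrors session_sep_min=4
deltas = time_deltas(events[:3])
users = reindex_users(["B", "A", "B", "C"])
```

Here the same helper is applied twice with different thresholds, which mirrors the statement that shifts and sessions are delineated by the same process.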
- model.data.timestamp_space_calculation(timestamp_spaces)
Calculate the timestamp space based on the given parameters, i.e. function, start value, end value, and number of bins.
First item in the list is the numpy function to use (i.e. linspace, logspace). The remaining three arguments will be converted to two floats and an int. If np.linspace is used, the values will be passed directly to np.linspace. If np.logspace is used, the first value will be clipped to 0 if it is less than 1. The start and end values will be converted to log10 values and then passed to np.logspace.
- Parameters:
timestamp_spaces (List) – List of parameters for the timestamp space calculation.
- Returns:
The calculated timestamp space.
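One possible reading of the behaviour described for timestamp_space_calculation is sketched below. It is an illustration, not the library's code: the function name timestamp_space_sketch is hypothetical, the first list item is assumed to be the function name as a string, and "clipped to 0" is interpreted as using a start exponent of 0 whenever the start value is below 1.

```python
import numpy as np


def timestamp_space_sketch(timestamp_spaces):
    """Build bin edges from [function name, start, end, number of bins]."""
    func_name, start, stop, num = timestamp_spaces
    # The three remaining items become two floats and an int.
    start, stop, num = float(start), float(stop), int(float(num))

    if func_name == "linspace":
        # Values are passed through unchanged.
        return np.linspace(start, stop, num)

    if func_name == "logspace":
        # Assumption: a start value below 1 maps to exponent 0,
        # which also avoids taking log10 of 0.
        start_exp = 0.0 if start < 1 else np.log10(start)
        return np.logspace(start_exp, np.log10(stop), num)

    raise ValueError(f"Unsupported spacing function: {func_name!r}")


linear_bins = timestamp_space_sketch(["linspace", 0, 10, 5])
log_bins = timestamp_space_sketch(["logspace", 1, 1000, 4])
clipped_bins = timestamp_space_sketch(["logspace", 0, 100, 3])
```

Under these assumptions, a logspace request starting at 0 begins at 1 rather than failing on log10(0).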