This vignette gives you the knowledge you need to create your own diseasystore.
To begin, we go through the data model used within the diseasystores. It is this data model that enables the automatic coupling of features and powers the package.
The data created by the diseasystores are so-called "bitemporal" data. This means we have two temporal dimensions: one representing the validity of the record, and one representing the availability of the record.
valid_from and valid_until
The validity dimension indicates when a given data point is "valid", e.g. a hospitalisation is valid between admission and discharge date. This temporal dimension should be familiar to you; it is simply "regular" time.
We encode the validity information into the columns valid_from and valid_until such that a record is valid for any time t which satisfies valid_from <= t < valid_until. For many features, the validity is a single day (such as a test result) and the valid_until column will be the day after valid_from. By convention, we place these columns as the last columns of the table[1].
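To make the validity columns concrete, here is a small, purely illustrative example (the key and feature columns are hypothetical):

# A hospitalisation is valid from admission until discharge, while a single-day
# event (e.g. a positive test) has valid_until set to the day after valid_from
hospitalisations <- tibble::tibble(
  key_pnr     = c("patient_1", "patient_2"),
  n_hospital  = c(1, 1),
  valid_from  = as.Date(c("2020-03-01", "2020-03-10")),
  valid_until = as.Date(c("2020-03-14", "2020-03-11"))
)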
from_ts and until_ts
diseasystore uses {SCDB} in the background to store the computed features. SCDB implements the second temporal dimension, which indicates when a record was present in the data. This information is encoded in the columns from_ts and until_ts. Normally, you don't see these columns when working with diseasystore since they are masked by SCDB. However, if you inspect the tables created in the database by diseasystore, you will find they are present.

For our purposes, it is sufficient to know that these columns give us a time-versioned database where we can extract previous versions through the slice_ts argument. By supplying any time τ as slice_ts, we get the data as they were available on that date. This allows us to continuously update our features while preserving previously computed versions.
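As a sketch of how this looks from the user's side (assuming ds is an already initialised diseasystore, e.g. a DiseasystoreGoogleCovid19 instance; the feature name and argument values here are illustrative assumptions based on the descriptions in this vignette):

# Features as they are available today
ds$get_feature(
  feature    = "n_hospital",
  start_date = as.Date("2020-03-01"),
  end_date   = as.Date("2020-06-01")
)

# The same features as they were available on a past date
ds$get_feature(
  feature    = "n_hospital",
  start_date = as.Date("2020-03-01"),
  end_date   = as.Date("2020-06-01"),
  slice_ts   = "2021-01-01"
)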
A primary feature of diseasystore is its ability to automatically couple and aggregate features. This coupling requires common "key_*" columns between the features. Any feature in a diseasystore must therefore have at least one "key_*" column. By convention, we place these columns as the first columns of the table.
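Conceptually, this coupling is an ordinary join on the shared key columns. The diseasystore performs the coupling for you, but a minimal, hypothetical illustration of the idea looks like this (the column names are made up):

# Two features sharing the key column "key_pnr"
hospitalisations <- tibble::tibble(key_pnr = c("p1", "p2"), n_hospital = c(1, 1))
age_groups       <- tibble::tibble(key_pnr = c("p1", "p2"), age_group  = c("00-29", "30-59"))

# Because both features carry "key_pnr", they can be coupled automatically
dplyr::inner_join(hospitalisations, age_groups, by = "key_pnr")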
Finally, we come to the main data of the diseasystore, namely the features. First, a reminder that "feature" here comes from machine learning and is any individual piece of information.
We subdivide features into two categories: “observables” and “stratifications”. On most levels, these are indistinguishable, but their purposes differ and hence we need to handle them individually.
In diseasystore, any feature whose name either starts with "n_" or ends with "_temperature" is treated as an "observable". From a modelling perspective, these observables are typically the metrics you want to model or take as inputs to inform your model.
Conversely, any other feature is a “stratification” feature. These features are the variables used to subdivide your analysis to match the structure of your model (hence why they are called stratification features).
A prominent example for most disease models would be a stratification feature like “age_group”, since most diseases show a strong dependency on the age of the affected individuals.
While there is no formal requirement for the naming of the observables or stratifications, it is considered best practice to use the same names as other diseasystores for features where possible[2]. This simplifies the process of adapting analyses and disease models to new diseasystores.
To facilitate the automatic coupling and aggregation of features, we use the ?FeatureHandler class. Each feature[3] in the diseasystore has an associated FeatureHandler which implements the computation, retrieval and aggregation of the feature.
The FeatureHandler defines a compute function which must be of the form:
compute = function(start_date, end_date, slice_ts, source_conn)
The arguments start_date and end_date indicate the period for which features should be computed. The diseasystores are dynamically expanded, so feature computation is often restricted to limited time intervals as indicated by start_date and end_date.
As mentioned above, slice_ts specifies what date the features should be computed for. E.g. if slice_ts is the current date, the current features should be computed. Conversely, if slice_ts is some past date, features corresponding to that date should be computed.
Lastly, source_conn is a flexible argument passed to the FeatureHandler indicating where the source data needed to compute the features is stored (e.g. a database connection or directory).
Note that multiple features can be computed by a single FeatureHandler. For example, you may decide that it is more convenient to compute multiple different features simultaneously (e.g. a hospitalisation and the classification of said hospitalisation, or a test and the associated test result).
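Putting the pieces together, a FeatureHandler with a compute function could be sketched as follows. This is a simplified, hypothetical example: the source table and column names are made up, and key_join_sum is assumed to be one of the bundled aggregators (see ?aggregators). Consult the bundled FeatureHandlers for real implementations.

hospital_handler <- diseasystore::FeatureHandler$new(
  compute = function(start_date, end_date, slice_ts, source_conn) {
    # Read the source data (here assumed to be a table in a database behind source_conn)
    dplyr::tbl(source_conn, "admissions") |>
      dplyr::filter(admission_date <= end_date, discharge_date >= start_date) |>
      dplyr::transmute(
        key_pnr     = pnr,              # key column first
        n_hospital  = 1,                # the observable
        valid_from  = admission_date,   # validity columns last
        valid_until = discharge_date
      ) |>
      dplyr::collect()
  },
  key_join = diseasystore::key_join_sum
)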
The FeatureHandler defines a $get() function which must be of the form:
get = function(target_table, slice_ts, target_conn)
Typically, you do not need to specify this function since the default (a variant of SCDB::get_table()) always works. However, if you do need to specify it, the target_table argument will be a DBI::Id specifying the location of the database table where the features are stored. target_conn is the connection to the database. And as above, slice_ts is the time-keeping variable.
The FeatureHandler defines a key_join function which must be of the form:
key_join = function(.data, feature)
In most cases, you should be able to use the bundled key_join_* functions (see ?aggregators for a full list).
In the event that you need to create your own aggregator, the arguments are as follows:
- .data is a grouped data.frame whose groups are those specified by the stratification argument (see Automatic aggregation).
- feature is the name of the feature(s) to aggregate.
Your aggregator should return a dplyr::summarise() call that operates on all columns specified in the feature argument.
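As a minimal sketch, a custom aggregator that sums the requested feature columns within each stratification group could look like this (illustrative only; the bundled aggregators are the canonical reference):

key_join_my_sum <- function(.data, feature) {
  dplyr::summarise(
    .data,
    dplyr::across(
      .cols = dplyr::all_of(feature),   # operate on all requested feature columns
      .fns  = ~ sum(.x, na.rm = TRUE)
    ),
    .groups = "drop"
  )
}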
By now, you should know the basics of creating your own FeatureHandlers. To see some FeatureHandlers in action, you can consult a few of those bundled with the diseasystore package (for example, those used by DiseasystoreGoogleCovid19).
The anatomy of a diseasystore
With the knowledge of how to build custom FeatureHandlers, we turn our attention to the remaining parts of the diseasystore's anatomy.
The diseasystores are R6 classes, which is an implementation of object-oriented (OO) programming. For those unfamiliar with OO programming, the diseasystores are single "objects" with a number of "public" and "private" functions and variables. The public functions and variables are visible to the user of the diseasystore, while the private functions and variables are visible only to us (the developers).
When extending diseasystore, we are only writing private functions and variables. The public functions and variables are handled elsewhere[4].
The ds_map field of the diseasystore tells the diseasystore which FeatureHandler is responsible for each feature, thus allowing the diseasystore to retrieve the features specified in the observable and stratification arguments of calls to $get_feature(). In other words, it maps the names of features to their corresponding FeatureHandlers.
As we saw above, a FeatureHandler may compute more than a single feature. Each feature should be mapped to its FeatureHandler here, or else the diseasystore will not be able to automatically interact with it.
By convention, the name of the FeatureHandler should be snake_case and contain a diseasystore-specific prefix (e.g. for DiseasystoreGoogleCovid19, all FeatureHandlers are named with the prefix "google_covid_19_"). These names are used as the table names when storing the features in the database, and the prefix helps structure the database accordingly. This latter part becomes important when clean-up of the database needs to be performed.
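As an illustration, the ds_map could look like this inside the R6 class definition (the feature and FeatureHandler names below are made up, but follow the naming convention above):

ds_map = list(
  # observable mapped to its FeatureHandler (and database table) name
  "n_hospital" = "google_covid_19_hospital",
  # stratification features mapped to theirs
  "age_group"  = "google_covid_19_age_group",
  "country_id" = "google_covid_19_index"
)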
The diseasystores are made to be as flexible as possible, which means they can incorporate both individual-level data and semi-aggregated data. For semi-aggregated data, it is often the case that the data includes aggregations at different levels, nested within the data.
For example, the Google COVID-19 data repository contains information at both the country level and the region level in the same data files. When the user of DiseasystoreGoogleCovid19 asks to get a feature stratified by, for example, "country_id", we need to filter out the data aggregated at the region level.
This is the purpose of $key_join_filter(). It takes as input the requested stratifications and filters the data accordingly after the features have been joined inside the diseasystore.
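A heavily simplified sketch of such a filter is shown here; the argument names and filtering logic are assumptions for illustration only, and the real implementation is referenced below.

key_join_filter = function(.data, stratification_features) {
  # If the user stratifies by country (and not by region), keep only the
  # country-level records, i.e. those without a region-level key (hypothetical columns)
  if ("country_id" %in% stratification_features && !("region_id" %in% stratification_features)) {
    dplyr::filter(.data, is.na(key_region_id))
  } else {
    .data
  }
}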
For an example, you can consult DiseasystoreGoogleCovid19's key_join_filter().
Testing the diseasystore
The diseasystore package includes the function test_diseasystore() to test the diseasystores.
You can see the testing suite in action, with DiseasystoreGoogleCovid19 as an example, here.
[1] The SCDB package places checksum, from_ts, and until_ts as the last columns. But valid_from and valid_until should be the last columns in the output passed to SCDB.
[2] In practice, this means that the names of features should be in snake_case.
[3] Or a "coupled" set of features, as we will soon see.
[4] By the DiseasystoreBase class.