1.1 Introduction

Concept sets play an important role when working with data in the OMOP CDM format. They can be used to create cohorts, after which, as we've seen in the previous vignette, we can identify intersections between the cohorts. PatientProfiles adds another option for working with concept sets: using them to add associated variables directly, without first having to create a cohort.

It is important to note, as explained in more detail below, that results may differ when generating a cohort and then identifying intersections between two cohorts compared to working directly with concept sets. The creation of cohorts involves collapsing overlapping records as well as imposing certain requirements, such as only including records that were observed during an individual's observation period. When adding variables based on concept sets, we work directly with record-level data in the OMOP CDM clinical tables.

1.2 Adding variables from concept sets

For this vignette we'll use the Eunomia synthetic dataset. First let's create our cohort of interest: individuals with an ankle sprain.

library(CDMConnector)
library(CodelistGenerator)
library(PatientProfiles)
library(dplyr)
library(ggplot2)

con <- DBI::dbConnect(duckdb::duckdb(),
  dbdir = CDMConnector::eunomia_dir()
)
cdm <- CDMConnector::cdm_from_con(con,
  cdm_schema = "main",
  write_schema = "main"
)

cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "ankle_sprain",
  conceptSet = list("ankle_sprain" = 81151),
  end = "event_end_date",
  limit = "all",
  overwrite = TRUE
)
#> Warning: ! 3 casted column in ankle_sprain (cohort_attrition) as do not match expected
#>   column type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in ankle_sprain (cohort_codelist) as do not match expected
#>   column type:
#> • `concept_id` from numeric to integer

cdm$ankle_sprain
#> # Source:   table<main.ankle_sprain> [?? x 4]
#> # Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <int> <date>            <date>         
#>  1                    1        859 1969-01-28        1969-02-25     
#>  2                    1        865 1993-04-11        1993-04-25     
#>  3                    1       2588 1987-01-26        1987-02-16     
#>  4                    1       2666 1971-11-21        1971-12-12     
#>  5                    1       2760 1961-01-06        1961-02-03     
#>  6                    1       3320 1978-02-07        1978-02-21     
#>  7                    1       3420 1975-09-02        1975-09-23     
#>  8                    1       4677 1966-03-01        1966-03-29     
#>  9                    1        236 2000-06-16        2000-07-07     
#> 10                    1        518 1995-10-17        1995-10-31     
#> # ℹ more rows

Now let's say we're interested in summarising use of acetaminophen among our ankle sprain cohort. We can start by identifying the relevant concepts.

acetaminophen_cs <- getDrugIngredientCodes(
  cdm = cdm,
  name = c("acetaminophen")
)

acetaminophen_cs
#> 
#> ── 1 codelist ──────────────────────────────────────────────────────────────
#> 
#> - 161_acetaminophen (7 codes)

Once we have our codes for acetaminophen we can create variables based on these. As with cohort intersections, PatientProfiles provides four types of functions for concept intersections.

First, we can add a binary flag variable indicating whether an individual had a record of acetaminophen on the day of their ankle sprain or up to 30 days afterwards.

cdm$ankle_sprain %>%
  addConceptIntersectFlag(
    conceptSet = acetaminophen_cs,
    indexDate = "cohort_start_date",
    window = c(0, 30)
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                  <int> 865, 236, 867, 1039, 4366, 669, 2273, 2530…
#> $ cohort_start_date           <date> 1993-04-11, 2000-06-16, 1981-09-14, 1977-…
#> $ cohort_end_date             <date> 1993-04-25, 2000-07-07, 1981-09-28, 1977-…
#> $ `161_acetaminophen_0_to_30` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Second, we can count the number of records of acetaminophen in this same window for each individual.

cdm$ankle_sprain %>%
  addConceptIntersectCount(
    conceptSet = acetaminophen_cs,
    indexDate = "cohort_start_date",
    window = c(0, 30)
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                  <int> 865, 236, 867, 1039, 4366, 669, 2273, 2530…
#> $ cohort_start_date           <date> 1993-04-11, 2000-06-16, 1981-09-14, 1977-…
#> $ cohort_end_date             <date> 1993-04-25, 2000-07-07, 1981-09-28, 1977-…
#> $ `161_acetaminophen_0_to_30` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Third, we could identify the first start date of acetaminophen in this window.

cdm$ankle_sprain %>%
  addConceptIntersectDate(
    conceptSet = acetaminophen_cs,
    indexDate = "cohort_start_date",
    window = c(0, 30),
    order = "first"
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                  <int> 865, 3320, 236, 867, 1039, 4366, 669, 2273…
#> $ cohort_start_date           <date> 1993-04-11, 1978-02-07, 2000-06-16, 1981-…
#> $ cohort_end_date             <date> 1993-04-25, 1978-02-21, 2000-07-07, 1981-…
#> $ `161_acetaminophen_0_to_30` <date> 1993-04-11, 1978-02-07, 2000-06-16, 1981-…

Or fourth, we can get the number of days from the index date to the first start date of acetaminophen in the window.

cdm$ankle_sprain %>%
  addConceptIntersectDays(
    conceptSet = acetaminophen_cs,
    indexDate = "cohort_start_date",
    window = c(0, 30),
    order = "first"
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                  <int> 865, 3320, 236, 867, 1039, 4366, 669, 2273…
#> $ cohort_start_date           <date> 1993-04-11, 1978-02-07, 2000-06-16, 1981-…
#> $ cohort_end_date             <date> 1993-04-25, 1978-02-21, 2000-07-07, 1981-…
#> $ `161_acetaminophen_0_to_30` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
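
To make the four variable types concrete, the following base R sketch computes each of them by hand for a single person. It is only an illustration under stated assumptions, not how PatientProfiles implements these functions: the dates are hypothetical, the window is treated as inclusive at both ends, and only record start dates are considered.

```r
# Toy data (hypothetical): one index date and three drug record start dates
index_date <- as.Date("2000-06-16")
record_dates <- as.Date(c("2000-06-01", "2000-06-16", "2000-07-10"))

# Days from the index date to each record; the window c(0, 30) keeps 0 to 30
window <- c(0, 30)
days <- as.integer(record_dates - index_date)
in_window <- days >= window[1] & days <= window[2]

# The four kinds of variable, computed by hand
flag <- as.numeric(any(in_window)) # 1 if any record falls in the window
count <- sum(in_window)            # number of records in the window
first_date <- if (any(in_window)) min(record_dates[in_window]) else as.Date(NA)
first_days <- if (any(in_window)) min(days[in_window]) else NA_integer_

print(list(flag = flag, count = count, date = first_date, days = first_days))
```

Here the record 15 days before the index date falls outside the window, so the flag is 1, the count is 2, the first date equals the index date, and the days value is 0.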

1.3 Adding multiple concept based variables

We can add more than one variable at a time when using these functions. For example, we might want to add variables for multiple time windows.

cdm$ankle_sprain %>%
  addConceptIntersectFlag(
    conceptSet = acetaminophen_cs,
    indexDate = "cohort_start_date",
    window = list(
      c(-Inf, -1),
      c(0, 0),
      c(1, Inf)
    )
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 7
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                     <int> 865, 2588, 2666, 2760, 236, 518, 615, 8…
#> $ cohort_start_date              <date> 1993-04-11, 1987-01-26, 1971-11-21, 19…
#> $ cohort_end_date                <date> 1993-04-25, 1987-02-16, 1971-12-12, 19…
#> $ `161_acetaminophen_1_to_inf`   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ `161_acetaminophen_minf_to_m1` <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
#> $ `161_acetaminophen_0_to_0`     <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, …
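
The new column names combine the concept set name with a label for the window, where negative values are prefixed with "m" and infinite bounds are written as "minf" or "inf". The helper below, `window_label()`, is hypothetical and written only to reproduce the labels visible in the output above. It is not the package's internal naming code.

```r
# Hypothetical helper: turn a window into the suffix seen in the output columns
window_label <- function(window) {
  fmt <- function(x) {
    if (is.infinite(x)) {
      return(if (x < 0) "minf" else "inf")
    }
    if (x < 0) {
      return(paste0("m", abs(x))) # negative days are written as m<number>
    }
    as.character(x)
  }
  paste0(fmt(window[1]), "_to_", fmt(window[2]))
}

windows <- list(c(-Inf, -1), c(0, 0), c(1, Inf))
labels <- vapply(windows, window_label, character(1))

# Column names as they would appear for the acetaminophen concept set
print(paste0("161_acetaminophen_", labels))
```

Knowing the pattern makes it easier to refer to the new columns by name in later steps, for example inside `dplyr::select()` or `dplyr::group_by()`.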

Or we might want to get variables for multiple drug ingredients of interest.

meds_cs <- getDrugIngredientCodes(
  cdm = cdm,
  name = c(
    "acetaminophen",
    "amoxicillin",
    "aspirin",
    "heparin",
    "morphine",
    "oxycodone",
    "warfarin"
  )
)

cdm$ankle_sprain %>%
  addConceptIntersectFlag(
    conceptSet = meds_cs,
    indexDate = "cohort_start_date",
    window = list(
      c(-Inf, -1),
      c(0, 0)
    )
  ) %>%
  dplyr::glimpse()
#> Rows: ??
#> Columns: 18
#> Database: DuckDB v1.1.0 [root@Darwin 24.0.0:R 4.4.1//private/var/folders/pl/k11lm9710hlgl02nvzx4z9wr0000gp/T/RtmpREJO98/filedd3e3b0c5924.duckdb]
#> $ cohort_definition_id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ subject_id                     <int> 1941, 3177, 5240, 1598, 2936, 2705, 295…
#> $ cohort_start_date              <date> 2009-04-28, 1924-12-04, 1976-05-23, 19…
#> $ cohort_end_date                <date> 2009-05-19, 1924-12-18, 1976-06-20, 19…
#> $ `7052_morphine_minf_to_m1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `7804_oxycodone_minf_to_m1`    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `5224_heparin_minf_to_m1`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `1191_aspirin_minf_to_m1`      <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
#> $ `723_amoxicillin_minf_to_m1`   <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
#> $ `161_acetaminophen_minf_to_m1` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
#> $ `11289_warfarin_minf_to_m1`    <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
#> $ `161_acetaminophen_0_to_0`     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, …
#> $ `1191_aspirin_0_to_0`          <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, …
#> $ `723_amoxicillin_0_to_0`       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `11289_warfarin_0_to_0`        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `5224_heparin_0_to_0`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `7052_morphine_0_to_0`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ `7804_oxycodone_0_to_0`        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
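
One new column is added for every combination of concept set and window, which is why seven ingredients and two windows give 14 new columns on top of the four cohort columns. The sketch below enumerates those names in base R, copying the concept set names and window labels from the output above. The order of the columns in real output may differ.

```r
# Concept set names and window labels as they appear in the output above
concept_sets <- c(
  "161_acetaminophen", "723_amoxicillin", "1191_aspirin", "5224_heparin",
  "7052_morphine", "7804_oxycodone", "11289_warfarin"
)
window_labels <- c("minf_to_m1", "0_to_0")

# Every concept set paired with every window
new_cols <- as.vector(outer(concept_sets, window_labels, paste, sep = "_"))

n_cohort_cols <- 4 # cohort_definition_id, subject_id, start and end dates
total_cols <- n_cohort_cols + length(new_cols)
print(total_cols)
```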

1.4 Cohort-based versus concept-based intersections

In the previous vignette we saw that we can add an intersection variable using a cohort we have created. In this vignette we have seen that we can instead create an intersection variable using a concept set directly. It is important to note that under some circumstances these two approaches can lead to different results.

When creating a cohort we combine overlapping records, as cohort entries cannot overlap. Thus when adding an intersection count, addCohortIntersectCount() will return a count of cohort entries in the window of interest while addConceptIntersectCount() will return a count of records within the window. We can see the impact for acetaminophen for our example data below, where we have slightly more records than cohort entries.

acetaminophen_cs <- getDrugIngredientCodes(
  cdm = cdm,
  name = c("acetaminophen")
)

cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "acetaminophen",
  conceptSet = acetaminophen_cs,
  end = "event_end_date",
  limit = "all"
)
#> Warning: ! 3 casted column in acetaminophen (cohort_attrition) as do not match expected
#>   column type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer

dplyr::bind_rows(
  cdm$ankle_sprain |>
    addCohortIntersectCount(
      targetCohortTable = "acetaminophen",
      window = c(-Inf, Inf)
    ) |>
    dplyr::group_by(`161_acetaminophen_minf_to_inf`) |>
    dplyr::tally() |>
    dplyr::collect() |>
    dplyr::arrange(desc(`161_acetaminophen_minf_to_inf`)) |>
    dplyr::mutate(type = "cohort"),
  cdm$ankle_sprain |>
    addConceptIntersectCount(
      conceptSet = acetaminophen_cs,
      window = c(-Inf, Inf)
    ) |>
    dplyr::group_by(`161_acetaminophen_minf_to_inf`) |>
    dplyr::tally() |>
    dplyr::collect() |>
    dplyr::arrange(desc(`161_acetaminophen_minf_to_inf`)) |>
    dplyr::mutate(type = "concept_set")
) |>
  ggplot() +
  geom_col(aes(`161_acetaminophen_minf_to_inf`, n, fill = type),
    position = "dodge"
  ) +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )
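
The reason the two counts can differ is easiest to see on a toy example. The base R sketch below uses hypothetical records for one person and a simple merging rule (a record that starts on or before the latest end date seen so far joins the current entry). It is an illustration of the idea only, not the algorithm used by generateConceptCohortSet().

```r
# Hypothetical drug records for one person (start and end dates)
records <- data.frame(
  start = as.Date(c("2001-01-01", "2001-01-05", "2001-03-01")),
  end   = as.Date(c("2001-01-10", "2001-01-20", "2001-03-07"))
)
records <- records[order(records$start), ]

# A record opens a new entry only if it starts after all earlier records ended
latest_end_so_far <- cummax(as.numeric(records$end))
previous_end <- c(-Inf, head(latest_end_so_far, -1))
entry_id <- cumsum(as.numeric(records$start) > previous_end)

# Collapse each group of overlapping records into a single entry
entries <- data.frame(
  start = as.Date(as.numeric(tapply(as.numeric(records$start), entry_id, min)),
    origin = "1970-01-01"
  ),
  end = as.Date(as.numeric(tapply(as.numeric(records$end), entry_id, max)),
    origin = "1970-01-01"
  )
)

n_records <- nrow(records) # what a concept-based count would see
n_entries <- nrow(entries) # what a cohort-based count would see
print(c(records = n_records, entries = n_entries))
```

The first two records overlap and collapse into one entry running from 2001-01-01 to 2001-01-20, so three records become two cohort entries.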

Additional differences between cohort and concept set intersections may also result from cohort table rules. For example, cohort tables will typically omit any records that occur outside an individual's observation time (as defined by their observation period). Such records, however, would not be excluded when adding a concept-based intersection.
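
As a rough illustration of this second difference, the base R sketch below counts hypothetical records for one person with and without the observation period restriction. The dates are invented and the filter is a simplification that looks only at record start dates.

```r
# Hypothetical observation period and record start dates for one person
observation_period <- data.frame(
  start = as.Date("2000-01-01"),
  end   = as.Date("2010-12-31")
)
record_starts <- as.Date(c("1999-06-15", "2003-02-01", "2011-03-20"))

# Keep only records that start within the observation period
in_observation <- record_starts >= observation_period$start &
  record_starts <= observation_period$end

n_concept <- length(record_starts) # concept-based: all records are counted
n_cohort <- sum(in_observation)    # cohort-based: out-of-observation dropped
print(c(concept = n_concept, cohort = n_cohort))
```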