ePCR is an R package intended for the survival analysis of advanced prostate cancer. This document is a basic introduction to the functionality of ePCR and a general overview of the possible analysis workflows for clinical trial or hospital registry cohorts. The approach, named ePCR, builds ensembles of single penalized Cox regression models; it was the top-performing approach in the DREAM 9.5 Prostate Cancer Challenge (Guinney et al, 2017).
The latest version of ePCR is available from the Comprehensive R Archive Network (CRAN). CRAN mirrors are available by default in an R installation, and the ePCR package can be installed with the R terminal command install.packages("ePCR"). This should prompt the user to select a nearby CRAN mirror, after which ePCR and its dependencies are installed automatically. After the install.packages call, the ePCR package can be loaded with the command library("ePCR").
The following notation is used in the document: R commands, package names and function names are written in typewriter font. The notation pckgName::funcName indicates that the function funcName is called from the package pckgName; this form is used prominently in the underlying R code because of package namespaces. This document, as well as other useful PDFs, can be inspected using the browseVignettes function, which works for any package in R.
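As a small illustration of the pckgName::funcName notation, the sketch below uses only base R and the stats package. It shows that a fully qualified call still reaches the package version of a function even when a same-named function masks it, which is the situation reported when a package attaches its own plot. The toy vector and the masking function are made up for this example.

```r
# Toy data (hypothetical values)
x <- c(3, 1, 4, 1, 5)

# Fully qualified call: median() taken explicitly from the 'stats' package
m1 <- stats::median(x)

# A locally defined function with the same name masks the package version
median <- function(v) "masked"
m2 <- median(x)          # calls the local function
m3 <- stats::median(x)   # still calls the version from 'stats'

# Remove the local function so later code sees stats::median again
rm(median)
```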
The ePCR package is provided with two example hospital registry datasets. The original hospital registry cohorts are confidential, so kernel density estimates were fitted to them, and illustrative virtual patients were then generated from these estimates; the virtual patients are provided here as the example datasets. Please see the accompanying ePCR publication for further details on the two Turku University Hospital cohorts (Laajala et al., 2018), and the Synapse site for DREAM 9.5 PCC for accessing the original DREAM data (Guinney, Wang, Laajala et al. 2017). The example datasets can be loaded into an R session using:
library(ePCR)
##
## Attaching package: 'ePCR'
## The following object is masked from 'package:graphics':
##
## plot
## The following object is masked from 'package:base':
##
## plot
# Kernel density simulated patients from Turku University Hospital (TYKS)
# Data consists of TEXT cohort (text-search found patients)
# and MEDI (patients identified using medication and few keywords)
data(TYKSSIMU)
# The following data matrices x and survival responses y become available
head(xTEXTSIMU); head(yTEXTSIMU)
## BMI HEIGHTBL WEIGHTBL ALP ALT AST CA CREAT
## TEXTSIMU1 27.16556 172 83.0 4.852030 3.044522 3.401197 2.305 3.951244
## TEXTSIMU2 27.16556 176 83.0 4.442651 3.258097 3.401197 2.310 4.644391
## TEXTSIMU3 29.35235 168 91.2 4.304065 2.708050 3.401197 2.305 4.394449
## TEXTSIMU4 24.80000 176 83.0 4.442651 2.944439 3.218876 2.330 4.465908
## TEXTSIMU5 27.20000 176 83.0 5.129899 2.944439 3.401197 2.310 3.891820
## TEXTSIMU6 27.16556 176 83.0 4.564348 1.609438 3.401197 2.305 4.204693
## HB LDH NEU PLT PSA TBILI TESTO WBC
## TEXTSIMU1 11.3 5.265247 1.128171 323 3.4657359 2.197225 -0.1743534 2.001480
## TEXTSIMU2 12.6 5.265247 1.329710 216 4.6051702 2.197225 -0.1743534 2.332144
## TEXTSIMU3 13.5 5.265247 2.187174 83 3.8712010 2.197225 -0.1743534 1.856298
## TEXTSIMU4 12.7 5.273000 2.551006 189 0.3364722 3.135494 0.3364722 2.186051
## TEXTSIMU5 12.3 5.265247 1.329710 298 6.6720329 2.197225 -0.1743534 2.041220
## TEXTSIMU6 15.4 5.265247 1.329710 237 3.6505739 2.197225 -0.1743534 1.435085
## CREACL NA. MG PHOS ALB TPRO RBC LYM BUN
## TEXTSIMU1 3.549617 137 -0.210721 0.1397619 34.8 67 4.830 0.3364722 2.475973
## TEXTSIMU2 3.549617 141 -0.210721 0.1397619 34.8 67 4.830 0.3364722 2.475973
## TEXTSIMU3 3.549617 135 -0.210721 0.1397619 29.5 67 4.185 0.3364722 2.397895
## TEXTSIMU4 3.549617 140 -0.210721 0.1397619 34.8 67 3.620 0.3364722 2.475973
## TEXTSIMU5 3.549617 140 -0.210721 0.1397619 34.8 67 4.120 0.3364722 2.475973
## TEXTSIMU6 3.549617 142 -0.210721 0.1397619 34.8 67 3.780 0.3364722 2.475973
## CCRC GLU SYSTOLICBP DIASTOLICBP PULSE HEMAT SPEGRA LYMperLEU
## TEXTSIMU1 3.703478 1.824549 136 76 72 0.43 0 24
## TEXTSIMU2 3.703478 1.840550 142 64 72 0.45 0 22
## TEXTSIMU3 3.703478 1.856298 111 76 72 0.38 0 22
## TEXTSIMU4 3.703478 1.856298 128 76 72 0.38 0 22
## TEXTSIMU5 3.703478 1.757858 142 76 69 0.38 0 22
## TEXTSIMU6 3.703478 1.856298 151 76 72 0.38 0 22
## MONO MONOperLEU NEUperLEU POT BASOperLEU EOS EOSperLEU TARGET
## TEXTSIMU1 0.62 9 63 4.1 1 0.17 0 0
## TEXTSIMU2 0.62 9 63 4.1 0 0.17 1 0
## TEXTSIMU3 0.62 9 63 4.1 0 0.19 2 0
## TEXTSIMU4 0.62 9 63 4.9 0 0.17 2 0
## TEXTSIMU5 0.62 9 63 3.7 0 0.17 2 0
## TEXTSIMU6 0.62 9 63 3.7 0 0.17 2 0
## LYMPH_NODES KIDNEYS LUNGS LIVER PLEURA OTHER PROSTATE ORCHIDECTOMY
## TEXTSIMU1 0 0 0 0 0 0 0 1
## TEXTSIMU2 0 0 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0 1 0 0
## TEXTSIMU5 0 0 0 1 0 0 0 0
## TEXTSIMU6 1 0 0 0 0 1 0 0
## PROSTATECTOMY LYMPHADENECTOMY BILATERAL_ORCHIDECTOMY
## TEXTSIMU1 1 0 1
## TEXTSIMU2 0 0 0
## TEXTSIMU3 0 0 0
## TEXTSIMU4 0 0 0
## TEXTSIMU5 0 0 0
## TEXTSIMU6 0 0 0
## PRIOR_RADIOTHERAPY ANALGESICS ANTI_ANDROGENS GLUCOCORTICOID
## TEXTSIMU1 1 0 0 0
## TEXTSIMU2 1 1 0 1
## TEXTSIMU3 1 0 0 0
## TEXTSIMU4 0 0 0 0
## TEXTSIMU5 0 0 0 1
## TEXTSIMU6 1 0 0 0
## GONADOTROPIN BISPHOSPHONATE CORTICOSTEROID IMIDAZOLE ACE_INHIBITORS
## TEXTSIMU1 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0
## BETA_BLOCKING HMG_COA_REDUCT ESTROGENS ANTI_ESTROGENS CEREBACC CHF
## TEXTSIMU1 0 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0 1
## TEXTSIMU5 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0
## DVT DIAB MI PULMEMB SPINCOMP COPD MHBLOOD MHCARD MHCONGEN MHEAR
## TEXTSIMU1 0 0 0 0 0 0 0 1 0 0
## TEXTSIMU2 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU4 0 1 0 0 0 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0 0 0 0 1
## MHENDO MHGASTRO MHHEPATO MHIMMUNE MHINFECT MHINJURY MHINVEST MHMETAB
## TEXTSIMU1 0 0 0 0 0 1 0 0
## TEXTSIMU2 0 1 0 0 0 0 0 0
## TEXTSIMU3 1 1 0 0 1 0 0 0
## TEXTSIMU4 0 1 0 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0 0 0
## MHPSYCH MHRENAL MHRESP MHSKIN MHVASC ECOG_C AGEGRP2 RaceAsian
## TEXTSIMU1 0 0 0 0 0 0 2 0
## TEXTSIMU2 0 0 0 0 1 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 1 0
## TEXTSIMU4 0 0 0 0 0 0 1 0
## TEXTSIMU5 0 0 0 0 0 0 1 0
## TEXTSIMU6 0 0 0 0 0 0 2 0
## RaceBlack RaceOther RaceWhite RegionAsia RegionEastEuro
## TEXTSIMU1 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0
## RegionNorthAmer RegionSouthAmer RegionWestEuro
## TEXTSIMU1 0 0 0
## TEXTSIMU2 0 0 0
## TEXTSIMU3 0 0 0
## TEXTSIMU4 0 0 0
## TEXTSIMU5 0 0 0
## TEXTSIMU6 0 0 0
## DEATH LKADT_P surv
## TEXTSIMU1 1 342 342
## TEXTSIMU2 0 360 360+
## TEXTSIMU3 1 682 682
## TEXTSIMU4 0 1067 1067+
## TEXTSIMU5 1 113 113
## TEXTSIMU6 0 1246 1246+
head(xMEDISIMU); head(yMEDISIMU)
## BMI HEIGHTBL WEIGHTBL ALP ALT AST CA CREAT
## MEDISIMU1 28.04282 175 90 5.093750 2.708050 3.349750 1.99 4.488636
## MEDISIMU2 26.57313 176 60 5.017280 3.091042 3.258097 2.41 4.174387
## MEDISIMU3 28.39506 165 65 4.418841 3.332205 3.349750 2.41 4.077537
## MEDISIMU4 24.57787 176 107 5.003946 3.295837 3.349750 2.33 4.634729
## MEDISIMU5 30.58581 188 73 4.158883 2.484907 3.367296 2.34 4.234107
## MEDISIMU6 25.18079 174 86 4.564348 4.882802 3.349750 2.33 4.499810
## HB LDH NEU PLT PSA TBILI TESTO WBC
## MEDISIMU1 10.9 5.327876 1.2149127 186 6.194405 1.386294 -0.08338161 1.609438
## MEDISIMU2 13.3 5.327876 0.7030975 156 2.163323 1.609438 -0.08338161 2.041220
## MEDISIMU3 11.8 5.327876 1.0952734 126 3.713572 1.609438 -0.99425227 1.871802
## MEDISIMU4 13.1 5.327876 0.4946962 217 3.555348 1.791759 -0.08338161 1.568616
## MEDISIMU5 15.3 5.327876 1.1939225 221 3.367296 2.079442 0.78845736 1.704748
## MEDISIMU6 12.8 5.327876 1.9892433 386 3.610918 1.791759 -1.56064775 1.824549
## CREACL NA. MG PHOS ALB TPRO RBC LYM BUN
## MEDISIMU1 0 140 -0.1923903 0.09531018 36.65 68.5 3.91 0.1823216 1.722767
## MEDISIMU2 0 142 -0.1923903 -0.02020271 33.60 68.5 4.28 0.8878913 1.722767
## MEDISIMU3 0 144 -0.1923903 -0.02020271 36.65 68.5 4.62 0.4946962 1.722767
## MEDISIMU4 0 142 -0.1923903 -0.02020271 36.65 69.0 4.35 -0.2744368 1.722767
## MEDISIMU5 0 143 -0.1923903 -0.06187540 36.65 68.5 4.05 0.4946962 1.722767
## MEDISIMU6 0 137 -0.1923903 -0.02020271 36.65 68.5 4.78 0.5128236 1.722767
## CCRC GLU SYSTOLICBP DIASTOLICBP PULSE HEMAT SPEGRA LYMperLEU
## MEDISIMU1 3.800105 1.435085 141.5 77 68 0.37 0 29
## MEDISIMU2 3.746038 1.871802 107.0 77 58 0.34 0 29
## MEDISIMU3 3.800105 1.916923 141.5 77 71 0.43 0 29
## MEDISIMU4 3.800105 1.871802 126.0 90 71 0.40 0 29
## MEDISIMU5 3.800105 1.589235 141.5 77 71 0.35 0 28
## MEDISIMU6 3.800105 1.791759 188.0 77 88 0.38 0 29
## MONO MONOperLEU NEUperLEU POT BASOperLEU EOS EOSperLEU TARGET
## MEDISIMU1 0.60 11 56.5 4.4 0 0.17 3 0
## MEDISIMU2 0.60 11 56.5 4.5 1 0.17 3 0
## MEDISIMU3 0.60 11 56.5 3.7 0 0.17 7 0
## MEDISIMU4 0.88 11 56.5 4.1 0 0.17 3 0
## MEDISIMU5 0.60 11 56.5 4.6 0 0.17 3 0
## MEDISIMU6 0.60 11 56.5 4.0 0 0.17 3 0
## LYMPH_NODES KIDNEYS LUNGS LIVER PLEURA OTHER PROSTATE ORCHIDECTOMY
## MEDISIMU1 0 0 0 0 0 1 0 0
## MEDISIMU2 0 0 0 0 0 0 0 0
## MEDISIMU3 0 0 0 0 0 0 0 0
## MEDISIMU4 1 0 0 0 0 0 0 1
## MEDISIMU5 0 0 0 0 0 1 0 0
## MEDISIMU6 0 0 0 0 0 0 0 0
## PROSTATECTOMY LYMPHADENECTOMY BILATERAL_ORCHIDECTOMY
## MEDISIMU1 0 0 0
## MEDISIMU2 0 0 0
## MEDISIMU3 1 0 0
## MEDISIMU4 0 0 0
## MEDISIMU5 0 0 0
## MEDISIMU6 0 0 0
## PRIOR_RADIOTHERAPY ANALGESICS ANTI_ANDROGENS GLUCOCORTICOID
## MEDISIMU1 1 0 0 0
## MEDISIMU2 1 1 0 0
## MEDISIMU3 1 0 1 1
## MEDISIMU4 0 1 1 1
## MEDISIMU5 1 0 1 1
## MEDISIMU6 1 0 1 1
## GONADOTROPIN BISPHOSPHONATE CORTICOSTEROID IMIDAZOLE ACE_INHIBITORS
## MEDISIMU1 0 1 1 0 0
## MEDISIMU2 0 0 1 0 0
## MEDISIMU3 0 0 1 0 0
## MEDISIMU4 0 0 1 0 0
## MEDISIMU5 0 0 0 0 0
## MEDISIMU6 0 0 1 0 0
## BETA_BLOCKING HMG_COA_REDUCT ESTROGENS ANTI_ESTROGENS CEREBACC CHF
## MEDISIMU1 0 1 0 0 0 0
## MEDISIMU2 0 0 0 0 0 0
## MEDISIMU3 1 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0 0
## MEDISIMU6 1 0 0 0 0 0
## DVT DIAB MI PULMEMB SPINCOMP COPD MHBLOOD MHCARD MHCONGEN MHEAR
## MEDISIMU1 0 0 0 0 0 0 0 1 0 0
## MEDISIMU2 0 0 0 0 0 0 0 1 0 1
## MEDISIMU3 1 0 0 0 0 0 0 1 0 1
## MEDISIMU4 0 1 0 0 0 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0 0 0 0 0 0
## MEDISIMU6 0 1 0 0 0 0 0 0 0 0
## MHENDO MHGASTRO MHHEPATO MHIMMUNE MHINFECT MHINJURY MHINVEST MHMETAB
## MEDISIMU1 0 1 0 0 1 0 0 0
## MEDISIMU2 0 0 0 0 0 1 0 1
## MEDISIMU3 0 0 0 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0 0 0 1
## MEDISIMU5 0 0 0 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0 0 0 0
## MHPSYCH MHRENAL MHRESP MHSKIN MHVASC ECOG_C AGEGRP2 RaceAsian
## MEDISIMU1 0 0 0 0 0 0 2 0
## MEDISIMU2 0 0 0 0 0 0 1 0
## MEDISIMU3 0 0 0 0 0 0 2 0
## MEDISIMU4 0 0 0 1 0 0 2 0
## MEDISIMU5 0 1 0 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0 0 2 0
## RaceBlack RaceOther RaceWhite RegionAsia RegionEastEuro
## MEDISIMU1 0 0 0 0 0
## MEDISIMU2 0 0 0 0 0
## MEDISIMU3 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0
## RegionNorthAmer RegionSouthAmer RegionWestEuro
## MEDISIMU1 0 0 0
## MEDISIMU2 0 0 0
## MEDISIMU3 0 0 0
## MEDISIMU4 0 0 0
## MEDISIMU5 0 0 0
## MEDISIMU6 0 0 0
## DEATH LKADT_P surv
## MEDISIMU1 0 89 89+
## MEDISIMU2 1 754 754
## MEDISIMU3 1 783 783
## MEDISIMU4 0 159 159+
## MEDISIMU5 0 1322 1322+
## MEDISIMU6 1 200 200
library(survival)
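The surv column printed above is a Surv object from the survival package, where a trailing + marks a right-censored follow-up time. The following minimal sketch builds such an object from made-up follow-up times and event indicators; the vectors time and death are hypothetical and are not taken from the example datasets.

```r
library(survival)

# Hypothetical follow-up times (days) and event indicators
time  <- c(342, 360, 682)
death <- c(1, 0, 1)   # 1 = death observed, 0 = right-censored

# Right-censored survival response
s <- Surv(time = time, event = death)
print(s)

# A Surv object behaves like a matrix with columns "time" and "status"
s_time   <- s[, "time"]
s_status <- s[, "status"]
```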
It is important to distinguish between the PSP and PEP objects, which represent a single penalized Cox regression model and an ensemble of Cox regression models, respectively. PSP objects are penalized/regularized Cox regression models fitted to a particular dataset by exploring its \(\{\lambda, \alpha\}\) parameter space. Notice that the sequence of \(\lambda\) values depends on the chosen \(\alpha \in [0,1]\). The regularized/penalized fitting procedure in ePCR is provided by the glmnet package (Simon et al., 2011), although custom cross-validation and other supporting functionality is provided independently.
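To make the roles of \(\lambda\) and \(\alpha\) concrete, the sketch below evaluates the elastic net penalty in the form documented for glmnet, \(\lambda \left[ \alpha \lVert\beta\rVert_1 + \tfrac{1-\alpha}{2} \lVert\beta\rVert_2^2 \right]\). The helper enet_penalty and the coefficient vector are hypothetical names introduced for illustration only; the function is not part of ePCR or glmnet, and it covers the penalty term only, not the Cox partial likelihood.

```r
# Elastic net penalty for a coefficient vector (illustrative helper)
enet_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}

# Hypothetical coefficients
beta <- c(0.5, -1, 0)

pen_lasso <- enet_penalty(beta, lambda = 0.1, alpha = 1)    # L1 norm only
pen_ridge <- enet_penalty(beta, lambda = 0.1, alpha = 0)    # squared L2 norm only
pen_enet  <- enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # mixture of both
```

Because the same \(\lambda\) weighs a different mixture of norms for each \(\alpha\), a \(\lambda\) grid that is sensible for one \(\alpha\) need not be sensible for another.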
After fitting suitable candidate PSP objects (Penalized Single Predictors), these are aggregated into the ensemble structure PEP (Penalized Ensemble Predictor). The key input to the PEP constructor is the list of PSP objects intended for the ensemble. We will start off by introducing the fine-tuning and fitting of PSPs. For this purpose the generic S4-class constructor new will be called with its main parameter indicating that we wish to construct a PSP object.
The key attributes provided for the PSP constructor are the following parameters (see ?'PSP-class' in R for further documentation):
x: The input data matrix, where rows correspond to patients and columns to potential predictors.
y: The Surv-class response vector as required by Cox regression and glmnet in survival prediction.
seeds: An integer vector or a single value for setting the random seed for cross-validation. Setting this is highly recommended for reproducibility. If multiple seed integers are provided, the cross-validation is conducted separately for each. This smooths the cross-validation surface, but multiplies the computational time required to fit a model.
score: The scoring function utilized in evaluating the generalization ability of the fitted model in cross-validation; readily implemented scoring functions include score.iAUC and score.cindex, but custom scoring functions are also allowed.
alphaseq: Sequence of \(\alpha\) values. The extreme ends are LASSO regression (\(\alpha = 1\)) and Ridge Regression (\(\alpha = 0\)); \(\alpha \in ]0,1[\) is generally referred to as Elastic Net. Notice that LASSO and Ridge Regression have noticeably different characteristics as they utilize only the \(L_1\) and \(L_2\) norms, respectively; for example, a Ridge Regression model will never have its coefficients exactly zero. Furthermore, for co-linear predictors LASSO tends to pick a single one, while Ridge Regression picks multiple ones and spreads the overall effect over these predictors. Depending on the ultimate prediction purpose, one may prefer one or the other and can tailor alphaseq to suit their needs. By default we suggest utilizing an evenly spaced alphaseq over \([0,1]\), at least for a preliminary search.
nlambda: Number of \(\lambda\) values tested for each corresponding \(\alpha\). By default glmnet suggests 100 values, which are picked from a feasible range between a model that includes all coefficients and a converged model in which no further penalization is possible.
folds: Number of folds in the cross-validation (minimum 3; the maximum equals the number of observations, i.e. leave-one-out CV).
For the sake of the example, we will construct an ePCR model ensemble that consists of two PSP objects: one from the medication-curated cohort and the other from the text search cohort. We will leave out a small portion of the medication and text search patients as a small test set, to later evaluate the generalization ability of the ensemble. Notice, however, that this is not a proper evaluation: the test patients are not from an independent source, and the result therefore gives an optimistic view of the generalization capability of the model(s).
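The effect of the seeds and folds parameters described above can be sketched in base R. The helper make_folds below is a hypothetical illustration of seeded fold assignment and is not the cross-validation routine used inside ePCR; it only shows why fixing the seed makes the patient-to-fold allocation reproducible, and why a second seed yields a different allocation whose results can be averaged with the first.

```r
# Assign n observations to k cross-validation folds, reproducibly (illustrative)
make_folds <- function(n, k, seed) {
  stopifnot(k >= 3, k <= n)
  set.seed(seed)
  # Balanced fold labels 1..k in random order
  sample(rep(seq_len(k), length.out = n))
}

f1 <- make_folds(n = 120, k = 10, seed = 1)
f2 <- make_folds(n = 120, k = 10, seed = 1)   # same seed: same allocation
f3 <- make_folds(n = 120, k = 10, seed = 2)   # different seed: new allocation

same_seed_identical <- identical(f1, f2)
fold_sizes <- table(f1)
```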
testset <- 1:30
# Medication cohort fit
# Leaving out patients into a separate test set using negative indices
psp_medi <- new("PSP",
  # Input data matrix x (example data loaded previously)
  x = xMEDISIMU[-testset,],
  # Response vector, 'surv'-object
  y = yMEDISIMU[-testset,"surv"],
  # Seeds for reproducibility
  seeds = c(1,2),
  # If user wishes to run the CV binning multiple times,
  # this is possible by averaging over them for smoother CV heatmap.
  cvrepeat = 2,
  # Using the concordance-index as prediction accuracy in CV
  score = score.cindex,
  # Alpha sequence
  alphaseq = seq(from=0, to=1, length.out=6),
  # Using glmnet's default nlambda of 100
  nlambda = 100,
  # Running the nominal 10-fold cross-validation
  folds = 10,
  # x.expand slot is a function that would allow interaction terms
  # For the sake of the simplicity we will consider identity function
  x.expand = function(x) { as.matrix(x) }
)
## --- Initializing new PSP object ---
##
## --- Cross-validation ( 10 -folds) repeat run 1 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Cross-validation ( 10 -folds) repeat run 2 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Computing AUCs for regularization curves for coefficients ---
##
## --- Generating feature list and dictionary ---
##
## --- New PSP object successfully created ---
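The x.expand argument above was set to a plain matrix conversion. As a hedged illustration of what a non-trivial expansion could look like, the sketch below appends all pairwise interaction columns to a data matrix. The function expand_pairwise is a hypothetical example written for this document; it is not shipped with ePCR and is not claimed to match the feature expansion used in any pre-fitted model.

```r
# Append all pairwise products of columns to a numeric matrix (illustrative)
expand_pairwise <- function(x) {
  x <- as.matrix(x)
  stopifnot(ncol(x) >= 2, !is.null(colnames(x)))
  # Each column of 'pairs' holds the indices of one pair of predictors
  pairs <- utils::combn(ncol(x), 2)
  inter <- vapply(
    seq_len(ncol(pairs)),
    function(k) x[, pairs[1, k]] * x[, pairs[2, k]],
    numeric(nrow(x))
  )
  # Keep a matrix shape even when x has a single row
  inter <- matrix(inter, nrow = nrow(x))
  colnames(inter) <- apply(pairs, 2, function(idx) {
    paste(colnames(x)[idx], collapse = ":")
  })
  cbind(x, inter)
}

# Toy input with three hypothetical predictors
toy <- cbind(a = c(1, 2), b = c(3, 4), c = c(5, 6))
toy_expanded <- expand_pairwise(toy)
```

A function of this shape could be passed as x.expand, provided the same expansion is applied to any new data used for prediction.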
The parameters for the second PSP are similar to the ones above. Notice that for each PSP member the user can tailor multiple parameters to best suit the data.
# Text run similar to above
# Leaving out patients into a separate test set using negative indices
psp_text <- new("PSP",
  x = xTEXTSIMU[-testset,],
  y = yTEXTSIMU[-testset,"surv"],
  seeds = c(3,4),
  cvrepeat = 2,
  score = score.cindex,
  alphaseq = seq(from=0, to=1, length.out=6),
  nlambda = 100,
  folds = 10,
  x.expand = function(x) { as.matrix(x) }
)
## --- Initializing new PSP object ---
##
## --- Cross-validation ( 10 -folds) repeat run 1 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Cross-validation ( 10 -folds) repeat run 2 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Computing AUCs for regularization curves for coefficients ---
##
## --- Generating feature list and dictionary ---
##
## --- New PSP object successfully created ---
# Taking a look on the show-method for PSP:
psp_medi
## PSP ePCR object
## N observations: 120
## Optimal alpha: 1
## Optimal lambda: 0.2578574
## Optimal lambda index: 1
# Plot the CV-surface of the fitted PSP:
plot(psp_medi,
# Showing only every 10th row and column name (propagated to heatcv-function)
by.rownames=10, by.colnames=10,
# Adjust main title and tilt the bias of the color key legend (see ?heatcv)
main="C-index CV for psp_medi", bias=0.2)
Notably, the cross-validation surface suggests different optimized penalization parameters for the two ensemble members. This most likely stems from systematic differences between the two cohorts; the ePCR methodology offers an ensemble-driven way to account for such differences between patient substrata.
plot(psp_text,
# Showing only every 10th row and column name (propagated to heatcv-function)
by.rownames=10, by.colnames=10,
# Adjust main title and tilt the bias of the color key legend (see ?heatcv)
main="C-index CV for psp_text", bias=0.2)
In addition to providing the CV-grid, the identified optimal parameters are available for downstream analyses:
psp_medi@optimum
## Alpha AlphaIndex Lambda LambdaIndex
## 1.0000000 6.0000000 0.2578574 1.0000000
psp_text@optimum
## Alpha AlphaIndex Lambda LambdaIndex
## 1.0000000 6.0000000 0.4396716 1.0000000
slotNames(psp_medi)
## [1] "description" "features" "strata" "alphaseq" "cvfolds"
## [6] "nlambda" "cvmean" "cvmedian" "cvstdev" "cvmin"
## [11] "cvmax" "score" "cvrepeat" "impute" "optimum"
## [16] "seed" "x" "x.expand" "y" "fit"
## [21] "criterion" "dictionary" "regAUC"
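For readers unfamiliar with S4 objects, the slot access used above can be tried on a toy class. The class ToyPSP below is hypothetical and has nothing to do with the real PSP class beyond sharing a slot name; the sketch only demonstrates slotNames, the @ operator and the equivalent slot function from the methods package.

```r
library(methods)

# Hypothetical S4 class with two slots
setClass("ToyPSP", representation(optimum = "numeric", description = "character"))

toy <- new("ToyPSP",
  optimum = c(Alpha = 1, Lambda = 0.25),
  description = "toy model"
)

toy_slots  <- slotNames(toy)          # names of all slots
opt_at     <- toy@optimum             # access with the @ operator
opt_slot   <- slot(toy, "optimum")    # equivalent access by slot name
toy_lambda <- opt_at[["Lambda"]]
```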
Once the PSP objects have been constructed, they are aggregated into the corresponding Penalized Ensemble Predictor (PEP). PEP objects aggregate PSP objects from various data slices or optimization criteria, and create an ensemble predictor that averages over the provided single predictors. As such, the most important input is the list of desired PSP objects:
pep_tyks <- new("PEP",
  # The main input is the list of PSP objects
  # These PSPs were constructed using the example code above.
  PSPs = list(psp_medi, psp_text)
)
pep_tyks
## Penalized Ensemble Predictor
## Count of PSPs: 2
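The idea of averaging over single predictors can be sketched with made-up numbers. The example below combines two hypothetical vectors of risk scores in two ways: by the arithmetic mean of the raw scores and by the mean of their within-model ranks. Which aggregation the PEP predict method applies internally is not stated in this document, so both variants are assumptions shown for illustration only.

```r
# Hypothetical risk scores for four patients from two single models
pred_a <- c(p1 = 0.2, p2 = 1.1, p3 = -0.4, p4 = 0.7)
pred_b <- c(p1 = 0.4, p2 = 0.9, p3 = -0.2, p4 = 0.1)
single <- cbind(model_a = pred_a, model_b = pred_b)

# Variant 1: average the raw scores across models
ens_mean <- rowMeans(single)

# Variant 2: rank patients within each model, then average the ranks
# (less sensitive to models whose scores live on different scales)
ens_rank <- rowMeans(apply(single, 2, rank))
```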
# Conduct naive test set evaluation
xtest <- rbind(xMEDISIMU[testset,], xTEXTSIMU[testset,])
ytest <- rbind(yMEDISIMU[testset,], yTEXTSIMU[testset,])
# Perform survival prediction based on the PEP-ensemble we've created
xpred <- predict(pep_tyks, newx=as.matrix(xtest), type="ensemble")
# Construct a survival object using the Surv-class
ytrue <- Surv(time = ytest[,"surv"][,"time"], event = ytest[,"surv"][,"status"])
# Test c-index between our constructed ensemble prediction and true response
tyksscore <- score.cindex(pred = xpred, real = ytrue)
print(paste("TYKS example c-index:", round(tyksscore, 4)))
## [1] "TYKS example c-index: 0.5"
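To help interpret the concordance index reported above, the following base R sketch computes a Harrell-type c-index for right-censored data by brute force over all comparable patient pairs. The function cindex_sketch is a hypothetical teaching aid, not the implementation behind score.cindex, which may treat ties and censoring differently. A value of 1 means the predicted risks order the observed deaths perfectly, 0.5 corresponds to uninformative predictions, and 0 means a perfectly reversed ordering.

```r
# Brute-force concordance index for right-censored data (illustrative)
cindex_sketch <- function(risk, time, status) {
  n <- length(time)
  concordant <- 0
  comparable <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable if patient i had an observed event
      # strictly before the follow-up time of patient j
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1      # higher risk died earlier
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5    # tied predictions count half
        }
      }
    }
  }
  concordant / comparable
}

# Hypothetical follow-up data: third patient is censored
time   <- c(100, 200, 300, 400)
status <- c(1, 1, 0, 1)

c_perfect  <- cindex_sketch(risk = c(4, 3, 2, 1), time, status)
c_reversed <- cindex_sketch(risk = c(1, 2, 3, 4), time, status)
c_constant <- cindex_sketch(risk = rep(1, 4), time, status)
```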
The ePCR R package comes with readily fitted ePCR ensembles from the work by Guinney, Wang, Laajala et al. (2017) as well as from hospital registry cohorts. Due to data confidentiality issues, the original data matrices and responses are not provided in the S4 objects (although normally they would be in the slots @x and @y, respectively).
The original data by Guinney et al. can be accessed in processed form, as raw .csv files or R workspaces, at the corresponding Synapse workspace.
Accessing the Turku University Hospital registry cohort requires a research permit and users are encouraged to contact the Center for Clinical Informatics (Arho.Virkki@tyks.fi) for further information.
Although the original data matrices are not provided, the ensemble model fits and their coefficients as a function of \(\{\lambda, \alpha\}\) are fully functional. They are therefore suitable for conducting predictions for future patients or for studying effects within the estimated models/ensembles. These model objects can be loaded in ePCR using:
data(ePCRmodels)
class(DREAM)
## [1] "PEP"
## attr(,"package")
## [1] ".GlobalEnv"
class(TYKS)
## [1] "PEP"
## attr(,"package")
## [1] "ePCR"
The DREAM S4-object is the top-performing mCRPC OS-predicting ensemble from Guinney et al., while the TYKS models are fitted to the original Turku University Hospital cohorts. These model objects can be used for prediction similarly to the novel S4 PEP object created in the sections above. As an example, if we apply the DREAM model, trained on controlled clinical trials, to the TYKS hospital registry patients, the OS prediction can be conducted using:
# Create a DREAM-matching data input matrix from our xtest and the full data matrix
xtemp <- conforminput(DREAM, xtest)
# Predict survival for our hospital registry example dataset
dreampred <- predict(DREAM,
  # Providing full new data and average prediction over the ensemble members
  newx=xtemp, type="ensemble",
  # Defining that we don't want any further data matrix feature extraction
  # The call to conforminput above already formatted the input data
  x.expand = as.matrix
)
Notice that we utilize the helper function conforminput for feature extraction/creation, as multiple interaction variables were introduced in the original DREAM data matrix and the dimensions would not match in the regression task otherwise.
The following error message is quite commonly encountered when first applying pre-built models to new data:
Error in newx %*% nbeta : Cholmod error ‘X and/or Y have wrong dimensions’ at file ../MatrixOps/cholmod_sdmult.c, line 90
It is raised by the C/Fortran implementation in the glmnet package if the \(\beta\) coefficients do not conform to the dimensions of the new data matrix \(X\). To avoid it, the new data must be processed to have the same number of columns (variables) as the model expects, using functions such as conforminput or the function stored in the x.expand slot of a PEP object.
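Before calling predict on new data, a simple column check can reveal this kind of dimension mismatch early. The helpers check_columns and align_columns below are hypothetical base R utilities written for this document; they are not part of ePCR and do not reproduce what conforminput does (for instance, they do not create interaction variables). The vector of expected column names is likewise assumed to be known to the user.

```r
# Report which expected columns are missing from, or extra in, new data (illustrative)
check_columns <- function(expected, newx) {
  list(
    missing = setdiff(expected, colnames(newx)),
    extra   = setdiff(colnames(newx), expected)
  )
}

# Reorder and subset new data to exactly the expected columns, or fail clearly
align_columns <- function(expected, newx) {
  chk <- check_columns(expected, newx)
  if (length(chk$missing) > 0) {
    stop("Missing columns: ", paste(chk$missing, collapse = ", "))
  }
  newx[, expected, drop = FALSE]
}

# Hypothetical expected predictors and a new data matrix with a different layout
expected <- c("PSA", "ALP", "HB")
newx <- cbind(HB = c(12.1, 13.4), PSA = c(3.5, 4.6), ALP = c(4.8, 4.4), LDH = c(5.2, 5.3))

chk     <- check_columns(expected, newx)
aligned <- align_columns(expected, newx)
```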
# Test c-index between the DREAM ensemble prediction and TYKS true response
dreamscore <- score.cindex(pred = dreampred, real = ytrue)
print(paste("DREAM example c-index:", round(dreamscore, 4)))
## [1] "DREAM example c-index: 0.389"
sessionInfo()
## R Under development (unstable) (2023-09-30 r85239 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=English_Finland.utf8
## [3] LC_MONETARY=English_Finland.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Finland.utf8
##
## time zone: Europe/Helsinki
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] survival_3.5-7 ePCR_0.11.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.6-1.1 glmnet_4.1-8 future.apply_1.11.0
## [4] jsonlite_1.8.7 compiler_4.4.0 Rcpp_1.0.11
## [7] parallel_4.4.0 jquerylib_0.1.4 globals_0.16.2
## [10] splines_4.4.0 yaml_2.3.7 fastmap_1.1.1
## [13] lattice_0.21-8 prodlim_2023.03.31 impute_1.75.1
## [16] Bolstad2_1.0-29 R6_2.5.1 shape_1.4.6
## [19] knitr_1.44 iterators_1.0.14 pec_2023.04.12
## [22] future_1.33.0 bslib_0.5.0 rlang_1.1.1
## [25] cachem_1.0.8 xfun_0.39 sass_0.4.7
## [28] cli_3.6.1 hamlet_0.9.6 digest_0.6.33
## [31] foreach_1.5.2 grid_4.4.0 mvtnorm_1.2-2
## [34] lava_1.7.2.1 timereg_2.0.5 timeROC_0.4
## [37] evaluate_0.21 pracma_2.4.2 data.table_1.14.8
## [40] numDeriv_2016.8-1.1 listenv_0.9.0 codetools_0.2-19
## [43] parallelly_1.36.0 rmarkdown_2.23 tools_4.4.0
## [46] htmltools_0.5.5