A Personalized Disease Network (PDN) is a connected graph that represents the disease evolution of an individual patient. We use diagnoses, medical interventions, and their dates to build a PDN. Each node in the PDN represents an event (e.g. Heart failure) and each node is connected to another by a directed edge if the dates fulfill a certain criterion. PDN were introduced by Javier Cabrera and proved to be effective for improving goodness of fit of survival models, dominating individual comorbidity variables. (Cabrera et. al. 2016)
Like many other R packages, the simplest way to obtain glmnet is to install it directly from CRAN. Type the following command in R console:
install.packages("PDN", repos = "https://cran.us.r-project.org")
Users may change the repos options depending on their locations and preferences. Other options such as the directories where to install the packages can be altered in the command. For more details, see help(install.packages). Here the R package has been downloaded and installed to the default directories. Alternatively, users can download the package source at “https://cran.r-project.org/package=PDN” and type Unix commands to install it to the desired location.
The purpose of this section is to give users a general sense of the package, including the components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. Users may have a better idea after this section what functions are available, which one to choose, or at least where to seek help. More details are given in later sections.
First, we load the glmnet package:
library(PDN)
The package contains five functions: buildnetworks, datecut, draw.PDN, draw.PDN.circle and draw.PDN.cluster, and three sample data sets: comorbidity_data, demographic_data and survival_data.
You can visit the help page of the package using the following command:
help(package = "PDN")
For comorbidity_data, its columns represent different diagnoses, its rows represent different patients, and the entries represent the time difference between the respective event and January 1st 1970. NA means the patient does not have the respective diagnosis.
summary(comorbidity_data)
## AFIB HF MR CMTHY
## Min. :13100 Min. :12796 Min. :12925 Min. :12876
## 1st Qu.:15882 1st Qu.:15018 1st Qu.:14979 1st Qu.:15915
## Median :16954 Median :16546 Median :17043 Median :17238
## Mean :16729 Mean :16300 Mean :16467 Mean :16770
## 3rd Qu.:17854 3rd Qu.:17739 3rd Qu.:17740 3rd Qu.:18168
## Max. :19097 Max. :19097 Max. :19191 Max. :19097
## NA's :63 NA's :57 NA's :42
## CHD HTN DM NEO
## Min. :12794 Min. :12794 Min. :12794 Min. :13625
## 1st Qu.:13151 1st Qu.:13262 1st Qu.:13590 1st Qu.:14946
## Median :13602 Median :13866 Median :14474 Median :17517
## Mean :14209 Mean :14319 Mean :14953 Mean :16801
## 3rd Qu.:14867 3rd Qu.:15266 3rd Qu.:15998 3rd Qu.:18526
## Max. :18837 Max. :18528 Max. :18263 Max. :18900
## NA's :3 NA's :1 NA's :43 NA's :71
## COPD RENAL
## Min. :12829 Min. :13166
## 1st Qu.:13944 1st Qu.:16932
## Median :16062 Median :17821
## Mean :15916 Mean :17407
## 3rd Qu.:17955 3rd Qu.:18405
## Max. :19130 Max. :19233
## NA's :30 NA's :31
For demographic_data, it contains the demographic information of each patients It has five columns which are sex, race, hispan, dshyr and prime.
summary(demographic_data)
## SEX RACE HISPAN DSHYR PRIME
## F:40 1 :66 0 :87 Min. :1995 Min. : 11.00
## M:60 21063 :14 9 :10 1st Qu.:2001 1st Qu.: 11.00
## 2 :13 2 : 2 Median :2005 Median : 11.00
## 0 : 2 5 : 1 Mean :2004 Mean : 41.84
## 9 : 2 1 : 0 3rd Qu.:2008 3rd Qu.: 32.00
## 21311 : 1 3 : 0 Max. :2012 Max. :303.00
## (Other): 2 (Other): 0
For survival_data, it contains information of survival days from admission to death, as well as censor information regarding whether the patient is dead or not.
summary(survival_data)
## surdays event
## Min. : 69 Min. :0.00
## 1st Qu.:1300 1st Qu.:0.00
## Median :2466 Median :0.00
## Mean :2817 Mean :0.43
## 3rd Qu.:4213 3rd Qu.:1.00
## Max. :6562 Max. :1.00
In order to create a PDN based on comorbidity_data, we need to find the criterion for each pair of diagnoses that define the connection of the network.
Generally, it is important to choose an appropriate cutoff point such that it captures information of the interaction between the two diagnoses.
For CMTHY and HTN, the distribution of the time difference is shown below:
We used Cox PH regression to find the best criterion for cutoff point following the rule proposed by (Cabrera et. al. 2016).
k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2])
k1[1:20]
## AFIB to HF HF to AFIB AFIB to MR MR to AFIB AFIB to CMTHY
## 6000 6000 1000 1000 900
## CMTHY to AFIB AFIB to CHD CHD to AFIB AFIB to HTN HTN to AFIB
## 900 5500 5500 5500 5500
## AFIB to DM DM to AFIB AFIB to NEO NEO to AFIB AFIB to COPD
## 1400 1400 4700 4700 3300
## COPD to AFIB AFIB to RENAL RENAL to AFIB HF to MR MR to HF
## 3300 1900 1900 5300 5300
By using function buildnetworks, we past comorbidity_data and criterion information we get from the datecut into the function, then we will get the adjacency matrix for the network shown below:
a = buildnetworks(comorbidity_data,k1)
a[1:10,1:10]
## AFIB to HF HF to AFIB AFIB to MR MR to AFIB AFIB to CMTHY
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 1 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
## [6,] 0 0 0 0 0
## [7,] 0 0 0 0 0
## [8,] 0 0 0 0 0
## [9,] 0 0 0 0 0
## [10,] 0 1 0 0 1
## CMTHY to AFIB AFIB to CHD CHD to AFIB AFIB to HTN HTN to AFIB
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 1 0 1
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
## [6,] 0 0 0 0 0
## [7,] 0 0 0 0 0
## [8,] 0 0 0 0 0
## [9,] 0 0 0 0 0
## [10,] 0 1 1 0 1
In our case, the nodes have chronological order, and we opt to incorporate this order in the network visualization for a more informative and clean result. Below left is the example PDN from function draw.PDN.circle.
PDN for individual patient:
Here is the example using the functioin draw.PDN, which utilize network and ggplot2 for better visualization:
By looping each patient through draw.PDN.circle, we obtain the visualizations for the set of patients: Here is an example of using draw.PDN.cluster to visulaize network of cluster, we use different colors to represent weighted connection of each nodes.
## Loading required package: survival
In the below example using PDN network information to perform regression, we combine comorbidity information and the adjacency matrix from buildnetworks as our design matrix which is shown below:
x = cbind(!is.na(comorbidity_data),a)
x[1:10,1:10]
## AFIB HF MR CMTHY CHD HTN DM NEO COPD RENAL
## 1 0 1 1 1 1 1 1 0 1 1
## 2 0 1 1 0 1 1 0 0 1 1
## 3 1 1 0 0 1 1 1 0 1 1
## 4 0 1 1 0 1 1 1 1 1 1
## 5 0 1 0 1 1 1 1 1 1 1
## 6 0 1 0 0 1 1 1 0 0 0
## 7 0 1 0 1 1 1 1 0 1 1
## 8 0 1 0 0 1 1 0 1 1 0
## 9 0 1 0 0 1 1 0 1 1 1
## 10 1 1 0 1 1 1 1 0 0 1
We then use the glmnet package to perform k-fold cross-validation for the Cox model. We extract two optimal lambdas, that is lambda.min: the ?? at which the minimal MSE is achieved, and lambda.1se: the largest ?? at which the MSE is within one standard error of the minimal MSE. We can check the active covariates in our model and see their coefficients based on the lambda we have chosen.
require(survival)
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
y=Surv(survival_data[,1],survival_data[,2])
glm1 = cv.glmnet(x,y,family="cox",alpha=0.8)
b = glmnet(x,y,family="cox",alpha=0.8,lambda=c(glm1$lambda.min,glm1$lambda.1se))$beta
b[1:20,]
## 20 x 2 sparse Matrix of class "dgCMatrix"
## s0 s1
## AFIB . .
## HF . .
## MR . .
## CMTHY . .
## CHD . .
## HTN . .
## DM . -0.52194109
## NEO . .
## COPD . .
## RENAL 0.42446 0.54860645
## AFIB to HF . .
## HF to AFIB . .
## AFIB to MR . .
## MR to AFIB . .
## AFIB to CMTHY . .
## CMTHY to AFIB . .
## AFIB to CHD . -0.03440171
## CHD to AFIB . .
## AFIB to HTN . .
## HTN to AFIB . .
Through stepwise model selection based on the AIC criterion, we can see that the network variables dominate individual comorbidity variables. We conclude that the network information is significant for improving the model compared to just comorbidity information.
isel = b[,2]!=0
bcox = coxph(y~.,data=data.frame(x[,isel]))
step1 = step(bcox)
step1$anova