Main function TriadSim
TriadSim
is the main function to perform the simulations. The example function call below simulates genotype data for 1000 case-parent triads for 4 chromosomes (chromsomes 1, 8 17, 20) under a genetic main effect scenario with a baseline disease prevalence of P0=0.001 and genetic relative risks of 1.5 and 2 for carrying the first and the second pathway respectively. This function call will write output files in PLINK. The output file names and path to the directory are given by the parameter “out.put.file” and the chromosome number. Each set (.bim, .bed and .fam files) of PLINK files contain genotype data for one chromosome for all simulated samples. The name of the file is the concatenation of the value of the input parameter “out.put.file” and chromosome number. For example, if “out.put.file” is set to be “triad”, the names of the output files will be triad1, triad8, triad17 and triad20 for our example. See R package documentation for more details.
library(TriadSim)
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
input.plink.file <- c(m.file, f.file, k.file)
TriadSim(input.plink.file, out.put.file=file.path(tempdir(),'triad'), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
## [1] 21 118 121 140 155 168 218 383
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.fam
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.bim
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.fam
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.bim
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.fam
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.bim
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.bed (SNP-major mode)
## coercing object of mode numeric to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.fam
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.bim
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.bed (SNP-major mode)
The following call simulates a quantitative trait (by setting “qtl=T”). The function will create 4 sets of plink files, one for each chromosome.
TriadSim(input.plink.file, file.path(tempdir(),‘qtl’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(0.5, 1), risk.pathway.exposed=c(0.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=T)
The following call simulates a scenario that involves gene-environment interaction. The relative risk for the exposure main effect is 1.2. The relative risks for carrying the first and second pathway SNPs are 1.5 and 2 respectively for the exposed individuals and are 1 and 1 for the unexposed individuals.
TriadSim(input.plink.file, file.path(tempdir(),‘gxe’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=0.3, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=FALSE)
The following call simulates a stratified scenario that involves gene-environment interaction. The risk parameters are the same as the scenario above. The “input.plink.file”" is a list of two character vectors. Each vector contains three character strings giving the directory and basename of the PLINK files in one subpopulation.The subpopulations are equally sized (pop1.frac=0.5). The baseline disease prevalence (disease prevalence in the unexposed who carries 0 copy of the risk pathway) is 0.001 in the first subpopulation while that in the second subpopulation is 0.003 (0.001*3). The exposure prevalence in the two subpopulations are 0.1 and 0.3 respectively.
library(TriadSim) m.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_mom’) f.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_dad’) k.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_kid’) m.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_mom’) f.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_dad’) k.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_kid’) input.plink.file2 <- list(c(m.file, f.file, k.file),c(m.file2, f.file2, k.file2))
TriadSim(input.plink.file2, out.put.file=file.path(tempdir(),‘stratified’) , fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=c(0.1, 0.3), pop1.frac=0.5, P0.ratio=3,rcmb.rate, no_cores=1)
To simulate case-control data the function needs to be called twice, calls to simulate cases (is.case=TRUE) and controls (is.case=FALSE) respectively. The script below calls the function to simulate 1000 cases and 1000 controls and writes genotypes of the cases and controls into seperate sets of PLINK files.
For example, use the following to create cases:
TriadSim(input.plink.file,file.path(tempdir(),‘case’) , fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
And use the following to create controls: TriadSim(input.plink.file, file.path(tempdir(),‘ctrl’), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=FALSE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
Some additional details
The source data may contain genotyping errors that cause non-Mendelian inheritance patterns. For these non-Mendelian families, genotypes of the three individuals in the family will be set to missing at the corresponding SNPs. We assume nonpaternity and adoption have both been ruled out in QC for the source data.
This function requires at least two pathway SNPs, eithe two SNPs in the one pathway or two pathways each involving one SNP. If the users are interested in a single SNP scenario one can trick the function by setting the number of pathway to 2, each with a single SNP in the pathway but only the SNP in the first pathway carries a risk while that in the second pathway does not change risk. For example:
TriadSim(input.plink.file, file.path(tempdir(),‘singleSNP’), fr.desire=0.05,pathways=list(1,2),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 1), risk.pathway.exposed=c(1.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
##Facility functions
The following set of functions is provided in case users want to have more control over the simulation parameters. They are called by the main function to generate simulations. Users do not need to call them directly.
###pick_target.snp
Users can manually pick the target SNPs in the pathway or use the facility function pick_target.snp
to pick the set of target SNPs in the pathway(s) based on a desired allele frequency. The example below uses the example files that come with the package to select 8 SNPs with allele frequencies close to 0.05. The function returns the selected target SNPs by giving the row numbers (i.e., the order) of the corresponding SNPs among all the SNPs in the associated “bim” file. For example a return of “1084 2044 3285 4016 5117 6067 7077 8187” means the SNPs at rows 1084 2044 3285 4016 5117 6067 7077 8187 are selected to be the target SNPs in the pathway.
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
picked.target <- pick_target.snp(c(m.file,f.file),0.05, 8)
## [1] 21 118 121 140 155 168 218 383
cat('Target SNPs picked:',picked.target[[2]],'\n')
## Target SNPs picked: 21 118 121 140 155 168 218 383
###get.target.geno
The function get.target.geno
retrieves genotypes of the target SNPs and returns the genotypes of the triads in a list of three elements: the observed genotypes of the mothers from family 1 to family n repeated twice, genotypes of the fathers from family 1 to family n repeated twice and genotypes of children from family 1 to n followed by (stacking on top of) genotypes of the complements in the same family order.
target.snp <- picked.target[[2]]
m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
target.geno <- get.target.geno(c(m.file,f.file,k.file), target.snp,snp.all2)
The output target.geno is a list of three elements, each being a matrix of genotypes
To increase diversity, TriadSim introduces break points at each chromosome, selecting them independently for each triad being simulated. The break points can be picked manually or using the function get.brks
. The function tends to pick the break points at recombination hotspots if such data are passed in as an input parameter rcmb.rate.
found.brks <- get.brks(N.brk=3,n.ped=1000, snp.all2, target.snp,rcmb.rate=rcmb.rate)
breaks <- found.brks[[1]]
family.position <- found.brks[[2]]
This function returns a list of two items. The first is a 1000 x 17 matrix of integers showing where the chromosomal breaks are to take place (in terms of the order of the SNPs in the PLINK files) for each individual in the simulated trios. Each chromosome has 3 breaks, adding to that is the number of breaks between chromosomes, i.e., 3, and the first and the last SNPs, and this is where the 17 comes from. Here 1000 denotes the number of triads in the simulated data as defined by the n.ped
input parameter.
dim(breaks)
## [1] 1000 19
head(breaks)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 99 120 129 150 162 173 181 253 262 297 314 326 352
## [2,] 0 106 120 123 145 164 173 202 226 262 297 313 317 339
## [3,] 0 76 118 133 149 161 173 197 233 263 297 309 326 337
## [4,] 0 96 120 136 152 159 173 196 250 272 297 313 317 337
## [5,] 0 27 120 137 154 166 173 204 232 263 297 306 320 339
## [6,] 0 35 119 130 142 161 173 201 249 264 297 298 324 339
## [,15] [,16] [,17] [,18] [,19]
## [1,] 355 360 376 394 412
## [2,] 355 360 385 398 412
## [3,] 355 370 379 408 412
## [4,] 355 372 384 398 412
## [5,] 355 360 384 407 412
## [6,] 355 369 387 407 412
The second one is a 1000 x 8 matrix showing the chromosomal segments out of which each target SNP is selected for each simulated trio.
dim(family.position)
## [1] 1000 8
head(family.position)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 8 17
## [2,] 1 2 3 4 5 6 8 16
## [3,] 1 2 3 4 5 6 8 17
## [4,] 1 2 3 4 5 6 8 16
## [5,] 1 2 3 4 5 6 8 16
## [6,] 1 2 3 4 5 6 8 16