TriadSim is a package that can simulate genotypes for case-parent triads, case-control, and quantitative trait samples with realistic linkage diequilibrium structure and allele frequency distribution. For studies of epistasis one can simulate models that involve specific SNPs at specific sets of loci, which we will refer to as “pathways”. TriadSim generates genotype data by resampling triad genotypes from existing data. It takes genotypes in PLINK format as the input files. The genotypes for the mothers, fathers, and children are in separate files. The mothers, fathers, and children must be from the same set of triad families although the ordering of the families can be different for the three sets of data. After reading in the genotypes, a sorting step will reorder the families so that the three individuals in each family can realign.

Main function TriadSim

TriadSim is the main function to perform the simulations. The example function call below simulates genotype data for 1000 case-parent triads for 4 chromosomes (chromsomes 1, 8 17, 20) under a genetic main effect scenario with a baseline disease prevalence of P0=0.001 and genetic relative risks of 1.5 and 2 for carrying the first and the second pathway respectively. This function call will write output files in PLINK. The output file names and path to the directory are given by the parameter “out.put.file” and the chromosome number. Each set (.bim, .bed and .fam files) of PLINK files contain genotype data for one chromosome for all simulated samples. The name of the file is the concatenation of the value of the input parameter “out.put.file” and chromosome number. For example, if “out.put.file” is set to be “triad”, the names of the output files will be triad1, triad8, triad17 and triad20 for our example. See R package documentation for more details.

library(TriadSim)
 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
 input.plink.file <- c(m.file, f.file, k.file)

 TriadSim(input.plink.file, out.put.file=file.path(tempdir(),'triad'), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)
## [1]  21 118 121 140 155 168 218 383
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.fam 
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.bim 
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad1.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.fam 
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.bim 
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad8.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.fam 
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.bim 
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad17.bed (SNP-major mode)
## coercing object of mode  numeric  to SnpMatrix
## Writing FAM file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.fam 
## Writing extended MAP file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.bim 
## Writing BED file to C:\Users\shi2\AppData\Local\Temp\Rtmpi4RtHp/triad20.bed (SNP-major mode)

The following call simulates a quantitative trait (by setting “qtl=T”). The function will create 4 sets of plink files, one for each chromosome.

TriadSim(input.plink.file, file.path(tempdir(),‘qtl’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(0.5, 1), risk.pathway.exposed=c(0.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=T)

The following call simulates a scenario that involves gene-environment interaction. The relative risk for the exposure main effect is 1.2. The relative risks for carrying the first and second pathway SNPs are 1.5 and 2 respectively for the exposed individuals and are 1 and 1 for the unexposed individuals.

TriadSim(input.plink.file, file.path(tempdir(),‘gxe’), fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=0.3, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1, qtl=FALSE)

The following call simulates a stratified scenario that involves gene-environment interaction. The risk parameters are the same as the scenario above. The “input.plink.file”" is a list of two character vectors. Each vector contains three character strings giving the directory and basename of the PLINK files in one subpopulation.The subpopulations are equally sized (pop1.frac=0.5). The baseline disease prevalence (disease prevalence in the unexposed who carries 0 copy of the risk pathway) is 0.001 in the first subpopulation while that in the second subpopulation is 0.003 (0.001*3). The exposure prevalence in the two subpopulations are 0.1 and 0.3 respectively.

library(TriadSim) m.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_mom’) f.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_dad’) k.file <- file.path(system.file(package = “TriadSim”),‘extdata/pop1_4chr_kid’) m.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_mom’) f.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_dad’) k.file2 <- file.path(system.file(package = “TriadSim”),‘extdata/pop2_4chr_kid’) input.plink.file2 <- list(c(m.file, f.file, k.file),c(m.file2, f.file2, k.file2))

TriadSim(input.plink.file2, out.put.file=file.path(tempdir(),‘stratified’) , fr.desire=0.3,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1.2,risk.pathway.unexposed=c(1,1), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=c(0.1, 0.3), pop1.frac=0.5, P0.ratio=3,rcmb.rate, no_cores=1)

To simulate case-control data the function needs to be called twice, calls to simulate cases (is.case=TRUE) and controls (is.case=FALSE) respectively. The script below calls the function to simulate 1000 cases and 1000 controls and writes genotypes of the cases and controls into seperate sets of PLINK files.

For example, use the following to create cases:

TriadSim(input.plink.file,file.path(tempdir(),‘case’) , fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

And use the following to create controls: TriadSim(input.plink.file, file.path(tempdir(),‘ctrl’), fr.desire=0.05,pathways=list(1:4,5:8),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=TRUE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 2), risk.pathway.exposed=c(1.5, 2), is.case=FALSE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

Some additional details

The source data may contain genotyping errors that cause non-Mendelian inheritance patterns. For these non-Mendelian families, genotypes of the three individuals in the family will be set to missing at the corresponding SNPs. We assume nonpaternity and adoption have both been ruled out in QC for the source data.

This function requires at least two pathway SNPs, eithe two SNPs in the one pathway or two pathways each involving one SNP. If the users are interested in a single SNP scenario one can trick the function by setting the number of pathway to 2, each with a single SNP in the pathway but only the SNP in the first pathway carries a risk while that in the second pathway does not change risk. For example:

TriadSim(input.plink.file, file.path(tempdir(),‘singleSNP’), fr.desire=0.05,pathways=list(1,2),n.ped=1000, N.brk=3, target.snp=NA,P0=0.001,is.OR=FALSE,risk.exposure= 1,risk.pathway.unexposed=c(1.5, 1), risk.pathway.exposed=c(1.5, 1), is.case=TRUE, e.fr=NA, pop1.frac=NA, P0.ratio=1,rcmb.rate, no_cores=1)

##Facility functions

The following set of functions is provided in case users want to have more control over the simulation parameters. They are called by the main function to generate simulations. Users do not need to call them directly.

###pick_target.snp Users can manually pick the target SNPs in the pathway or use the facility function pick_target.snp to pick the set of target SNPs in the pathway(s) based on a desired allele frequency. The example below uses the example files that come with the package to select 8 SNPs with allele frequencies close to 0.05. The function returns the selected target SNPs by giving the row numbers (i.e., the order) of the corresponding SNPs among all the SNPs in the associated “bim” file. For example a return of “1084 2044 3285 4016 5117 6067 7077 8187” means the SNPs at rows 1084 2044 3285 4016 5117 6067 7077 8187 are selected to be the target SNPs in the pathway.

 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 picked.target <- pick_target.snp(c(m.file,f.file),0.05, 8)
## [1]  21 118 121 140 155 168 218 383
 cat('Target SNPs picked:',picked.target[[2]],'\n')
## Target SNPs picked: 21 118 121 140 155 168 218 383

###get.target.geno The function get.target.geno retrieves genotypes of the target SNPs and returns the genotypes of the triads in a list of three elements: the observed genotypes of the mothers from family 1 to family n repeated twice, genotypes of the fathers from family 1 to family n repeated twice and genotypes of children from family 1 to n followed by (stacking on top of) genotypes of the complements in the same family order.

target.snp <- picked.target[[2]]
 m.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_mom')
 f.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_dad')
 k.file <- file.path(system.file(package = "TriadSim"),'extdata/pop1_4chr_kid')
target.geno <- get.target.geno(c(m.file,f.file,k.file), target.snp,snp.all2)

The output target.geno is a list of three elements, each being a matrix of genotypes

To increase diversity, TriadSim introduces break points at each chromosome, selecting them independently for each triad being simulated. The break points can be picked manually or using the function get.brks. The function tends to pick the break points at recombination hotspots if such data are passed in as an input parameter rcmb.rate.

found.brks <- get.brks(N.brk=3,n.ped=1000, snp.all2, target.snp,rcmb.rate=rcmb.rate)
breaks <- found.brks[[1]]
family.position <- found.brks[[2]] 

This function returns a list of two items. The first is a 1000 x 17 matrix of integers showing where the chromosomal breaks are to take place (in terms of the order of the SNPs in the PLINK files) for each individual in the simulated trios. Each chromosome has 3 breaks, adding to that is the number of breaks between chromosomes, i.e., 3, and the first and the last SNPs, and this is where the 17 comes from. Here 1000 denotes the number of triads in the simulated data as defined by the n.ped input parameter.

dim(breaks)
## [1] 1000   19
head(breaks)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0   99  120  129  150  162  173  181  253   262   297   314   326   352
## [2,]    0  106  120  123  145  164  173  202  226   262   297   313   317   339
## [3,]    0   76  118  133  149  161  173  197  233   263   297   309   326   337
## [4,]    0   96  120  136  152  159  173  196  250   272   297   313   317   337
## [5,]    0   27  120  137  154  166  173  204  232   263   297   306   320   339
## [6,]    0   35  119  130  142  161  173  201  249   264   297   298   324   339
##      [,15] [,16] [,17] [,18] [,19]
## [1,]   355   360   376   394   412
## [2,]   355   360   385   398   412
## [3,]   355   370   379   408   412
## [4,]   355   372   384   398   412
## [5,]   355   360   384   407   412
## [6,]   355   369   387   407   412

The second one is a 1000 x 8 matrix showing the chromosomal segments out of which each target SNP is selected for each simulated trio.

dim(family.position)
## [1] 1000    8
head(family.position)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    2    3    4    5    6    8   17
## [2,]    1    2    3    4    5    6    8   16
## [3,]    1    2    3    4    5    6    8   17
## [4,]    1    2    3    4    5    6    8   16
## [5,]    1    2    3    4    5    6    8   16
## [6,]    1    2    3    4    5    6    8   16