MAGNAMWAR is a software package for bacterial genome wide association (GWA). Relative to standard approaches for GWA, e.g. in humans, bacterial genomes and phenotyping experiments have unique characteristics that suggest the use of different variant calling and statistical approaches may improve the association analysis (reviewed in Power, et al, 2017; Pubmed ID 27840430). MAGNAMWAR enables GWA based on bacterial gene presence or gene variant, permits the use of different statistical modeling approaches, and incorporates population structure into models based on user-defined parameters. Genes are clustered into orthologous groups (OGs) using the OrthoMCL gene clustering software, and can be statistically associated with raw or aggregate (e.g. mean) phenotype data. MAGNAMWAR is especially useful for performing associations when the phenotypes of phylogenetically disparate taxa are analyzed, and the calling of fine-scale variants (e.g. SNPs) is challenging or inappropriate. On the other hand, OrthoMCL gene clustering software may lack resolution when comparing phenotypes between strains from the same bacterial species, and OrthoMCL becomes computationally prohibitive as the number of sampled taxa increases above several hundred isolates. For such implementations, we recommend users consider other existing software reviewed in Power, et al. 2017 that are designed for processing hundreds or thousands of samples; or for performing SNP-based comparisons where it may be more appropriate to consider factors such as linkage. Users may be especially interested in the bacterial GWA software packages SEER (Pubmed ID 27633831), SCOARY (Pubmed ID 27887642), and PhenoLink (Pubmed ID 22559291), which have different strengths relative to MAGNAMWAR. Additionally, while MAGNAMWAR can be used to associate shotgun metagenomic sequencing data with corresponding sample phenotypes, nonsaturating sequencing of a sample could lead to false positive or false negative results.
This vignette describes a recommended workflow to take full advantage of MAGNAMWAR. The data are from a study on bacterial determinants of Drosophila melanogaster triglyceride content, and are representative of any number of datasets that associate one phenotype with one bacterial species. The bacterial genotypes were called based on individually cultured and sequenced bacterial species, obviating potential complications of non-saturating sequencing depth. The data included in the package and used in documentation examples are highly subsetted (2 orders of magnitude smaller) from the original dataset to increase speed of the example analyses. Most analyses can be run on a standard laptop computer, although we recommend a desktop computer with at least 16GB RAM for large (>500 phenotype measures) datasets. The functions are presented in their recommended order of operations using the case study fruit fly triglyceride data included in the package.
The following table outlines the specific input and outputs per function in the order each function should be used. Only the essential functions are listed; optional functions are discussed in their relevant sections.
Function | Purpose | Input | Output | Case Study Input | Case Study Output |
---|---|---|---|---|---|
FormatMCLFastas | prep amino acid fasta files for OrthoMCL analysis | 4-letter abbreviated amino acid fasta files of every host-mono-associated organism | concatenated fasta file of all inputs, removes any duplicate ids | fastas in extdata/fasta_dir/ | MCLformatted_all.fasta |
FormatAfterOrtho | format output of OrthoMCL clusters to be used with analyses | groups file from OrthoMCL | Parsed data contained in a list of 2 objects (presence absence matrix, protein ids) | extdata/groups_ example_r.txt | after_ortho_format_grps |
AnalyzeOrthoMCL | main analysis of data | FormatAfterOrtho output; phenotypic data | a matrix with 7 variables (cluster group, p-value, corrected p-value…) | after_ortho_format_grps; pheno_data | mcl_mtrx_grps |
JoinRepSeq | appends representative sequences with AnalyzeOrthoMCL matrix | FormatAfterOrtho output; fasta files; output matrix of AnalyzeOrthoMCL | a data frame of the joined matrix | after_ortho_format_grps; extdata/fasta_dir/; mcl_mtrx_grps | joined_mtrx_grps |
The first step is to assign each gene to a cluster of orthologous groups. This pipeline uses OrthoMCL for clustering, either through a local install (requires extensive RAM; see supplementary data) or web-based executable (http://orthomcl.org/orthomcl/; max 100,000 sequences). OrthoMCL software requires specific fasta sequence header formats and that there are no duplicate protein ids. The fasta header format contains 2 pipe-separated pieces of information: a unique 3-4 alphanumeric taxon designation and a unique protein ID classifier, e.g. >apoc|WP_000129691. Two functions aid in file formatting: FormatMCLFastas
, which formats genbank files, and RASTtoGBK
, which formats files from RAST.
FormatMCLFastas
converts amino-acid fasta files to an OrthoMCL compliant format and combines them into a concatenated file called “MCLformatted_all.fasta”. All amino acid fasta files must first be placed in a user-specified directory and given a 3-4 letter alphanumeric name, beginning with a non-numeric character (e.g. aac1.fasta). Example fasta files are included in the MAGNAMWAR package. The output “MCLformatted_all.fasta” file is the input for the OrthoMCL clustering software.
For fasta files that are not in the NCBI format, they should be converted in a separate folder to an NCBI-compatible format before running FormatMCLFastas
. For example, files in the RAST format have the initial header >fig|unique_identifer; not >ref|xxxx|xxxx|unique_identifier|annotation. To convert these or any other files to NCBI-compatible format, use the RASTtoGBK
function to merge the annotation using the unique identifier as a lookup, specifying the name of the fasta file (with a 3-4 letter alphanumeric name beginning with a non-numeric character), path to the reference file containing the annotation, and the output folder. If the reference file bearing the annotation is from RAST the lookup file can be downloaded as the ‘spreadsheet (tab separated text format)’. If the reference file is from a different source, format it so that the unique identifier is in the 2nd column and the reference annotation is in the 8th column.
lfrc_fasta <- system.file('extdata', 'RASTtoGBK//lfrc.fasta', package='MAGNAMWAR')
lfrc_reference <- system.file('extdata', 'RASTtoGBK//lfrc_lookup.csv', package='MAGNAMWAR')
lfrc_path <- system.file('extdata', 'RASTtoGBK//lfrc_out.fasta', package='MAGNAMWAR')
RASTtoGBK(lfrc_fasta,lfrc_reference,lfrc_path)
After formatting fasta files, run OrthoMCL clustering software either locally or online. OrthoMCL documentation describes how local installations can vary the inflation factor parameter to adjust the resolution of gene clustering.
More information about OrthoMCL clustering software can be found at: http://www.orthomcl.org/.
The FormatAfterOrtho
function reformats the direct output from OrthoMCL for the MAGNAMWAR pipeline. To use the command, call FormatAfterOrtho
, specifying the location of the OrthoMCL clusters file (online: OrthologGroups file; local: output of orthomclMclToGroups command) and whether the software was run online (“ortho”; default) or locally (“groups”). The following sample data were derived using local clustering.
# file_groups is the file path to your output file from OrthoMCL
parsed_data <- FormatAfterOrtho(file_groups, "groups")
The new stored variable is a list of matrices. The first matrix is a presence/absence matrix of taxa that bear each cluster of orthologous groups (OGs). The second matrix lists the specific protein ids within each cluster group. The data in each can be accessed by calling parsed_data[[1]]
or parsed_data[[2]]
, respectively, although this is not necessary for subsequent pipeline steps. Because it contains the specific protein ids for each OG, this list is a key input for most subsequent functions.
a subset of resulting variable parsed_data
:
parsed_data[[1]][,1:13]
## aace apap apas atro bsub dmos ecol efao efav ehor geur gfra ghan
## ex_r00000 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00001 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00002 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00003 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "0" "1"
## ex_r00004 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00005 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00006 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00007 "1" "1" "1" "1" "1" "0" "1" "1" "1" "1" "1" "1" "1"
## ex_r00008 "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "1" "0" "0"
## ex_r00009 "1" "1" "0" "0" "1" "0" "1" "0" "0" "1" "0" "0" "1"
## ex_r00010 "0" "1" "0" "1" "1" "0" "1" "1" "1" "1" "1" "1" "1"
parsed_data[[2]][,1:2]
## ex_r00000 ex_r00001
## V2 "aace|ZP_08699022.1" "bsub|NP_389730.2"
## V3 "apap|940282.4.peg.2315" "bsub|NP_390261.1"
## V4 "apas|ZP_16282535.1" "aace|ZP_08696506.1"
## V5 "atro|ZP_08644184.1" "apap|940282.4.peg.1501"
## V6 "bsub|NP_391359.1" "apas|ZP_16281966.1"
## V7 "dmos|ZP_08468863.1" "atro|ZP_08644394.1"
## V8 "ecol|NP_415408.1" "dmos|ZP_08469662.1"
## V9 "efao|YP_005708185.1" "ecol|NP_414920.1"
## V10 "efav|NP_815059.1" "efao|YP_005708912.1"
## V11 "ehor|ZP_08496675.1" "efav|NP_816072.1"
## V12 "geur|ZP_08902778.1" "ehor|ZP_08497259.1"
## V13 "gfra|ZP_11376413.1" "geur|ZP_08902509.1"
## V14 "ghan|ZP_06835021.1" "gfra|ZP_11376782.1"
## V15 "gobo|ZP_08896945.1" "ghan|ZP_06833288.1"
## V16 "gxyl|YP_004867578.1" "gobo|ZP_08898428.1"
## V17 "lani|ZP_08549469.1" "gxyl|YP_004868985.1"
## V18 "lbre|ZP_03940694.1" "lani|ZP_08548797.1"
## V19 "lbuc|YP_004398648.1" "lbuc|YP_004398815.1"
## V20 "lcas|YP_006751102.1" "lcas|YP_006752056.1"
## V21 "lfal|ZP_08312985.1" "lfal|ZP_08312677.1"
## V22 "lfer|ZP_03944435.1" "lfer|ZP_03944015.1"
## V23 "lfru|ZP_08652182.1" "lfru|ZP_08652638.1"
## V24 "llac|1358.26.peg.2055" "llac|1358.26.peg.706"
## V25 "lmai|ZP_09447100.1" "lmai|ZP_09447014.1"
## V26 "lmal|ZP_09441330.1" "lmal|ZP_09441246.1"
## V27 "lpla|YP_004888740.1" "lpla|YP_004888561.1"
## V28 "lrha|YP_003170666.1" "lrha|YP_003171609.1"
## V29 "lver|ZP_09442552.1" "lver|ZP_09443795.1"
## V30 "pbur|ZP_16361594.1" "pbur|ZP_16363149.1"
## V31 "pput|YP_001266156.1" "pput|YP_001270272.1"
## V32 "smar|AMF-0004114" "smar|AMF-0001686"
## V33 "spar|YP_006309044.1" "spar|YP_006310046.1"
## V34 "swit|YP_001264509.1" "swit|YP_001264948.1"
## V35 "efao|YP_005709159.1" "efao|YP_005707743.1"
## V36 "efav|NP_816368.1" "efav|NP_814698.1"
## V37 "lfer|ZP_03944497.1" "llac|1358.26.peg.28"
## V38 "lver|ZP_09443619.1" ""
The AnalyzeOrthoMCL
function performs the statistical tests to compare the phenotypes of taxa bearing or lacking each OG. It requires the R object output of the FormatAfterOrtho
function and a phenotype matrix containing the variables for the statistical tests. Seven different tests are supported, each deriving the significance of OG presence/absence on a response variable, and specified by the following codes:
To run AnalyzeOrthoMCL
, the following parameters are required:
FormatAfterOrtho
In order to create a data frame for your phenotypic data use the following command where file_path
is the path to your phenotypic data file: (the file_path
used in this case study is not available because pheno_data
exists as an .rda file)*
pheno_data <- read.table(file_path, header = TRUE, sep = ',')
A subset of the phenotypic file for TAG content pheno_data
is shown below:
## Treatment RespVar Vial Experiment
## 1 apoc 2.821429 A A
## 2 apoc 5.500000 B A
## 3 apoc 8.464286 C A
## 4 jacg 5.392857 A A
## 5 jacg 2.821429 B A
To run AnalyzeOrthoMCL
use the headers within pheno_data
to specify variables:
It is not necessary to specify the OG presence/absence variable, which is automatically populated from the FormatAfterOrtho
output. All other variables must be specified using column names in the phenotype matrix.
We will thus populate AnalyzeOrthoMCL
using the headers within pheno_data
to specify variables:
For example, to test the effect of gene presence/absence on fat content using a mixed model with nested random effects, the following command should be used:
mcl_matrix <- AnalyzeOrthoMCL(parsed_data,
pheno_data,
model = 'lmeR2nest',
species_name = 'Treatment',
resp = 'RespVar',
rndm1 = 'Experiment',
rndm2 = 'Vial')
An optional but highly recommended parameter of AnalyzeOrthoMCL
is the princ_coord
parameter, which allows the user to incorporate population structure. The options for this parameter are 1, 2, 3, or a decimal. Numbers 1-3 specify how many principal coordinates to include as fixed effects in the model and a decimal specifies to use as many principal coordinates as needed for that decimal percentage of the variance to be accounted for. When a user specifies one of these options, the software automatically calculates a principal coordinates matrix from the FormatAfterOrtho
presence absence matrix, and incorporated the specified number of principal coordinates into the model See below in the QQ Plot section to see an example of plotted analyses using principal coordinates in the model, and how incorporating population structure into the models can improve the statistical distribution of the results.
The resulting output matrix contains 7 columns:
mcl_mtrx[,1:3]
Showing the first three columns of mcl_matrix
.
## OG pval1 corrected_pval1
## [1,] "ex_r00000" "0.0680152959317386" "0.748168255249125"
## [2,] "ex_r00001" "0.0680152959317386" "0.748168255249125"
## [3,] "ex_r00002" "0.000169433279973097" "0.00186376607970407"
## [4,] "ex_r00003" "0.217474911482261" "1"
## [5,] "ex_r00004" "0.4482634415059" "1"
## [6,] "ex_r00005" "0.0899794361093273" "0.989773797202601"
## [7,] "ex_r00007" "0.999790593888518" "1"
## [8,] "ex_r00008" "0.00669725842125168" "0.0736698426337685"
## [9,] "ex_r00009" "0.186956618632854" "1"
## [10,] "ex_r00010" "0.00824936629366202" "0.0907430292302822"
AnalyzeOrthoMCL
also provides for analysis using a survival model. A survival analysis is a method for calculating the difference between two treatments on time-series data, which may not always be normally distributed and is most likely to be relevant to host, but perhaps not bacterial, phenotypes. Users should provide a data frame of their phenotypic data, similar to the data frame described above for the other models. For more information see the survival
package, which includes several useful guides and vignettes (https://CRAN.R-project.org/package=survival).
Briefly, for each individual in a population that reaches a benchmark event, such as death, the time of death is recorded in the ‘time’ column, and a ‘1’ is recorded in the ‘event’ column, indicating the individual died at time X. If an individual left an experiment prematurely (e.g. moved away from a location where a study was conducted, a fly escaped from a vial before death), the event column is labelled ‘0’ to indicate it survived until that point in the experiment. Other metadata, including the sample name, are included on the same row as these 2 data points under other columns in a matrix.
## EXP VIAL BACLO TRT t1 t2 event
## 1 1 01A-6 1 lbrc 0 26.33333 1
## 2 1 01A-7 1 lbrc 0 19.00000 1
## 3 1 01A-8 1 lbrc 0 26.73077 1
Because each individual is specified individually, survival analyses are often conducted on large datasets with thousands of measurements. To expedite the use of survival analyses in iterative BGWA testing, our survival analysis can be multi-threaded to run on multiple cores, using the multi
option. The survmulticensor
option is included to break up the tests for further parallelization. This function allows the user to optionally input a startnum
and stopnum
signifying which and how many tests to run. The output of those certain tests can then be written into the output_dir
where the SurvAppendMatrix
function can be used to concatenate all small tests together.
mcl_mtrx <- AnalyzeOrthoMCL(after_ortho_format, starv_pheno_data, species_name = 'TRT', model='survmulti', time='t2', event='event', rndm1='EXP', rndm2='VIAL', multi=1)
JoinRepSeq
randomly selects a representative protein annotation and amino acid sequence from each OG and appends it to the AnalyzeOrthoMCL
output matrix. The purpose is to identify OG function. The result is a data frame bearing 4 additional variables:
Four inputs are specified for JoinRepSeq
:
FormatAfterOrtho
AnalyzeOrthoMCL
dir <- system.file('extdata', 'fasta_dir', package='MAGNAMWAR')
dir <- paste(dir,'/',sep='')
joined_matrix <- JoinRepSeq(after_ortho_format_grps, dir, mcl_mtrx_grps, fastaformat = 'old')
##
## picking and writing representative sequence for PDG:
## 9.09%__18.18%__27.27%__36.36%__45.45%__54.55%__63.64%__72.73%__81.82%__90.91%__
joined_matrix[1,8:10]
An example of three of the appended columns are shown below:
## rep_taxon rep_id rep_annot
## 1 atro ZP_08644184.1 thioredoxin reductase
MAGNAMWAR statistical analysis can be exported into either tab-separated matrices or into graphical elements.
Allows the user to output all protein sequences in a specified OG in fasta format.
When survival models are used in the MGWA a single .csv file is printed for each test. SurvAppendMatrix
joins each individual file into one complete matrix.
Writes the matrix result from AnalyzeOrthoMCL
into a tab-separated file.
MAGNAMWAR also offers several options for graphical outputs. The several ways to visualize data are explained in detail below.
PDGPlot
visualizes the phenotypic effect of bearing a OG. The phenotype matrix is presented as a bar chart, and gene presence/absence is represented by different bar shading.
Calling PDGPlot
requires the same pheno_data
variable as AnalyzeOrthoMCL
and similarly takes advantage of the user specified column names from the pheno_data
data frame. It also requires the mcl_matrix
object and specification of which OG to highlight.
For example, to visualize the means and standard deviations of the taxa which are present in OG “ex_r00002”, the OG with the lowest corrected p-value (“0.00186”). Green and gray bars represent taxa that do or do not contain a gene in the OG, respectively.
PDGPlot(pheno_data, mcl_matrix, OG = 'ex_r00002', species_colname = 'Treatment', data_colname = 'RespVar', ylab = "TAG Content")
The different taxa can be ordered alphabetically, as above, or by specifying the order with either the tree
parameter (for phylogenetic sorting) or the order
parameter, which takes a vector to determine order. For example, the taxa can be ordered by phylogeny by calling a phylogenetic tree file with the taxon names as leaves (note any taxa in pheno_data
that are not in the tree file will be omitted).
tree <- system.file('extdata', 'muscle_tree2.dnd', package='MAGNAMWAR')
PDGPlot(pheno_data, mcl_matrix, 'ex_r00002', 'Treatment', 'RespVar', ylab = "TAG Content", tree = tree)
PhyDataError
is similar to PDGPlot and adds visualization of phylogenetic tree with the phenotypic means and standard deviation.
Calling PhyDataError
requires the following parameters:
pheno_data
variable as AnalyzeOrthoMCL
(and its column names)mcl_matrix
objectPhyDataError(tree, pheno_data, mcl_matrix, species_colname = 'Treatment', data_colname = 'RespVar', OG='ex_r00002', xlabel='TAG Content')
PDGvOG
produces a histogram of the number of OGs in each phylogenetic distribution group (PDG). A PDG is the set of OGs present and absent in the exact same set of bacterial taxa. The main purpose of the graph is to determine the fraction of OGs that are present in unique or shared sets of taxa, since the phenotypic effect of any OGs in the same PDG cannot be discriminated from each other.
To run PDGvOG
, provide the output of FormatAfterOrtho
. The num
parameter (default 40) is used to specify the amount of OGs per PDG that should be included on the x axis.
For example, in the graph below there are 11 PDGs only contain one OG, meaning that 11 PDGs have one group that has a unique distribution of taxa present and absent.
PDGvOG(parsed_data,0)
Because this data is an extreme subset, it isn’t very informative. A full data set is shown below. This shows us that in this particular OrthoMCL clustering data, 4822 PDGs exist with only 1 unique OG in the PDG, and around 500 PDGs exist with 2 OGs that share the same presence and absence of taxa and so on.
A simple quartile-quartile plot function that is generated using the matrix of AnalyzeOrthoMCL
. To run QQPlotter
, provide the output matrix of AnalyzeOrthoMcl
.
QQPlotter(mcl_matrix)
Using this function, we provide a visualization of the use of the princ_coord
parameter in AnalyzeOrthoMCL
. Notice how each additional principal coordinate reduces statistical inflation. In practice it is usually computationally inexpensive to run several iterations of the analysis with a different number of principal coordinates to test how their incorporation improves the statistical fit.
ManhatGrp
produces a manhattan plot for visual analysis of the output of AnalyzeOrthoMCL
. In a traditional genome wide association study, a Manhattan plot is a visualization tool to identify significant p-values and potential linkage disequilibrium blocks. A traditional Manhattan plot sorts p-value by the SNPs along the 23 chromosomes. In our Manhattan plot, we sort by taxa instead of by chromosome number as shown below. The lines that emerge in our plot show the clustered proteins across taxa (which because they are in the same OG would have the same p-value).
To run ManhatGrp
, provide the output of FormatAfterOrtho
and the output of AnalyzeOrthoMCL
as shown below.
ManhatGrp(parsed_data, mcl_matrix)
## ordering by protein id within taxa...
## creating manhattan plot...
## finished.
Again because this data is an extreme subset, it isn’t very informative. Another full data set example is shown below.