Computing a PGS using RápidoPGS-multi

Guillermo Reales

2023-10-13

Introduction

RápidoPGS is a tool to quickly compute polygenic scores (PGS) from GWAS summary statistics datasets from either case-control (eg. asthma) or quantitative traits (eg. height and BMI).

Input GWAS summary statistics datasets should have a minimum set of columns, with these names, but in any order:

Also, if doing a PGS on a quantitative trait GWAS dataset, your file must include this:

Installation of RápidoPGS

Current RápidoPGS version (v2.2.0) is available on CRAN, so we can install it using the following code

install.packages("RapidoPGS")

Alternatively, you can download the development version from Github (Note: If you don’t have remotes installed, please install it first:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("GRealesM/RapidoPGS")

Setup

Once installed, let’s load it. This will automatically load all required dependencies.

library(RapidoPGS)

Preparing reference panel

RápidoPGS-multi is slightly less simple to run than RápidoPGS-single, as it requires LD matrices to inform on the relationships between SNPs in the data. To that end we have two options: (1) Use a reference panel, and (2) Use pre-computed LD matrices. In this section we will deal with the first option.

A reference panel is simply genomic data obtained from a representative sample of a larger population. For example, the UK Biobank cohort comprises ~500,000 individuals between 40 and 69 years old from the United Kingdom and can be used as a reference for the UK (or more roughly, European) population. Unfortunately, UK Biobank data is only available to researches upon application, so in this example we will use the fully public 1000 Genomes Project Phase III as a reference panel.

The 1000 Genomes Project Phase III comprises genomic data from 2504 individuals from multiple world populations, and although its sample size is relatively small compared to UK Biobank, it has the nice feature of being open for everyone to use it.

Prior to running RápidoPGS-multi, we need to set up our reference panel. To easen things, you can use the create_1000G, which will do everything for you. Bear in mind that this will take ~60GB of disk memory and a while, depending of your internet connection, so be warned! You can tweak the QC parameters as you see fit. See function documentation for more information.

create_1000G(directory = "ref-data", remove.related=TRUE, qc.maf = 0.01, qc.hwe=1e-10, qc.geno=0, autosomes.only=TRUE)

And that should be it!

Preparing LD matrices (UK Biobank case)

Alternatively, we can use pre-computed LD matrices (from a larger panel, for example). In our paper, we use pre-computed LD matrices from the UK Biobank panel, made publicly available by LDpred2 authors here.

These matrices comprise ~1 million HapMap3 variants, and were computed on 362,320 UK Biobank individuals. We can download and set them up using the following code:

dir.create("ukbb")  # Or whatever you like
download.file("https://ndownloader.figshare.com/articles/13034123/versions/3", destfile="ukbb/LD_ukbb.zip", mode="wb")
unzip(zipfile="ukbb/LD_ukbb.zip", exdir="ukbb/") 

Note: If you get some error or warning when unzipping from the R session (eg. “Zip file is corrupt” warning), try to unzip normally, using either a terminal or right-clicking on the file and extracting it.

With this, your LD matrix directory should be ready. We prepared rapidopgs_multi() function to work out of the box with these LD matrices, which are (1) Separated by chromosomes, (2) class “dsCMatrix”, (3) contained in .rds files, and (4) accompanied by a manifest, map.rds, containing information on the variants in the LD matrices (HapMap3 variants in this case).

Note that these matrices are computed using a European population, so they won’t be the best fit for generating a PGS for a non-European population. If that is your case, you can use the reference panel option, as it contains many populations from most ancestries.

At the moment, this is the only format RápidoPGS supports for pre-computed LD matrices, and we cannot foresee all possible formats, so if you want to use your own matrices with RápidoPGS-multi, please ensure to provide them in the described format.

A brief note on ancestries: By default, RápidoPGS expects European ancestry data, which is the most common in the GWAS field – and an issue to be tackled. However, as a beta, untested option, RápidoPGS can now accommodate other ancestries. To do so, you’ll need to provide either LD matrices or a reference panel from the non-European population, via the LDmatrices or reference arguments, respectively. You’ll also need to provide an LD block file via the LDblocks argument. This file should be analog to our EUR_ld.blocks files in data/ and be in the same format. At the moment we haven’t tested RápidoPGS on different ancestries, but feel free to try and let us know if you find issues doing so.

Loading data

GWAS catalog is an outstanding GWAS summary statistics data source, as it not only puts together tons of public datasets, but it also uses a semi-automatic pipeline to apply quality control and liftover (a.k.a. harmonise) those datasets, thus helping overcome the problem of having GWAS sumstats in so many different formats and builds.

In this vignette, we’ll use GWAS catalog to obtain an example dataset. For this vignette we’ll use a Breast cancer (BRCA) dataset from Michailidou et al., 2017, which is one that we used in our manuscript. This GWAS was performed on 119078 controls and 137045 cases of Breast cancer.

IMPORTANT NOTE: Input GWAS summary statistics and your reference panel (or LD matrices) must be in the same build. Both 1000 Genomes and UK Biobank LD matrices are in GRCh37/hg19 build, while harmonised GWAS catalog files come in GRCh38/hg38.

You can use gwascat.download() to download files from GWAS catalog see the RápidoPGS single vignette. In this case, we downloaded the hg19 version of the file here, sampled 100,000 random SNPs (for educational purposes only, you don’t need to sample SNPs when building your PGS), and saved it as michailidou19, which we’ll now load.

ds <- michailidou19

Computing a PGS using RápidoPGS-multi and LD matrices

Now we’re ready to compute our PGS using UK Biobank LD matrices, this is easily done as follows.

model.LDmat <- rapidopgs_multi(ds, LDmatrices = "ukbb", N = 256123, build = "hg19", trait="cc")

Note that default \(\alpha_{block}\) is \(10^-4\) and \(\alpha_{SNP}\) is 0.01, which is a fast setting, but sheds off many SNPs. Also, sample size (N) is now required by RápidoPGS for both case-control and quantitative datasets. For case-control datasets, N = Ncases + Ncontrols.

You selected hg19 build. Please note that your reference or LDmatrices must be in the same build as your dataset.
Assigning LD blocks using pre-computed hg19 LD blocks.
Done!
Running RapidoPGS-multi model with multiple causal variant assumption for a GWAS summary statistics dataset in hg19, of EUR ancestry, and with N = 256123.
100,000 variants to be matched.
0 ambiguous SNPs have been removed.
9,207 variants have been matched; 0 were flipped and 0 were reversed.
Working on chromosome 1.
  |======================================================================| 100%
Working on chromosome 2.
  |======================================================================| 100%
Working on chromosome 3.
  |======================================================================| 100%
Working on chromosome 4.
  |======================================================================| 100%
Working on chromosome 5.
  |======================================================================| 100%
Working on chromosome 6.
  |======================================================================| 100%
Working on chromosome 7.
  |======================================================================| 100%
Working on chromosome 8.
  |======================================================================| 100%
Working on chromosome 9.
  |======================================================================| 100%
Working on chromosome 10.
  |======================================================================| 100%
Working on chromosome 11.
  |======================================================================| 100%
Working on chromosome 12.
  |======================================================================| 100%
Working on chromosome 13.
  |======================================================================| 100%
Working on chromosome 14.
  |======================================================================| 100%
Working on chromosome 15.
  |======================================================================| 100%
Working on chromosome 16.
  |======================================================================| 100%
Working on chromosome 17.
  |======================================================================| 100%
Working on chromosome 18.
  |======================================================================| 100%
Working on chromosome 19.
  |======================================================================| 100%
Working on chromosome 20.
  |======================================================================| 100%
Working on chromosome 21.
  |======================================================================| 100%
Working on chromosome 22.
  |======================================================================| 100%

After this model.LDmat$weight will contain the weights for each SNP. Bear in mind that depending on your \(\alpha\) settings, your model will have much fewer SNPs than your input file.

Computing a PGS using RápidoPGS-multi and 1000 Genomes Phase III panel

We can compute our PGS using 1000 Genomes Phase III panel as follows:

model.refpanel <- rapidopgs_multi(ds, reference = "ref-data",  N = 256123, build = "hg19", trait="cc", ncores=8)

Here, instead of LDmatrices, we just need to set reference to point to the directory in which our panel lives.

Note that default \(\alpha_{block}\) is \(10^{-4}\) and \(\alpha_{SNP}\) is 0.01, which is a fast setting, but sheds off many SNPs. Also, default ncores = 1, but this approach benefits from parallelisation, so we used 8 CPUs in this example.

You selected hg19 build. Please note that your reference or LDmatrices must be in the same build as your dataset.
Assigning LD blocks using pre-computed hg19 LD blocks.
Done!
Running RapidoPGS-multi model with multiple causal variant assumption for a GWAS summary statistics dataset in hg19, of EUR ancestry, and with N = 256123.
Working on chromosome 1.
Matching and aligning SNPs in chr1 to the reference
7,940 variants to be matched.
489 ambiguous SNPs have been removed.
3,514 variants have been matched; 0 were flipped and 2,660 were reversed.
  |======================================================================| 100%
Working on chromosome 2.
Matching and aligning SNPs in chr2 to the reference
8,221 variants to be matched.
566 ambiguous SNPs have been removed.
3,582 variants have been matched; 0 were flipped and 2,747 were reversed.
  |======================================================================| 100%
Working on chromosome 3.
Matching and aligning SNPs in chr3 to the reference
7,131 variants to be matched.
555 ambiguous SNPs have been removed.
3,207 variants have been matched; 0 were flipped and 2,481 were reversed.
  |======================================================================| 100%
Working on chromosome 4.
Matching and aligning SNPs in chr4 to the reference
7,380 variants to be matched.
538 ambiguous SNPs have been removed.
Some duplicates were removed.
3,313 variants have been matched; 0 were flipped and 2,553 were reversed.
  |======================================================================| 100%
Working on chromosome 5.
Matching and aligning SNPs in chr5 to the reference
6,421 variants to be matched.
432 ambiguous SNPs have been removed.
Some duplicates were removed.
2,948 variants have been matched; 0 were flipped and 2,280 were reversed.
  |======================================================================| 100%
Working on chromosome 6.
Matching and aligning SNPs in chr6 to the reference
6,760 variants to be matched.
469 ambiguous SNPs have been removed.
3,020 variants have been matched; 0 were flipped and 2,335 were reversed.
  |======================================================================| 100%
Working on chromosome 7.
Matching and aligning SNPs in chr7 to the reference
5,930 variants to be matched.
426 ambiguous SNPs have been removed.
2,621 variants have been matched; 0 were flipped and 1,984 were reversed.
  |======================================================================| 100%
Working on chromosome 8.
Matching and aligning SNPs in chr8 to the reference
5,408 variants to be matched.
378 ambiguous SNPs have been removed.
2,313 variants have been matched; 0 were flipped and 1,823 were reversed.
  |======================================================================| 100%
Working on chromosome 9.
Matching and aligning SNPs in chr9 to the reference
4,239 variants to be matched.
337 ambiguous SNPs have been removed.
Some duplicates were removed.
1,931 variants have been matched; 0 were flipped and 1,470 were reversed.
  |======================================================================| 100%
Working on chromosome 10.
Matching and aligning SNPs in chr10 to the reference
4,952 variants to be matched.
339 ambiguous SNPs have been removed.
2,245 variants have been matched; 0 were flipped and 1,713 were reversed.
  |======================================================================| 100%
Working on chromosome 11.
Matching and aligning SNPs in chr11 to the reference
4,925 variants to be matched.
354 ambiguous SNPs have been removed.
2,174 variants have been matched; 0 were flipped and 1,670 were reversed.
  |======================================================================| 100%
Working on chromosome 12.
Matching and aligning SNPs in chr12 to the reference
4,903 variants to be matched.
339 ambiguous SNPs have been removed.
2,225 variants have been matched; 0 were flipped and 1,740 were reversed.
  |======================================================================| 100%
Working on chromosome 13.
Matching and aligning SNPs in chr13 to the reference
3,647 variants to be matched.
270 ambiguous SNPs have been removed.
1,630 variants have been matched; 0 were flipped and 1,252 were reversed.
  |======================================================================| 100%
Working on chromosome 14.
Matching and aligning SNPs in chr14 to the reference
3,289 variants to be matched.
247 ambiguous SNPs have been removed.
1,417 variants have been matched; 0 were flipped and 1,085 were reversed.
  |======================================================================| 100%
Working on chromosome 15.
Matching and aligning SNPs in chr15 to the reference
2,843 variants to be matched.
218 ambiguous SNPs have been removed.
1,253 variants have been matched; 0 were flipped and 999 were reversed.
  |======================================================================| 100%
Working on chromosome 16.
Matching and aligning SNPs in chr16 to the reference
3,250 variants to be matched.
264 ambiguous SNPs have been removed.
1,384 variants have been matched; 0 were flipped and 1,072 were reversed.
  |======================================================================| 100%
Working on chromosome 17.
Matching and aligning SNPs in chr17 to the reference
2,779 variants to be matched.
191 ambiguous SNPs have been removed.
1,255 variants have been matched; 0 were flipped and 961 were reversed.
  |======================================================================| 100%
Working on chromosome 18.
Matching and aligning SNPs in chr18 to the reference
2,735 variants to be matched.
207 ambiguous SNPs have been removed.
1,271 variants have been matched; 0 were flipped and 1,013 were reversed.
  |======================================================================| 100%
Working on chromosome 19.
Matching and aligning SNPs in chr19 to the reference
2,342 variants to be matched.
176 ambiguous SNPs have been removed.
1,068 variants have been matched; 0 were flipped and 830 were reversed.
  |======================================================================| 100%
Working on chromosome 20.
Matching and aligning SNPs in chr20 to the reference
2,130 variants to be matched.
145 ambiguous SNPs have been removed.
922 variants have been matched; 0 were flipped and 702 were reversed.
  |======================================================================| 100%
Working on chromosome 21.
Matching and aligning SNPs in chr21 to the reference
1,329 variants to be matched.
102 ambiguous SNPs have been removed.
582 variants have been matched; 0 were flipped and 439 were reversed.
  |======================================================================| 100%
Working on chromosome 22.
Matching and aligning SNPs in chr22 to the reference
1,446 variants to be matched.
76 ambiguous SNPs have been removed.
642 variants have been matched; 0 were flipped and 487 were reversed.
  |======================================================================| 100%

After this model.refpanel$weight will contain the weights for each SNP. Bear in mind that depending on your \(\alpha\) settings, your model will have much fewer SNPs than your input file.

Conclusion

So those are examples of basic RápidoPGS-multi usage!

Drop us a line if you encounter problems, and we’ll be happy to help.

Good luck!

Guillermo