In this vignette we explore to what extent the peak predictions are
sensitive to the spatial correlation in typical genomic data. We also
demonstrate how to efficiently compute coverage profiles from raw
aligned read data using the data.table
package.
First we load a data set with one row for every genomic region with a unique aligned read, and compute mean read size in bases.
library(data.table)
data(ChIPreads, package="PeakSegDisk", envir=environment())
(experiments <- ChIPreads[, .(
mean.bases=mean(chromEnd-chromStart),
median.bases=median(chromEnd-chromStart),
chromStart=min(chromStart)
), by=list(experiment)])
#> experiment mean.bases median.bases chromStart
#> <char> <num> <num> <int>
#> 1: H3K36me3 93.88568 100 111387372
#> 2: H3K4me3 95.36224 99 175434087
From the table above it is clear that the average read size is about 100 for these two experiments.
Below, for each of the two data sets, we compute a data table with two representations of these aligned reads: count each read at each aligned base in the read (this induces spatial correlation), or just the end/last base of the read (this has no spatial correlation).
end.counts <- ChIPreads[, list(
count=.N #ignores dup reads, sum(count) would not.
), by=list(experiment, chrom, chromEnd)]
(aligned.dt <- rbind(
ChIPreads[, .(
bases.counted="each", experiment, chrom,
chromStart, chromEnd,
count=1)], #ignore duplicate reads.
end.counts[, .(
bases.counted="end", experiment, chrom,
chromStart=chromEnd-1L, chromEnd,
count)]))
#> bases.counted experiment chrom chromStart chromEnd count
#> <char> <char> <char> <int> <int> <num>
#> 1: each H3K36me3 chr9 111387372 111387472 1
#> 2: each H3K36me3 chr9 111387436 111387536 1
#> 3: each H3K36me3 chr9 111387443 111387543 1
#> 4: each H3K36me3 chr9 111387455 111387554 1
#> 5: each H3K36me3 chr9 111387566 111387666 1
#> ---
#> 74841: end H3K4me3 chr2 175506898 175506899 1
#> 74842: end H3K4me3 chr2 175506912 175506913 1
#> 74843: end H3K4me3 chr2 175506943 175506944 1
#> 74844: end H3K4me3 chr2 175506956 175506957 1
#> 74845: end H3K4me3 chr2 175507001 175507002 1
Each row of the data table above describes how one read should be counted in order to compute an aligned read profile.
aligned.dt[, {
as.list(quantile(chromEnd-chromStart))
}, by=.(bases.counted, experiment)]
#> bases.counted experiment 0% 25% 50% 75% 100%
#> <char> <char> <num> <num> <num> <num> <num>
#> 1: each H3K36me3 20 98 100 100 115
#> 2: each H3K4me3 20 97 99 100 107
#> 3: end H3K36me3 1 1 1 1 1
#> 4: end H3K4me3 1 1 1 1 1
The table above shows that when we only count the end of each base, we only count the read at one position. When we count the read at each position, that means from 20 to 115 bases, depending on the read.
Next we compute a count profile for each base in these genomic regions.
(seq.dt <- aligned.dt[, {
event.dt <- rbind(
data.table(count, pos=chromStart+1L),
data.table(count=-count, pos=chromEnd+1L))
edge.vec <- event.dt[, {
as.integer(seq(min(pos), max(pos), l=100))
}]
event.bins <- rbind(
event.dt,
data.table(count=0L, pos=edge.vec))
total.dt <- event.bins[, .(
count=sum(count)
), by=list(pos)][order(pos)]
total.dt[, cum := cumsum(count)]
total.dt[, bin.i := cumsum(pos %in% edge.vec)]
## it is somewhat confusing because total.dt pos is the first base
## with cum, and cum goes all the way up to but not including the
## pos of the next row.
total.dt[, data.table(
chromStart=pos[-.N]-1L,
chromEnd=pos[-1]-1L,
count=cum[-.N],
bin.i=bin.i[-.N])]
}, by=list(bases.counted, experiment, chrom)])
#> bases.counted experiment chrom chromStart chromEnd count bin.i
#> <char> <char> <char> <int> <int> <num> <int>
#> 1: each H3K36me3 chr9 111387372 111387436 1 1
#> 2: each H3K36me3 chr9 111387436 111387443 2 1
#> 3: each H3K36me3 chr9 111387443 111387455 3 1
#> 4: each H3K36me3 chr9 111387455 111387472 4 1
#> 5: each H3K36me3 chr9 111387472 111387536 3 1
#> ---
#> 124619: end H3K4me3 chr2 175506943 175506944 1 99
#> 124620: end H3K4me3 chr2 175506944 175506956 0 99
#> 124621: end H3K4me3 chr2 175506956 175506957 1 99
#> 124622: end H3K4me3 chr2 175506957 175507001 0 99
#> 124623: end H3K4me3 chr2 175507001 175507002 1 99
In contrast to the previous data tables, the table above contains no
information about individual reads. Each row represents how many
reads, count
, have been counted at all positions between
chromStart+1
and chromEnd
. We plot these aligned read counts
below:
if(require(ggplot2)){
gg.data <- ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(
bases.counted ~ experiment,
scales="free",
labeller=label_both)+
geom_step(aes(
chromStart/1e3, count, color=data.type),
data=data.table(seq.dt, data.type="exact"))+
scale_color_manual(values=c(
exact="black",
bins="red",
model="deepskyblue"
))+
scale_x_continuous("Position on hg19 chrom (kb = kilo bases)")
print(gg.data)
}
It is clear from the plot above that, for each experiment (left:H3K36, right:H3K4), the profiles in the top and bottom plots have peaks in similar regions.
Next we compute mean profile in bins, using the bin.i
column we
created earlier, which assigns each genomic region to one of 99 bins.
bin.dt <- seq.dt[, {
bases <- chromEnd - chromStart
data.table(
binStart=min(chromStart),
binEnd=max(chromEnd),
mean.count=sum(count*bases)/sum(bases),
bases=sum(bases)
)}, by=list(bases.counted, experiment, bin.i)]
if(require(ggplot2)){
gg.bins <- gg.data+
geom_step(aes(
binStart/1e3, mean.count, color=data.type),
alpha=0.75,
linewidth=1,
data=data.table(bin.dt, data.type="bins"))+
scale_y_log10("Aligned DNA sequence reads (log scale)")
print(gg.bins)
}
#> Warning in scale_y_log10("Aligned DNA sequence reads (log scale)"): log-10 transformation
#> introduced infinite values.
The mean counts in each bin appear in the plot above as a red line.
Next we compute the optimal segmentation model with 2 peaks for each data set.
if(interactive() && requireNamespace("future"))future::plan("multisession")
segs.dt <- seq.dt[, {
data.dir <- file.path(tempdir(), bases.counted, experiment)
dir.create(data.dir, showWarnings=FALSE, recursive=TRUE)
coverage.bedGraph <- file.path(data.dir, "coverage.bedGraph")
fwrite(
.SD[, .(chrom, chromStart, chromEnd, count)],
coverage.bedGraph,
sep="\t",
quote=FALSE,
col.names=FALSE)
fit <- PeakSegDisk::sequentialSearch_dir(data.dir, 2L, verbose=1)
data.table(fit$segments, data.type="model")
}, by=list(bases.counted, experiment)]
#> Le chargement a nécessité le package : future.apply
#> Next = 0, Inf
#> Next = 110.889323103247
#> Next = 1385.23387063487
#> Next = 25425.6282786644
#> Next = 373396.954452715
#> Next = 116338.388376185
#> Le chargement a nécessité le package : future.apply
#> Next = 0, Inf
#> Next = 192.185946075722
#> Next = 11247.0775923531
#> Next = 176157.439626319
#> Le chargement a nécessité le package : future.apply
#> Next = 0, Inf
#> Next = 4.37603211086448
#> Next = 8.94288335356839
#> Next = 32.2910962473304
#> Next = 839.262095714024
#> Next = 0, Inf
#> Next = 4.20444060519522
#> Next = 20.44214129621
#> Next = 618.412457529431
changes.dt <- segs.dt[, {
.SD[-1]
}, by=list(bases.counted, experiment, data.type)]
if(require(ggplot2)){
gg.model <- gg.bins+
geom_segment(aes(
chromStart/1e3, mean,
xend=chromEnd/1e3, yend=mean,
color=data.type),
data=segs.dt)+
geom_vline(aes(
xintercept=chromEnd/1e3,
color=data.type),
data=changes.dt)
print(gg.model)
}
#> Warning in scale_y_log10("Aligned DNA sequence reads (log scale)"): log-10 transformation
#> introduced infinite values.
The plot above shows that the predicted peaks (blue) of the models occur in similar positions, using either the each or end representations of the data.
Next we compute the difference between predicted peak positions:
peaks.dt <- segs.dt[status=="peak"]
peaks.dt[, peak.i := rep(1:2, l=.N)]
peak.pos.tall <- melt(
peaks.dt,
measure.vars=c("chromStart", "chromEnd"))
peak.pos.wide <- dcast(
peak.pos.tall,
experiment + variable + peak.i ~ bases.counted)
peak.pos.wide[, diff.bases := abs(each-end)]
read.size.panel <- "each"
bases.max.dt <- seq.dt[, .(max.count=max(count)), by=list(bases.counted)]
read.size.y <- bases.max.dt[
read.size.panel, max.count, on=list(bases.counted)]
diff.panel <- "end"
diff.y <- bases.max.dt[
diff.panel, max.count, on=list(bases.counted)]
diff.y <- Inf
diff.vjust <- 1.1
if(require(ggplot2)){
gg.model+
geom_text(aes(
chromStart/1e3, read.size.y, label=sprintf(
"Median read size:\n%.0f bases",
median.bases)),
hjust=0,
vjust=1,
data=data.table(experiments, bases.counted=read.size.panel))+
geom_text(aes(
end/1e3, diff.y,
label=diff.bases,
color=data.type),
data=data.table(
bases.counted=diff.panel,
data.type="model",
peak.pos.wide),
vjust=diff.vjust,
hjust=0)+
geom_text(aes(
chromStart/1e3, diff.y,
label="Peak position\ndifference in bases:",
color=data.type,
),
hjust=0,
vjust=diff.vjust,
data=data.table(
data.type="model",
bases.counted=diff.panel,
experiments["H3K36me3", on=list(experiment)]))
}
#> Warning in scale_y_log10("Aligned DNA sequence reads (log scale)"): log-10 transformation
#> introduced infinite values.
In the plot above the blue numbers show the differences (in bases) between the top and bottom panel peak predictions. It is clear that the peak predictions are highly consistent between the each/end representations, with variation on the order of read size (100 bases).
Overall these results indicate that the peak detection algorithm is highly robust to the spatial correlation that is present in typical ChIP-seq coverage profile data.