cytoftree
: extension of cytometree
to analyze mass cytometry datacytoftree
cytoftree
is an extension to cytometree
function to analyze mass cytometry data. These data are specific due to a high number of zero and the high number of markers (up to 100 potentially). cytoftree
is based on cytometree
’s algorithm which is the construction of binary tree, whose nodes represents cell sub-populations, and slighly modified to take into account the specification of mass cytometry data.
According to the literature, mass cytometry data must be transform to get best partitions. We propose different transformations: asinh
(as default), biexp
, log10
or none
(without transformation).
At each node, for each marker, the cells with zero values are temporarily set aside from the other cells.
The remaining observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).
If the AIC normalized differences \(D\) are significant, the cells are split into 2 groups according to the bimodal distribution. Cells with low values are annotated -
(no marker) while cells with high values are annotated +
(with marker). The cells with zero values are also annotated -
(no marker).
The binary tree is constructed until the cells can no longer be split into 2 groups.
Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.
cytoftree
In this example, we will use an influenza vaccine response dataset (from ImmuneSpace), with 39 markers. To speed-up the computation, we sampled 10 000 cells from this dataset.
First, we can look the structure and the markers of the data.
## [1] 10000 39
## [1] "Time" "Cell_length" "(In113)Dd_CD57" "(In115)Dd_Dead"
## [5] "(Ce140)Dd_Bead" "(Nd142)Dd_CD19" "(Nd143)Dd_CD4" "(Nd144)Dd_CD8"
## [9] "(Nd146)Dd_IgD" "(Sm147)Dd_CD85j" "(Nd148)Dd_CD11c" "(Sm149)Dd_CD16"
## [13] "(Nd150)Dd_CD3" "(Eu151)Dd_CD38" "(Sm152)Dd_CD27" "(Eu153)Dd_CD11b"
## [17] "(Sm154)Dd_CD14" "(Gd155)Dd_CCR6" "(Gd156)Dd_CD94" "(Gd157)Dd_CD86"
## [21] "(Gd158)Dd_CXCR5" "(Tb159)Dd_CXCR3" "(Gd160)Dd_CCR7" "(Dy162)Dd_CD45RA"
## [25] "(Dy164)Dd_CD20" "(Ho165)Dd_CD127" "(Er166)Dd_CD33" "(Er167)Dd_CD28"
## [29] "(Er168)Dd_CD24" "(Tm169)Dd_ICOS" "(Er170)Dd_CD161" "(Yb171)Dd_TCRgd"
## [33] "(Yb172)Dd_PD-1" "(Yb173)Dd_CD123" "(Yb174)Dd_CD56" "(Lu175)Dd_HLADR"
## [37] "(Yb176)Dd_CD25" "(Ir191)Dd_DNA1" "(Ir193)Dd_DNA2"
Then, we also check the proportion of zero for each marker, particularity of mass cytometry data.
zero_proportion <- apply(IMdata[, -c(1, 2)], MARGIN = 2, FUN = function(x) {
round(prop.table(table(x == 0))["TRUE"] * 100, 2)
})
zero_proportion
## (In113)Dd_CD57 (In115)Dd_Dead (Ce140)Dd_Bead (Nd142)Dd_CD19
## 76.26 60.55 91.00 70.34
## (Nd143)Dd_CD4 (Nd144)Dd_CD8 (Nd146)Dd_IgD (Sm147)Dd_CD85j
## 38.92 45.04 70.61 55.14
## (Nd148)Dd_CD11c (Sm149)Dd_CD16 (Nd150)Dd_CD3 (Eu151)Dd_CD38
## 54.75 63.78 18.75 16.30
## (Sm152)Dd_CD27 (Eu153)Dd_CD11b (Sm154)Dd_CD14 (Gd155)Dd_CCR6
## 23.19 33.66 56.47 68.81
## (Gd156)Dd_CD94 (Gd157)Dd_CD86 (Gd158)Dd_CXCR5 (Tb159)Dd_CXCR3
## 63.89 57.71 56.91 52.07
## (Gd160)Dd_CCR7 (Dy162)Dd_CD45RA (Dy164)Dd_CD20 (Ho165)Dd_CD127
## 29.54 8.67 49.11 34.57
## (Er166)Dd_CD33 (Er167)Dd_CD28 (Er168)Dd_CD24 (Tm169)Dd_ICOS
## 39.43 31.08 64.00 74.00
## (Er170)Dd_CD161 (Yb171)Dd_TCRgd (Yb172)Dd_PD-1 (Yb173)Dd_CD123
## 76.15 90.25 81.13 75.48
## (Yb174)Dd_CD56 (Lu175)Dd_HLADR (Yb176)Dd_CD25 (Ir191)Dd_DNA1
## 66.72 45.16 56.40 1.42
## (Ir193)Dd_DNA2
## 2.63
According to the available markers, a gating strategy may be considered. In this example, we have a gating strategy to conserve only viable cells by splitting on the following markers : DNA1
, DNA2
, Cell_length
, Bead
and Dead
. This way, we can be as close as possible to manual gating. To do this, we have to force the markers with the force_first_marker
option (semi-supervised gating).
Then, to improve the performance of automating gating, we decided to transform data with asinh
transformation (default transformation). Then, we have to choose which markers should be transformed using num_col
argument. The columns Time
et Cell_length
are not mass cytometry measure and shouldn’t be transformed.
num_col <- c(3:ncol(IMdata))
tree <- CytofTree(M = IMdata, minleaf = 1, t = 0.1, verbose = FALSE, force_first_markers = c("(Ir191)Dd_DNA1",
"(Ir193)Dd_DNA2", "Cell_length", "(Ce140)Dd_Bead", "(In115)Dd_Dead"), transformation = "asinh",
num_col = num_col)
## Unable to force split on (In115)Dd_Dead for some node at level5
## Unable to force split on (In115)Dd_Dead for some node at level5
## It works !
## [1] 824
Due to the high number of markers, cytoftree
provides high number of sub-populations. minleaf
value for the minimum of cells by sub-population and t
threshold for the depth of the binary tree can be modified to get more or less sub-populations. The plot_graph
function provides a look on the binary tree, but should be unreadable due to the high number of sub-populations.
The annotation
function allows to recover the incomplete annotation on sub-populations. combinations
option provides the complete annotation on each sub-population.
## Time Cell_length (In113)Dd_CD57 (In115)Dd_Dead (Ce140)Dd_Bead
## 787 1 0 0 0 0
## 593 1 0 0 0 0
## 622 1 0 0 0 0
## 818 1 0 0 0 0
## 636 1 0 0 0 0
## (Nd142)Dd_CD19 (Nd143)Dd_CD4 (Nd144)Dd_CD8 (Nd146)Dd_IgD (Sm147)Dd_CD85j
## 787 0 1 0 0 0
## 593 0 0 1 0 0
## 622 1 0 0 1 1
## 818 0 1 0 0 0
## 636 0 0 0 0 0
## (Nd148)Dd_CD11c (Sm149)Dd_CD16 (Nd150)Dd_CD3 (Eu151)Dd_CD38 (Sm152)Dd_CD27
## 787 0 0 1 0 1
## 593 0 0 1 0 1
## 622 0 0 0 0 0
## 818 0 0 1 0 1
## 636 0 0 0 0 0
## (Eu153)Dd_CD11b (Sm154)Dd_CD14 (Gd155)Dd_CCR6 (Gd156)Dd_CD94 (Gd157)Dd_CD86
## 787 0 0 0 0 0
## 593 0 0 0 0 0
## 622 0 0 1 0 0
## 818 0 0 0 0 0
## 636 0 0 0 0 0
## (Gd158)Dd_CXCR5 (Tb159)Dd_CXCR3 (Gd160)Dd_CCR7 (Dy162)Dd_CD45RA
## 787 0 0 1 1
## 593 0 0 1 1
## 622 1 0 0 1
## 818 0 0 1 0
## 636 0 0 0 0
## (Dy164)Dd_CD20 (Ho165)Dd_CD127 (Er166)Dd_CD33 (Er167)Dd_CD28 (Er168)Dd_CD24
## 787 0 0 0 1 0
## 593 0 0 0 1 0
## 622 1 0 0 0 0
## 818 0 0 0 1 0
## 636 0 0 0 0 0
## (Tm169)Dd_ICOS (Er170)Dd_CD161 (Yb171)Dd_TCRgd (Yb172)Dd_PD-1
## 787 0 0 0 0
## 593 0 0 0 0
## 622 0 0 0 0
## 818 0 0 0 0
## 636 0 0 0 0
## (Yb173)Dd_CD123 (Yb174)Dd_CD56 (Lu175)Dd_HLADR (Yb176)Dd_CD25
## 787 0 0 0 0
## 593 0 0 0 0
## 622 0 0 1 0
## 818 0 0 0 0
## 636 0 0 0 0
## (Ir191)Dd_DNA1 (Ir193)Dd_DNA2 leaves count prop
## 787 1 1 787 357 0.0357
## 593 1 1 593 182 0.0182
## 622 1 1 622 119 0.0119
## 818 1 1 818 96 0.0096
## 636 0 0 636 93 0.0093
Due to the high number of sub-populations, it’s recommended to use RetrievePops
function which provide informations for particular sub-populations.
RetrievePops
: providing informations for particular sub-populationsRetrievePops
provides several informations on specific sub-populations, in particular the proportions and the sub-populations merged.
phenotypes <- list()
phenotypes[["CD4+"]] <- rbind(c("(Ir191)Dd_DNA1", 1), c("(Ir193)Dd_DNA2", 1), c("Cell_length",
0), c("(Ce140)Dd_Bead", 0), c("(In115)Dd_Dead", 0), c("(Sm154)Dd_CD14", 0), c("(Er166)Dd_CD33",
0), c("(Nd150)Dd_CD3", 1), c("(Nd143)Dd_CD4", 1))
phenotypes[["CD8+"]] <- rbind(c("(Ir191)Dd_DNA1", 1), c("(Ir193)Dd_DNA2", 1), c("Cell_length",
0), c("(Ce140)Dd_Bead", 0), c("(In115)Dd_Dead", 0), c("(Sm154)Dd_CD14", 0), c("(Er166)Dd_CD33",
0), c("(Nd150)Dd_CD3", 1), c("(Nd144)Dd_CD8", 1))
pheno_result <- RetrievePops(annot, phenotypes = phenotypes)
# CD4+
pheno_result$phenotypesinfo[[1]]
## $phenotype
## [1] "(Ir191)Dd_DNA1-1" "(Ir193)Dd_DNA2-1" "Cell_length-0" "(Ce140)Dd_Bead-0"
## [5] "(In115)Dd_Dead-0" "(Sm154)Dd_CD14-0" "(Er166)Dd_CD33-0" "(Nd150)Dd_CD3-1"
## [9] "(Nd143)Dd_CD4-1"
##
## $proportion
## [1] 0.1953
##
## $Mergedlabels
## [1] 787 818 786 817 739 773 385 697 471 714 814 824 801 469 806 802 618 640
## [19] 643 789 808 774 822 562 670 220 821 326 804 820 472 556 642 712 66 528
## [37] 557 615 715 770 171 470 713 811 819 215 617 619 646 741 247 355 382 501
## [55] 668 696 816 176 216 351 558 559 666 669 671 698 699 717 718 772 807 327
## [73] 353 586 647 665 672 767 812 815 174 181 183 352 356 387 500 560 561 620
## [91] 644 645 648 716 771 805 113 170 172 175 182 219 221 246 284 386 466 529
## [109] 555 667 788
##
## $Newlabel
## [1] 825
## $phenotype
## [1] "(Ir191)Dd_DNA1-1" "(Ir193)Dd_DNA2-1" "Cell_length-0" "(Ce140)Dd_Bead-0"
## [5] "(In115)Dd_Dead-0" "(Sm154)Dd_CD14-0" "(Er166)Dd_CD33-0" "(Nd150)Dd_CD3-1"
## [9] "(Nd144)Dd_CD8-1"
##
## $proportion
## [1] 0.0752
##
## $Mergedlabels
## [1] 593 673 700 621 474 535 591 186 220 588 289 507 563 429 532 587 590 675 180
## [20] 215 253 329 389 506 534 360 391 565 566 68 114 179 187 216 291 362 388 473
## [39] 564 719 742 743 330 359 361 390 530 178 181 183 290 392 533 182 185 475 531
## [58] 592
##
## $Newlabel
## [1] 826
We can compare proportions providing by automatic gating (cytoftree
) and manual gating for the selected sub-populations.
CD4+ | CD8+ | |
---|---|---|
Manual Gating | 0.182 | 0.065 |
Automating Gating | 0.195 | 0.075 |
cytoftree
provides good results, close to the proportions getting by manual gating.