Contents

1 Overview

Consider the scenario where each batch has been clustered separately, and each cluster has already been annotated with some meaningful biological state. We want to perform correction to examine the relationships between clusters across batches, but in a manner that is more likely to preserve these meaningful within-batch clusters. This is achieved by performing the correction on the cluster centroids and then applying the changes to the per-cell values, which is the basis of the clusterMNN() function in batchelor.

2 Method description

The first step is to compute the centroids for each cluster by simply summing the (log-)expression values for all cells in the same cluster. Note that this is done after cosine normalization if cos.norm=TRUE, see the normalization discussion for more details.

The second step is to perform a PCA on the centroid expression profiles in multiBatchPCA(). We will use all PCs in the downstream analysis, so the PCA itself has no denoising role; rather, we are only using it to allows us to return a LowRankMatrix of per-cell corrected values and to impute corrected values for genes that were not used in the PCA. We don’t need denoising as the cluster centroids should be relatively stable and we don’t need compaction as there are not many clusters compared to cells.

The third step is to use the centroid-level PCs in the reducedMNN() with k=1. The choice of k is important as it guarantees that each cluster matches no more than one other cluster in another batch. We ensure that the correction does not merge together different clusters in the same batch, which would otherwise have been a possibility if many-to-one MNN pairs were permitted. It also allows us to pull out “metaclusters” of clusters based on following single linkages between batches; this would not be reliably done with k > 1 where the entire set of clusters would become a metacluster.

The final, mostly cosmetic step is to use the corrected per-cluster values to correct the per-cell profiles. We compute per-cell correction vectors by averaging across the per-cluster vectors, using a smoothing Gaussian kernel with the bandwidth set to the median distance from each cell to its assigned cluster. This accounts for local differences in the correction while avoiding discrete edge effects if we had just applied each cluster’s correction to its constituent cells.

3 Demonstration on pancreas data

3.1 Setup

First, we set up the Grun human pancreas dataset. We will skip some of the more involved processing for the sake of brevity; read the OSCA book for the not-so-lazy way of doing things.

library(scRNAseq)
X <- GrunPancreasData()

library(scater)
X <- addPerCellQC(X)
qcX <- quickPerCellQC(colData(X))
X <- X[,!qcX$discard]

library(scran)
X <- logNormCounts(X)
decX <- modelGeneVar(X)

# No labels available, so we make our own clusters instead:
X <- runPCA(X, ncomponents=20, subset_row=getTopHVGs(decX, n=2000))
g <- buildSNNGraph(X, use.dimred="PCA")
X$cluster <- igraph::cluster_walktrap(g)$membership

We repeat this for the Muraro human pancreas dataset:

Y <- MuraroPancreasData()
Y <- addPerCellQC(Y)
qcY <- quickPerCellQC(colData(Y))
Y <- Y[,!qcY$discard]

Y <- logNormCounts(Y)
decY <- modelGeneVar(Y)

# Not strictly necessary, but we do this for consistency with above:
Y <- runPCA(Y, ncomponents=20, subset_row=getTopHVGs(decY, n=2000))
g <- buildSNNGraph(Y, use.dimred="PCA")
Y$cluster <- igraph::cluster_walktrap(g)$membership

We compute the HVGs and normalization across batches to get rid of major differences in coverage. Again, read the book for an explanation.

universe <- intersect(rownames(X), rownames(Y))
combined <- combineVar(decX[universe,], decY[universe,])
chosen <- getTopHVGs(combined, n=5000)

library(batchelor)
stuff <- multiBatchNorm(X[universe,], Y[universe,])

3.2 Before correction

One evaluation of the correction is to see if it preserves the separation between clusters identified in the original data. Clusters identified in the same batch obviously don’t suffer from a batch effect, so they shouldn’t be altered by correction; we would ideally see that the clusters remain separated before and after correction.

To obtain a pre-correction baseline for both datasets, we compute cluster purity values using the clusterPurity() function from scran. This defines the purity based on the contamination of each cluster’s region of space by cells from another cluster.

out <- clusterPurity(reducedDim(X), X$cluster, transposed=TRUE)
boxplot(split(out, X$cluster), main="Grun original")

non.na <- !is.na(Y$label)
out <- clusterPurity(reducedDim(Y)[non.na,], Y$label[non.na], transposed=TRUE)
boxplot(split(out, Y$label[non.na]), main="Muraro original")

Values close to 1 indicate that all cells of a cluster are surrounded only by other cells from the same cluster. Our aim here is to compare these distributions before and after each correction.

3.3 Reference correction with fastMNN()

We run fastMNN() to provide a standard for comparing clusterMNN(). We can see decent mixing on the \(t\)-SNE plot:

ref <- fastMNN(stuff[[1]], stuff[[2]], subset.row=chosen)
ref <- runTSNE(ref, dimred="corrected")
plotTSNE(ref, colour_by="batch")

As well as similar purity values as that before correction. Some of the Grun values drop but others increase so we’ll call it even.

out1 <- clusterPurity(reducedDim(ref)[ref$batch==1,], 
    X$cluster, transposed=TRUE)
boxplot(split(out1, X$cluster), main="Grun (fastMNN)")

out2 <- clusterPurity(reducedDim(ref)[ref$batch==2,][non.na,], 
    Y$label[non.na], transposed=TRUE)
boxplot(split(out2, Y$label[non.na]), main="Muraro (fastMNN)")

We could also trick clusterPurity() in quantifying the mixing between batches. Perfectly mixed batches of the same composition and size should see purity values close to \(n^{-1}\) for \(n\) batches. In practice, differences in population composition will cause values to lie anywhere in \([0, 1]\) so I’m not sure that this is worth all that much.

all.clust <- c(X$cluster, Y$label)
n <- table(all.clust)
w <- 1/n[all.clust]
keep <- !is.na(all.clust)

out3 <- clusterPurity(reducedDim(ref)[keep,], 
    ref$batch[keep], 
    weighted=w[keep],
    transposed=TRUE)
boxplot(split(out3, ref$batch))

3.4 Semi-supervised correction

Grun doesn’t have labels, so we’ll just use its clusters in conjunction with Muraro’s labels (we need to remove the NA labels ourselves). We also set subset.row for consistency, though this is less important here as the cluster centroids are less affected by noise than per-cell expression values.

m.out <- clusterMNN(stuff[[1]], stuff[[2]][,non.na], subset.row=chosen,
    clusters=list(X$cluster, Y$label[non.na]))

The main output of clusterMNN() is not actually the per-cell corrected values. Instead, it is the information about which labels/clusters match up with each other across batches. We can extract this from the metadata() by grouping clusters into their metaclusters, whereby we can see that most of the main cell types in the Muraro dataset have matching clusters in the Grun data. This is often sufficient for downstream interpretation if the clusters/labels have already been characterized.

clust.info <- metadata(m.out)$cluster
split(clust.info$cluster, clust.info$meta)
## $`1`
## [1] "1"    "beta"
## 
## $`2`
## [1] "2"    "duct"
## 
## $`3`
## [1] "3"
## 
## $`4`
## [1] "4"     "alpha"
## 
## $`5`
## [1] "5"      "acinar"
## 
## $`6`
## [1] "6"
## 
## $`7`
## [1] "7"     "delta"
## 
## $`8`
## [1] "8"           "mesenchymal"
## 
## $`9`
## [1] "9"
## 
## $`10`
## [1] "10"
## 
## $`11`
## [1] "11"
## 
## $`12`
## [1] "12"
## 
## $`13`
## [1] "13" "pp"
## 
## $`14`
## [1] "endothelial"
## 
## $`15`
## [1] "epsilon"
## 
## $`16`
## [1] "unclear"

Nonetheless, Figure 1 of our Nature paper has to come from somewhere, and so clusterMNN() will also return the per-cell corrected values as a courtesy to enable users to make those damn \(t\)-SNE plots. Again, we can see that the mixing of different cell types across batches is satisfactory without loss of much within-batch heterogeneity.

t.out <- runTSNE(m.out, dimred="corrected")
plotTSNE(t.out, colour_by="batch")

out1 <- clusterPurity(reducedDim(m.out)[m.out$batch==1,],
    X$cluster, transposed=TRUE)
boxplot(split(out1, X$cluster), main="Grun (clusterMNN)")

out2 <- clusterPurity(reducedDim(m.out)[m.out$batch==2,],
    Y$label[non.na], transposed=TRUE)
boxplot(split(out2, Y$label[non.na]), main="Muraro (clusterMNN)")

4 Demonstration on brain data

4.1 Setup

We will repeat this evaluation on a few mouse brain datasets. First, we set up the classic, beloved Zeisel dataset.

library(scRNAseq)
Z <- ZeiselBrainData()

library(scater)
Z <- addPerCellQC(Z, subsets=list(Mito=grep("mt-", rownames(Z))))
qcZ <- quickPerCellQC(colData(Z), percent_subsets="subsets_Mito_percent")
Z <- Z[,!qcZ$discard]

library(scran)
Z <- logNormCounts(Z)
decZ <- modelGeneVar(Z)

# Not technically necessary, but why not? 
Z <- runPCA(Z, ncomponents=20, subset_row=getTopHVGs(decZ, n=2000))

We repeat this for the Tasic dataset:

A <- TasicBrainData()
A <- addPerCellQC(A, subsets=list(Mito=grep("mt_", rownames(A))))
qcA <- quickPerCellQC(colData(A))
A <- A[,!qcA$discard]

A <- logNormCounts(A)
decA <- modelGeneVar(A)

# Not technically necessary, but why not? 
A <- runPCA(A, ncomponents=20, subset_row=getTopHVGs(decA, n=2000))

We compute the HVGs and adjust the scale of coverage.

universe <- intersect(rownames(Z), rownames(A))
combined <- combineVar(decZ[universe,], decA[universe,])
chosen <- getTopHVGs(combined, n=5000)

library(batchelor)
stuff <- multiBatchNorm(Z[universe,], A[universe,])

4.2 Before correction

We examine the cluster purities before any correction, using the author-provided labels available for both datasets.

out <- clusterPurity(reducedDim(Z), Z$level1class, transposed=TRUE)
boxplot(split(out, Z$level1class), main="Zeisel original")

out <- clusterPurity(reducedDim(A), A$broad_type, transposed=TRUE)
boxplot(split(out, A$broad_type), main="Tasic original")

4.3 Reference correction with fastMNN()

We again run fastMNN() to provide a standard for comparison. We see decent mixing as well as some clear Zeisel-specific cell types.

ref <- fastMNN(stuff[[1]], stuff[[2]], subset.row=chosen)
ref <- runTSNE(ref, dimred="corrected")
plotTSNE(ref, colour_by="batch")

out1 <- clusterPurity(reducedDim(ref)[ref$batch==1,],
    Z$level1class, transposed=TRUE)
boxplot(split(out1, Z$level1class), main="Zeisel (fastMNN)")

out2 <- clusterPurity(reducedDim(ref)[ref$batch==2,],
    A$broad_type, transposed=TRUE)
boxplot(split(out2, A$broad_type), main="Tasic (fastMNN)")

4.4 Semi-supervised correction

A key trick is to remove the unclassified cells from the Tasic dataset. This is literally a mystery bag of cells from across the coordinate space that have no assigned identity; the concept of a cluster centroid does not make sense here, and keeping these cells will cause inappropriate MNNs to form with a meaningless “unclassified” cell type.

not.unclass <- A$broad_type!="Unclassified"
m.out2 <- clusterMNN(stuff[[1]], stuff[[2]][,not.unclass], subset.row=chosen,
    clusters=list(Z$level1class, A$broad_type[not.unclass]))

We can see that the metaclusters are quite sensible across batches.

clust.info <- metadata(m.out2)$cluster
split(clust.info$cluster, clust.info$meta)
## $`1`
## [1] "astrocytes_ependymal" "Astrocyte"           
## 
## $`2`
## [1] "endothelial-mural" "Endothelial Cell" 
## 
## $`3`
## [1] "interneurons"      "GABA-ergic Neuron"
## 
## $`4`
## [1] "microglia" "Microglia"
## 
## $`5`
## [1] "oligodendrocytes" "Oligodendrocyte" 
## 
## $`6`
## [1] "pyramidal CA1"
## 
## $`7`
## [1] "pyramidal SS"         "Glutamatergic Neuron"
## 
## $`8`
## [1] "Oligodendrocyte Precursor Cell"

Cluster purities are also comparable to what we saw for the original and fastMNN()-corrected values:

out1 <- clusterPurity(reducedDim(m.out2)[m.out2$batch==1,],
    Z$level1class, transposed=TRUE)
boxplot(split(out1, Z$level1class), main="Zeisel (clusterMNN)")

out2 <- clusterPurity(reducedDim(m.out2)[m.out2$batch==2,],
    A$broad_type[not.unclass], transposed=TRUE)
boxplot(split(out2, A$broad_type[not.unclass]), main="Tasic (clusterMNN)")

One key point is that the correction does not care - at all - about the structure within each cluster. In this case, the neuronal clusters are actually fairly heterogeneous, but no attempt is made by clusterMNN() to match up the internal structure of each cluster across batches. This causes some incomplete mixing within clusters compared to the fastMNN() output, but clusterMNN()’s use case doesn’t really care about that.

m.out2 <- runTSNE(m.out2, dimred="corrected")
plotTSNE(m.out2, colour_by="batch")

5 Session information

sessionInfo()
## R Under development (unstable) (2019-10-31 r77342)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
## LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] batchelor_1.3.8             scran_1.15.14              
##  [3] scater_1.15.12              ggplot2_3.2.1              
##  [5] scRNAseq_2.1.5              SingleCellExperiment_1.9.1 
##  [7] SummarizedExperiment_1.17.1 DelayedArray_0.13.1        
##  [9] BiocParallel_1.21.2         matrixStats_0.55.0         
## [11] Biobase_2.47.2              GenomicRanges_1.39.1       
## [13] GenomeInfoDb_1.23.1         IRanges_2.21.2             
## [15] S4Vectors_0.25.8            BiocGenerics_0.33.0        
## [17] BiocStyle_2.15.3           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] httr_1.4.1                    tools_4.0.0                  
##  [5] backports_1.1.5               R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.1.0                     lazyeval_0.2.2               
## [11] colorspace_1.4-1              withr_2.1.2                  
## [13] tidyselect_0.2.5              gridExtra_2.3                
## [15] bit_1.1-14                    curl_4.3                     
## [17] compiler_4.0.0                BiocNeighbors_1.5.1          
## [19] labeling_0.3                  bookdown_0.16                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.23                
## [25] rmarkdown_2.0                 XVector_0.27.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] limma_3.43.0                  dbplyr_1.4.2                 
## [31] fastmap_1.0.1                 rlang_0.4.2                  
## [33] RSQLite_2.1.5                 shiny_1.4.0                  
## [35] DelayedMatrixStats_1.9.0      farver_2.0.1                 
## [37] dplyr_0.8.3                   RCurl_1.95-4.12              
## [39] magrittr_1.5                  BiocSingular_1.3.1           
## [41] GenomeInfoDbData_1.2.2        Matrix_1.2-18                
## [43] Rcpp_1.0.3                    ggbeeswarm_0.6.0             
## [45] munsell_0.5.0                 viridis_0.5.1                
## [47] lifecycle_0.1.0               edgeR_3.29.0                 
## [49] stringi_1.4.3                 yaml_2.2.0                   
## [51] zlibbioc_1.33.0               Rtsne_0.15                   
## [53] BiocFileCache_1.11.4          AnnotationHub_2.19.3         
## [55] grid_4.0.0                    blob_1.2.0                   
## [57] promises_1.1.0                dqrng_0.2.1                  
## [59] ExperimentHub_1.13.5          crayon_1.3.4                 
## [61] lattice_0.20-38               cowplot_1.0.0                
## [63] locfit_1.5-9.1                zeallot_0.1.0                
## [65] knitr_1.26                    pillar_1.4.3                 
## [67] igraph_1.2.4.2                glue_1.3.1                   
## [69] BiocVersion_3.11.1            evaluate_0.14                
## [71] BiocManager_1.30.10           vctrs_0.2.1                  
## [73] httpuv_1.5.2                  gtable_0.3.0                 
## [75] purrr_0.3.3                   assertthat_0.2.1             
## [77] xfun_0.11                     rsvd_1.0.2                   
## [79] mime_0.8                      xtable_1.8-4                 
## [81] later_1.0.0                   viridisLite_0.3.0            
## [83] tibble_2.1.3                  AnnotationDbi_1.49.0         
## [85] beeswarm_0.2.3                memoise_1.1.0                
## [87] statmod_1.4.32                interactiveDisplayBase_1.25.0