# `fastMNN` algorithm

The `fastMNN()` approach is much simpler than the original `mnnCorrect()` algorithm, and proceeds in several steps.
We first perform a PCA across all cells in all batches to reduce dimensionality. This decreases the size of the data for further analysis, in particular improving the speed of nearest-neighbour detection. It also removes high-dimensional technical noise that can interfere with nearest-neighbour detection. We stress that this step plays no role in the batch correction itself; indeed, we would expect the first few PCs to be dominated by the batch effect. Nor does it compromise the validity of the MNN approach, which is based on distances between cells. Provided enough PCs are taken (default `d=50`), the distances between cells in the PC space approximate the distances in the original space well.
The procedure itself is conceptually as simple as `cbind`ing all datasets together and performing a PCA on the merged dataset. The only modification to the procedure is to ensure that each batch contributes equally to the basis vectors. Specifically, the mean vector used for centering is defined as the grand mean of the batch-specific mean vectors; and the contribution of each batch to the gene-gene covariance matrix is divided by the number of cells in that batch. This ensures that batches with large numbers of cells do not dominate the PCA calculations. All cells are then projected into the PC space using the identified basis vectors.
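To make this concrete, here is a minimal sketch of the batch-weighted PCA in base R, assuming two log-expression matrices `B1` and `B2` with genes in rows and cells in columns. The function name and structure are hypothetical; this illustrates the weighting scheme described above, not the actual `fastMNN()` internals.

```r
# A sketch of a batch-weighted PCA, assuming matrices 'B1' and 'B2'
# with genes in rows and cells in columns. Illustrative only.
multiBatchPCA_sketch <- function(B1, B2, d = 50) {
    # Centre on the grand mean of the batch-specific mean vectors,
    # so that each batch contributes equally to the centering.
    grand.mean <- (rowMeans(B1) + rowMeans(B2)) / 2
    C1 <- B1 - grand.mean
    C2 <- B2 - grand.mean

    # Each batch's contribution to the gene-gene covariance matrix
    # is divided by its number of cells, again for equal weighting.
    covar <- tcrossprod(C1) / ncol(C1) + tcrossprod(C2) / ncol(C2)

    # Take the first 'd' eigenvectors as the basis vectors.
    basis <- eigen(covar, symmetric = TRUE)$vectors[, seq_len(d)]

    # Project all cells onto the basis vectors (cells as rows).
    list(pc1 = crossprod(C1, basis), pc2 = crossprod(C2, basis))
}
```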
An interesting question is whether orthogonality is preserved in the low-dimensional subspace. Consider the batch effect vector \(W\) and the biological subspace defined by the column vectors of \(\mathbf{B}\) (for simplicity, we will assume that all vectors are unit length). In the original expression space, we have assumed that \(W\) is orthogonal to \(\mathbf{B}\), i.e., \(W^T\mathbf{B} = 0\). This ensures that MNN pairs are correctly identified between corresponding subpopulations in different batches, as described in the original paper.
Now, consider a projection matrix \(\mathbf{V}_k\) from the PCA, corresponding to the first \(k\) PCs. The key assumption is that \(\mathbf{V}_k\) “captures” the entirety of \(W\), i.e., there is some linear combination \(A\) of the column vectors of \(\mathbf{V}_k\) such that \(\mathbf{V}_k A = W\). Note that this means that \(A = \mathbf{V}_k^T W\), due to the orthonormality of the column vectors of \(\mathbf{V}_k\).
If we project everything into the subspace defined by \(\mathbf{V}_k\), the rotated biological subspace becomes \(\mathbf{B}_k = \mathbf{V}^T_k \mathbf{B}\). The rotated batch vector becomes \(W_k = \mathbf{V}_k^TW\), which is equal to \(A\). This means that
\[ W_k^T \mathbf{B}_k = A^T \mathbf{V}_k^T \mathbf{B} = (\mathbf{V}_k A)^T \mathbf{B} = W^T\mathbf{B} = 0 \;, \]
i.e., orthogonality is preserved in the PC space. This means that it is valid to identify MNN pairs in the PC space, provided the assumption above holds.
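The argument above is easy to verify numerically. The sketch below constructs a random orthonormal biological subspace and batch vector, a set of PCs whose span captures the batch vector, and checks the rotated inner products. All names here are illustrative.

```r
# Numerical sanity check of orthogonality preservation in PC space.
set.seed(0)
g <- 100                                  # number of genes
Q <- qr.Q(qr(matrix(rnorm(g * 10), g)))   # orthonormal columns
B <- Q[, 1:5]                             # biological subspace
W <- Q[, 6]                               # unit batch vector, W^T B = 0

# PCs whose first basis vector is W itself, so span(Vk) captures W.
Vk <- qr.Q(qr(cbind(W, matrix(rnorm(g * 19), g))))

Wk <- crossprod(Vk, W)                    # rotated batch vector (= A)
Bk <- crossprod(Vk, B)                    # rotated biological subspace
max(abs(crossprod(Wk, Bk)))               # effectively zero
```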
The use of mutually nearest neighbours is largely the same as described in the original manuscript and for `mnnCorrect()`. As before, MNNs will be identified between cells of the same type or state across batches, assuming that the batch effect is orthogonal. The only modification is that nearest-neighbour identification is now performed in the PC space rather than the original gene expression space. This is perfectly valid if one treats Euclidean distances in PC space as an approximation of the distances in gene expression space.
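For illustration, a minimal sketch of MNN pair identification is shown below, using `queryKNN()` from the BiocNeighbors package (mentioned later in this document) and PC coordinate matrices with cells as rows, as returned by the earlier sketch. The wrapper itself is hypothetical.

```r
# A sketch of MNN pair identification in PC space, assuming coordinate
# matrices 'pc1' and 'pc2' with cells as rows. Illustrative only.
library(BiocNeighbors)
findMNN_sketch <- function(pc1, pc2, k = 20) {
    # For each cell in one batch, find its k nearest neighbours
    # in the other batch, and vice versa.
    hits21 <- queryKNN(pc1, query = pc2, k = k)$index
    hits12 <- queryKNN(pc2, query = pc1, k = k)$index

    # Candidate pairs (i, j): j is among cell i's neighbours in batch 2.
    pairs <- cbind(
        first = rep(seq_len(nrow(pc1)), each = k),
        second = as.vector(t(hits12))
    )

    # Keep only mutual pairs: i must also be among j's neighbours.
    is.mutual <- vapply(seq_len(nrow(pairs)), function(p) {
        pairs[p, "first"] %in% hits21[pairs[p, "second"], ]
    }, logical(1))
    pairs[is.mutual, , drop = FALSE]
}
```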
The choice of `k` should be driven by the minimum subpopulation size in each batch, with a default of `k=20`. Larger `k` will improve the stability of the correction vectors and provide some robustness to non-orthogonality. Specifically, slight non-orthogonality can cause MNN pairs to be defined at the boundaries of each population, which results in incomplete merging as only the boundaries are brought into contact. However, too large a `k` will cause MNN pairs to be identified between the wrong populations, which will obviously result in incorrect merges.
Of course, it is (again) possible that MNNs are incorrectly identified between cells of different type or state across batches. This would require the presence of unique subpopulations in each batch that are closer to each other than to the subpopulations shared across batches. These two distinct subpopulations (i.e., genuinely separated in the biological subspace) would be incorrectly merged by `fastMNN()`. Such a scenario is somewhat unfortunate but hard to salvage: even with manual curation, it would be difficult to consider these subpopulations as distinct cell types in the presence of an arbitrary batch effect.
Once MNN pairs are identified, the correction vector for each paired cell in the target batch is computed. If a paired cell is involved in multiple MNN pairs, its correction vector is defined as an average across all of its pairs. The average batch vector is then computed by averaging across the correction vectors for all paired cells. This represents an estimate of the overall batch effect.
We then project all cells in the target batch onto the average batch vector, yielding a per-cell component in the direction of the average batch vector. Any variation in the components represents uninteresting technical noise, assuming orthogonality between batch and biological effects. This is eliminated by adjusting the cell coordinates so that the components of all cells are equal to the mean value within the target batch. We repeat this for the reference batch.
Note that this step is not the batch correction itself; we are simply removing variation within each batch. The aim is to avoid the “kissing” problem for dense subpopulations, whereby MNNs are only identified on the surface of each subpopulation. In such cases, subsequent correction will fail to fully merge subpopulations, as the correction vectors only bring the surfaces into contact. By removing variation along the batch vector, we avoid this problem as the subpopulations no longer have any “width” in either batch.
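The sketch below makes this concrete, assuming a matrix `pcs` of PC coordinates with cells as rows and an average batch vector `batch.vec`. The names are hypothetical; this mirrors the description above rather than the literal `fastMNN()` code.

```r
# A sketch of removing variation along the average batch vector.
removeAlongBatchVector <- function(pcs, batch.vec) {
    batch.vec <- batch.vec / sqrt(sum(batch.vec^2))  # ensure unit length
    comp <- as.vector(pcs %*% batch.vec)             # per-cell components

    # Shift each cell so that all components equal the batch's mean
    # component, squashing the batch flat along the batch vector.
    pcs - outer(comp - mean(comp), batch.vec)
}
```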
That said, the use of the average batch vector assumes that the batch effect has the same direction in all subpopulations. This ignores variation in the batch directions across subpopulations, so some kissing may still be expected when this variation is present. In such cases, the only way to improve the correction is to increase `k` to allow the correct MNNs to be identified beyond the population surface.
`fastMNN()` will also compute the percentage of variance removed by this orthogonalization procedure. This is done for both the target and reference batches. If a high percentage of variance is removed, this suggests that there is biological structure parallel to the average batch vector. Orthogonalization will subsequently remove this structure, which would not be appropriate. In this manner, we can use the percentage of variance removed as a diagnostic for the orthogonality assumptions of the MNN procedure.
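Under the same assumptions as the sketch above (PC coordinates with cells as rows, a unit-length batch vector), this diagnostic amounts to the proportion of the total variance that lies along the batch vector. A minimal illustration:

```r
# A sketch of the lost-variance diagnostic, using the same hypothetical
# 'pcs' (cells as rows) and unit-length 'batch.vec' as above.
lostVariance <- function(pcs, batch.vec) {
    comp <- as.vector(pcs %*% batch.vec)      # components along batch.vec
    centered <- sweep(pcs, 2, colMeans(pcs))  # centre within the batch
    sum((comp - mean(comp))^2) / sum(centered^2)
}
```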
Of course, what constitutes a “large” loss of variance is debatable. A large percentage of lost variance may be harmless (and even beneficial) if it removes uninteresting noise. A small percentage may be damaging if it removes a rare population. Nonetheless, this diagnostic provides a quick method for flagging problematic batches.
If one is concerned about any loss of variation due to orthogonalization, we suggest comparing the merged analysis to separate analyses of the individual batches. Loss of important variation should manifest as, e.g., co-clustering of very different cells in the merged analysis.
Orthogonalization only applies to the batches currently being merged, e.g., \(A\) and \(B\). If another batch \(C\) is to be merged later, there will still be non-zero variation within \(C\) along the \(A \to B\) correction vector. This will not be removed if the correction vector for \(C \to \{A, B\}\) is in a different direction. Thus, we would introduce an artificial difference in the variance structure among the different batches.
In practice, this is unlikely to be a major problem as the variation along the correction vectors should be small anyway. Nonetheless, we make sure to remove variation along the correction vector at each step for all batches, i.e., “reorthogonalization”. This has the advantage of being able to diagnose large losses of variance due to non-orthogonality, which would otherwise be missed if the correction vector at each step happened to be orthogonal to the batches being merged at that step.
A simple 2D example of the above effect is a situation where cell populations are arranged as the corners of a square. Two adjacent corners form one batch, while each of the other two corners forms its own separate batch. Depending on the merge order, one could merge the two separate corners first, and then merge the result onto the largest batch. This is a clear case of non-orthogonality, because the first correction vector is parallel to the biological variation in the largest batch. If we did not reorthogonalize, this non-orthogonality would not be diagnosed and we would be misled in the resulting analysis.
The same argument applies for hierarchical applications of `fastMNN()`. Say we have a current call of `fastMNN()` containing multiple inputs, where each input contains the output of a previous `fastMNN()` call. The function will collect the correction vectors used for each merge step in the previous calls, and then reapply them to all inputs in the current call. This ensures that the (removal of) variation is consistent across inputs prior to further correction.
For each cell \(i\) in the target batch, we identify the `k` nearest neighbouring paired cells, i.e., cells in the same batch that are involved in an MNN pair. The correction vector for cell \(i\) is defined as a locally weighted average of the correction vectors of these neighbouring paired cells. The weighting uses a tricube scheme, where the bandwidth is defined as `ndist=3` times the median distance to the `k` neighbours. This favours neighbours that are closer to \(i\) and provides some robustness against cells in different subpopulations (e.g., if the subpopulation to which \(i\) belongs is small).
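The following sketch mirrors this description for a single cell, assuming its PC coordinates `cell`, the coordinates `paired.pos` of paired cells in the same batch (cells as rows), and their per-cell correction vectors `paired.vec`. All names are hypothetical.

```r
# A sketch of the tricube-weighted correction for one cell.
smoothedCorrection <- function(cell, paired.pos, paired.vec,
                               k = 20, ndist = 3) {
    # Distances from cell i to all paired cells in the same batch.
    d <- sqrt(colSums((t(paired.pos) - cell)^2))
    nn <- order(d)[seq_len(min(k, length(d)))]   # k nearest paired cells

    # Tricube weights, with bandwidth 'ndist' times the median
    # distance to the k neighbours.
    bandwidth <- ndist * median(d[nn])
    w <- (1 - pmin(d[nn] / bandwidth, 1)^3)^3

    # Locally weighted average of the neighbours' correction vectors.
    colSums(paired.vec[nn, , drop = FALSE] * w) / sum(w)
}
```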
Cells in the target batch are then batch-corrected by subtracting the correction vector from their coordinates in the PC space. The newly corrected cells are merged with the reference batch, and the entire process is repeated with a new target batch. Note, though, that the PCA step is only performed once at the start.
We average the biological components across all batches for each gene and select genes with high average biological components. This statistic is responsive to batch-specific HVGs while still preserving some information about the within-batch ranking of genes. In particular, for genes with consistent biological components across batches, the average will simply converge to the true value, which is appealing.
The rationale for this approach is even stronger if we were to take all genes with positive biological components. Consider a gene for which the null hypothesis is true, i.e., there is no biological variability such that the total variance is equal to the technical component determined by the mean-variance trend. The estimate of the variance will fluctuate around the true value, meaning that this gene will have a positive biological component ~50% of the time for a single batch - this errs on the side of retaining more genes to capture more signal. Notably, this remains true even after taking the average (provided that the null hypothesis is true for all batches), such that we can generally expect the number of retained genes and the signal-noise trade-off to be constant regardless of the number of batches.
For comparison, let us consider another approach where we take the union of the features selected from each batch without computing the average. This leads to an increasing number of genes being retained as more batches are added, which defeats the purpose of feature selection. We could mitigate this effect by adjusting the number of genes retained per batch, but this is relatively unintuitive and unpredictable, as the final number of genes depends on how many genes in the per-batch HVG sets are shared. Performing an intersection has the opposite effect, and is obviously detrimental if the batches have different factors of variation.
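As a sketch of the averaging strategy, assume per-batch variance decompositions with a `bio` column (e.g., as produced by scran's variance-modelling functions) stored in a list `decomps`, one entry per batch:

```r
# Average the per-gene biological components across batches, assuming
# each entry of 'decomps' has one row per gene in the same order.
bio.by.batch <- sapply(decomps, function(x) x$bio)  # genes x batches
ave.bio <- rowMeans(bio.by.batch)

# Retain all genes with positive average biological components.
chosen <- rownames(decomps[[1]])[ave.bio > 0]
```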
Using a single core, the `fastMNN()` function completes in several minutes when merging the 68K PBMC droplet dataset with the 4K T cell dataset. Most of the time is taken up by computing the cross-product for the initial PCA, which can be (but is not yet) easily parallelized. Memory usage is minimal through the use of the DelayedArray framework, which avoids explicitly creating the merged matrix for the PCA. The nearest-neighbour search is performed using the BiocNeighbors package, which provides a speed boost over conventional KD-trees for high-dimensional data (see https://github.com/LTLA/OkNN2018) and also supports parallelization.
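For reference, a usage sketch is shown below, assuming the `fastMNN()` interface exported by the batchelor Bioconductor package and two log-expression matrices sharing the same genes in rows (the matrix names are hypothetical):

```r
# Merging two batches with fastMNN(), assuming the batchelor package.
library(batchelor)
out <- fastMNN(logcounts1, logcounts2, k = 20, d = 50)

# Per-cell corrected low-dimensional coordinates for downstream use.
corrected <- reducedDim(out, "corrected")
```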