1 Variable batch effects

The batch effect can vary in direction across subpopulations without violating the assumption of orthogonality to the biological subspace. In such cases, the orthogonalization step performed by fastMNN() does not resolve the kissing problem for subpopulations whose batch vectors differ from the average batch vector. This results in incomplete mixing of batches within each cluster, which is usually harmless but not aesthetically pleasing. The solution is to increase k, ideally to the anticipated average size of each cluster.
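The effect can be illustrated with a small numerical sketch. This is not the fastMNN() implementation: the two batch vectors, the three axes and the helper names are hypothetical, chosen so that both vectors are orthogonal to the biology axis but point in slightly different directions.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def remove_along(v, u):
    """Remove the component of v that lies along the unit vector u."""
    dot = sum(a * b for a, b in zip(v, u))
    return [a - dot * b for a, b in zip(v, u)]

# Hypothetical batch vectors for two subpopulations.
# Axes: (biology, batch direction 1, batch direction 2).
# Both have a zero biology component, so orthogonality to biology holds.
batch_vec_1 = [0.0, 1.0, 0.3]
batch_vec_2 = [0.0, 1.0, -0.3]

# Average batch vector across the subpopulations, scaled to unit length.
average = [(a + b) / 2 for a, b in zip(batch_vec_1, batch_vec_2)]
u = unit(average)

# What remains of each batch vector after removing the average direction.
residual_1 = remove_along(batch_vec_1, u)
residual_2 = remove_along(batch_vec_2, u)
print(residual_1, residual_2)
```

Removing the average direction leaves a non-zero residual along the second batch axis for both subpopulations, which is the kind of leftover offset that shows up as incomplete mixing within a cluster.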

2 Scatter in shared space

Consider the following scenario involving two batches A and B:

<-----> Biology
                        BBBBBBBB
^              b b b b  BBBBBBBB  b b b b
|                       BBBBBBBB
| Batch
|              AAAAAAA
v     a a a a  AAAAAAA  a a a a a
               AAAAAAA

Within each batch, the population can be considered to be distributed with most cells in the center and some cells in the tails. The question is whether B should be merged with A or with a (and conversely, whether A should be merged with B or b). This leads directly to two interpretations: either B is merged with A, or B is merged with a.

This is problematic as there is no clearly correct way to execute the merge. fastMNN() favors the second interpretation, as the first would require merging along the biological subspace. This is unpalatable because it permits unintended removal of biological heterogeneity in other contexts. Nonetheless, the solution is again to increase k if deeper mixing is required.
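A toy version of the diagram shows which cells end up paired when no movement along the biology axis is allowed for. This is only a sketch of mutual nearest neighbour matching with one neighbour, not the fastMNN() code; the coordinates, labels and function names are hypothetical.

```python
import math

def knn(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(point, candidates[j]))
    return set(order[:k])

def mnn_pairs(first, second, k):
    """Pairs (i, j) where first[i] and second[j] are in each other's k nearest neighbours."""
    pairs = []
    for i, p in enumerate(first):
        for j in knn(p, second, k):
            if i in knn(second[j], first, k):
                pairs.append((i, j))
    return pairs

# (label, biology coordinate); upper case is the dense center, lower case the tails.
batch1 = [("a", -4.0), ("a", -2.0), ("A", 0.0), ("A", 1.0), ("A", 2.0),
          ("A", 3.0), ("a", 5.0), ("a", 7.0), ("a", 9.0)]
batch2 = [("b", 0.0), ("b", 2.0), ("b", 4.0), ("B", 6.0), ("B", 7.0),
          ("B", 8.0), ("B", 9.0), ("b", 11.0), ("b", 13.0)]

# The two batches sit at different heights on the batch axis.
coords1 = [(x, 0.0) for _, x in batch1]
coords2 = [(x, 5.0) for _, x in batch2]

# Collect the label combinations that occur among the mutual pairs.
label_pairs = {(batch1[i][0], batch2[j][0])
               for i, j in mnn_pairs(coords1, coords2, k=1)}
print(sorted(label_pairs))
```

In this layout the center of one batch is only ever paired with the tail of the other, so the resulting correction runs along the batch axis and leaves the offset along the biology axis untouched.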

3 Behaviour in the absence of a batch effect

An interesting consequence of the orthogonalization step is that fastMNN() may not work correctly in the absence of a batch effect. In such cases, the batch vector will be of near-zero length and will point in some random direction. If this direction is parallel to the biological subspace, orthogonalization may end up removing genuine biology. This is a natural side-effect of the orthogonality assumption, which obviously fails if there is no batch vector in the first place. In practice, this is unlikely to be a major problem, as a random vector is still likely to be close to orthogonal to any one biological dimension. If this is not the case, we should be able to observe a large loss of variance that indicates that fastMNN() should not be run.
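The claim about random vectors can be checked with a short simulation. The sketch below assumes that "random direction" means an isotropic Gaussian vector and uses the first coordinate axis as a stand-in for one biological dimension; the function name, the number of trials and the dimensions tried are illustrative choices.

```python
import math
import random

def mean_abs_cosine(dim, trials=2000, seed=0):
    """Average |cosine| between a random direction and the first axis."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        # The cosine with the first axis is the first coordinate over the norm.
        total += abs(v[0]) / norm
    return total / trials

low_dim = mean_abs_cosine(2)
high_dim = mean_abs_cosine(50)
print(low_dim, high_dim)
```

The average absolute cosine shrinks roughly as one over the square root of the number of dimensions, so in a space with tens of dimensions a random direction overlaps only slightly with any single axis.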

fastMNN() can also be instructed to skip the correction if the relative batch effect size is below some threshold. The relative size is defined as the ratio of the L2 norm of the average correction vector to the expected L2 norm of the per-pair vectors. This is small if there is no batch effect as the per-pair vectors will point in different directions. If large losses of variance at particular merge steps are suspected to be caused by the lack of a batch effect, we recommend examining the relative effect sizes and picking a threshold that allows one to skip those steps.
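The ratio described above can be sketched as follows. The sketch assumes that the "expected L2 norm" is the plain mean of the per-pair norms, which may differ from what fastMNN() actually computes; the function names, example vectors and threshold value are hypothetical.

```python
import math

def l2(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def relative_batch_size(pair_vectors):
    """Norm of the average vector divided by the mean per-pair norm."""
    n = len(pair_vectors)
    dim = len(pair_vectors[0])
    average = [sum(v[d] for v in pair_vectors) / n for d in range(dim)]
    expected_norm = sum(l2(v) for v in pair_vectors) / n
    if expected_norm == 0:
        return 0.0  # all vectors are zero, so there is nothing to correct
    return l2(average) / expected_norm

# Per-pair vectors that agree in direction: a consistent batch effect.
consistent = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
# Per-pair vectors that point in different directions and cancel out.
scattered = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

threshold = 0.1  # hypothetical cut-off, to be tuned per dataset
for name, vectors in [("consistent", consistent), ("scattered", scattered)]:
    size = relative_batch_size(vectors)
    print(name, size, "skip" if size < threshold else "correct")
```

Vectors that share a direction give a ratio close to 1, while vectors that cancel each other give a ratio close to 0, which is the situation in which a merge step would be skipped.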
