CSB PhD Candidate: Miriam Shiffman
Faculty Advisors: Tamara Broderick & Aviv Regev
Committee members: Stefanie Jegelka, Allon Klein, Ashia Wilson
Date: Tuesday, January 16, 2024
Time: 12:00 PM EST
Thesis title: Uncertainty & robustness for single-cell studies: Could dropping a few cells change the takeaways from differential expression?
Abstract: The advent of new technologies capable of measuring molecular profiles at single-cell granularity, across thousands or millions of cells, offers unprecedented insight into the form, function, and circuitry of biological systems. At the same time, these technologies present particular statistical and computational challenges, including noise, sparsity, technical and biological variability, and multilevel sampling regimes. To distill relevant signal from biological phenomena, then, analyses must combine information in a careful and coherent way across cells. In light of these complexities, it is prudent that single-cell analyses incorporate notions of uncertainty and robustness to guide their interpretation and inform future decision making.
This thesis makes two main advances in facilitating coherent, actionable quantification of uncertainty and robustness for single-cell studies. First, we provide a framework for generalizability of differential expression analysis that, unlike common statistical tools (significance, power, standard error), does not rely on the assumption that the sample in hand is drawn independently from the same distribution as future samples. Instead, we posit an alternative (complementary) lens on generalizability: could dropping a very small fraction of cells meaningfully alter the basic conclusions of differential expression? We develop an accurate and efficient approximation to estimate this dropping-data robustness metric for the key outcomes of differential expression, for both independent-observation and pseudobulk analyses. Broadening these gene-level results to a high-level, biologically meaningful summary, we overcome the inherently non-differentiable and combinatorial nature of gene set enrichment analysis to provide an additional approximation for the dropping-data robustness of top gene sets. Applied to public single-cell RNA-seq data of healthy and diseased cells, our metric identifies widespread nonrobustness across genes that extends to high-level nonrobustness of top gene sets: half of the top 10 gene sets can be disrupted by dropping fewer than 1% of cells, and a few can be disrupted by dropping a single cell.
The second part of this thesis provides a full Bayesian framework for reconstructing probabilistic trees of cellular differentiation from single-cell RNA-seq profiles. Most notably, motivated by the biology of differentiation and confronted with a lack of existing hierarchical models, we develop a new family of probabilistic trees where data is generated continuously along branches (and latent cell state evolves smoothly over the tree).
I close by reflecting on common themes relevant to uncertainty and robustness for single-cell studies, including the interplay between the continuous and the discrete, the challenge of summarization, the importance of cyclical model criticism, and a possible way forward through differentiable and probabilistic programming.