CSB PhD Candidate: Kayla McCue
Research Advisor: Prof. Chris Burge
Date: Tuesday, December 19, 2023
Time: 9:00-10:00 AM
Room: 68-121
Title: Interpretable Computational Modeling of pre-mRNA Splicing for Multiple Eukaryotic Species
Abstract: One of the key steps in eukaryotic gene expression is pre-mRNA splicing, whereby intronic sequences are excised from immature pre-mRNA transcripts, and the remaining exonic sequences are joined together. This process is catalyzed by the spliceosome, a large complex of proteins and RNAs. A variety of RNA sequence features influence this process, including the core splice site (SS) motifs and splicing regulatory elements (SREs), which recruit protein splicing factors. Together these RNA elements and factors form an intricately interconnected regulatory system that remains incompletely understood. In this thesis, I describe SMsplice, an interpretable computational model of splicing that seeks to improve the understanding of how sequence elements influence the splicing pattern of pre-mRNA transcripts in a variety of eukaryotic organisms. SMsplice incorporates three key aspects of the splicing process: scores of potential SS motifs, scores of SS-proximal hexamers representing SREs, and structural preferences of the spliceosome for particular exon and intron lengths. We iteratively learn the SRE scores within this framework and assess performance by comparing the predicted splicing pattern of a transcript to a canonical pattern to calculate the F1 score, the harmonic mean of precision and recall. Our best-performing SRE scores yield F1 scores of 70% in human, 73% in mouse, 86% in both zebrafish and Drosophila melanogaster, 83% in silkworm moth, and 85% in Arabidopsis thaliana. Applying SMsplice to multiple organisms enables a variety of evolutionary analyses. Comparing the relative contributions of the SS scores, SRE scores, and the structural preferences revealed an increased dependence on SREs in lineages with longer introns, particularly mammals. Exonic regulatory information flanking real versus decoy SS was on average more discriminative than intronic regulatory information for all metazoans studied.
In Arabidopsis, intronic and exonic SREs played comparable roles, suggesting a greater role for intronic information in plants compared with animals. Motifs generated from the hexamers with the strongest SRE scores recapitulated known splicing regulator binding sites in multiple organisms, and a majority of the human motifs were significantly associated with splicing quantitative trait loci, including novel as well as known motifs. Furthermore, many of these motifs are common to all of the organisms tested, suggesting that aspects of splicing regulation are deeply conserved. This notion was further supported by the observation that SRE scores learned for one organism generally performed well when used within the SMsplice model for another organism. A notable exception was that SRE scores learned in mammals performed fairly well in non-mammals, but not vice versa, which may reflect the evolution of mammalian-specific splicing regulation alongside the lengthening of introns. This thesis demonstrates the utility of interpretable models of splicing, which allow for comparative analyses of features between organisms.
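
For readers unfamiliar with the F1 score mentioned in the abstract, the following is a minimal illustrative sketch in Python. It assumes that a splicing pattern can be represented as a set of splice-site positions and that a predicted site counts as correct only if it matches a canonical site exactly. The abstract does not specify how sites are counted, so the function name `f1_score` and the example positions are hypothetical and are not taken from SMsplice.

```python
def f1_score(predicted, canonical):
    """Harmonic mean of precision and recall over two sets of positions.

    Assumption: a splicing pattern is a set of splice-site coordinates,
    and only exact position matches count as true positives.
    """
    predicted, canonical = set(predicted), set(canonical)
    true_pos = len(predicted & canonical)
    if true_pos == 0:
        # No shared sites (or an empty set): define F1 as 0.
        return 0.0
    precision = true_pos / len(predicted)  # fraction of predictions that are correct
    recall = true_pos / len(canonical)     # fraction of canonical sites recovered
    return 2 * precision * recall / (precision + recall)


# Made-up positions for illustration only.
predicted_sites = {120, 340, 560, 910}
canonical_sites = {120, 340, 700, 910, 1200}
score = f1_score(predicted_sites, canonical_sites)
print(f"F1 = {score:.3f}")
```

In this toy case three of four predicted sites are correct (precision 0.75) and three of five canonical sites are recovered (recall 0.60), so the harmonic mean falls between the two values and is pulled toward the lower one.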