CSB Thesis Defense Samuel Goldman (Coley Lab)


Friday, Jan 12, 2024


11:00 am to 12:00 pm

Event Description: 

CSB PhD Candidate: Samuel Goldman

Faculty Advisor: Prof. Connor Coley (Chem E)

Date: January 12th, 2024

Time: 11:00 AM - 12:00 PM Eastern Time

Location: MIT Stata Center Room 32-155

Thesis title: Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra

Abstract: Small molecule metabolites mediate a myriad of biological and environmental phenomena across host-microbiome interactions, plant chemistry, cancer biology, and various other processes. Mass spectrometry is often used as an analytical technique to investigate the small molecules present in a sample, measuring both their masses and fragmentation spectra. However, the complexity and high dimensionality of spectral data make it difficult to identify unknown metabolites and their roles, with a large majority of detected metabolites remaining unidentified in public data.

This thesis proposes a suite of new computational methodologies for higher-accuracy annotation of small molecule metabolites from mass spectrometry data that integrate chemistry-informed priors with modern deep learning advancements. I begin by decomposing the metabolite annotation pipeline into four key tasks well suited to supervised deep learning. To address these tasks, I first introduce a transformer neural network to predict molecular property fingerprints from spectra by changing the tandem mass spectrum input basis from scalar mass values to plausible molecular formula annotations. This method is then extended to an energy-based model formulation to predict the molecular formula of an unknown molecule from its tandem mass spectrum. Following these initial efforts to learn better representations of fragmentation spectra, I develop new neural networks capable of generating fragmentation spectra from small molecules through two-step autoregressive modeling. I show how this can be accomplished by generating either molecular formula peaks or molecular fragment peaks. Altogether, this work outlines a path forward to a fully “neuralized” pipeline for the high-throughput identification of small molecule metabolites and their functions.