Date: Wednesday, May 7, 2025
Time: 1:00–2:00 PM
Room: E15-359
CSB Ph.D. Candidate: Vikram Sundar
Supervisor: Kevin M. Esvelt (Associate Professor of Media Arts and Sciences, NEC Career Development Professor of Computer and Communications, Media Lab)
TDC Members: Amy Keating, Jeff Gore, Sergey Ovchinnikov, Debora Marks (external)
Title: Engineering TEV Protease Specificity: An Exploration of Machine Learning and High-Throughput Experimentation for Protein Design
Abstract: Engineering sequence-specific proteases would enable a wide variety of therapeutic applications in diseases ranging from cancer to Parkinson’s disease. However, most previous experimental and physics-based attempts at protease engineering have failed to achieve specificity for alternative substrates, rendering the resulting proteases unusable in practice. In this thesis, we aim to engineer TEV (tobacco etch virus) protease, a highly sequence-specific protease, to cleave alternative substrates. We combine novel high-throughput assays with machine learning (ML) methods to enable effective protein engineering. The first portion of this thesis focuses on generating fitness landscapes from high-throughput experiments. Most machine learning models do not account for experimental noise, which harms model performance and changes model rankings in benchmarking studies. Here we develop FLIGHTED, a Bayesian method that accounts for this uncertainty by generating probabilistic fitness landscapes from noisy high-throughput experiments. We demonstrate how FLIGHTED can improve model performance on two categories of experiments: single-step selection assays, such as phage display, and a novel high-throughput assay called DHARMA that ties activity to base editing. FLIGHTED produces robust, well-calibrated fitness landscapes, and when combined with DHARMA, our methods allow us to generate fitness landscapes spanning millions of variants. We then evaluate how to model protein fitness given a dataset of millions of variants. Accounting for noise via FLIGHTED significantly improves model performance, especially for high-performing models. Data size, not model scale, is the most important factor in improving model performance, and the choice of top-model architecture matters more than the choice of protein language model embedding. The best way to generate data at sufficient scale is via error-prone PCR libraries; models trained on these landscapes achieve high accuracy. Using these methods, we successfully engineer both activity on an alternative substrate and specificity relative to the wild-type substrate. The ML-designed variants outperform anything found in the training set, demonstrating the value of machine learning even with experimental libraries of millions of variants. However, our results are limited to substrates relatively close to the wild-type; how best to improve model performance on more distant substrates remains an open question.