Using Deep Learning to Predict Protein B-Factors

Overview

In this project, I configured, modified, tested, and trained a range of machine learning and deep learning models to tackle an ongoing problem in bioinformatics: predicting the positional variance of a protein's amino acids from its sequence alone. Our final model used the transformer architecture and pre-trained embeddings to achieve a Pearson correlation coefficient of 0.78, matching state-of-the-art approaches.

Background

Google's AlphaFold has been one of the most impactful advances in bioinformatics in recent years. Using a complex ensemble of models, it can take a sequence of amino acids and predict the structure of the corresponding protein with remarkably high accuracy.

The model outputs a static representation of the protein, with each amino acid placed at its 'mean' position. In reality, however, every part of the structure fluctuates around that position over time:

AlphaFold Prediction

One measure of this variance is the 'B-Factor', which can be obtained from X-ray crystallography. Low B-Factors indicate well-ordered, stable regions of the structure, while high B-Factors indicate mobile, disordered regions.
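
For reference, the B-Factor reported for each atom in a crystal structure is directly related to the atom's mean square displacement about its average position:

    B = 8\pi^{2} \langle u^{2} \rangle

where ⟨u²⟩ is the mean square displacement in Å², so larger positional fluctuations translate directly into larger B-Factors.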

The goal, much like with AlphaFold, is to train a deep learning model on existing experimental data so that it can predict these metrics directly from sequence. Such computational tools have become essential aids to drug design and protein engineering, as well as to understanding mechanisms within cellular biology.

Technical Implementation

While less experimental data is available for this metric, we were able to gather many accurate measurements of protein B-Factor from the Protein Data Bank (PDB). Using a filtering procedure similar to that of a recent research paper, our final training dataset contained 60,000 protein sequences spanning a wide variety of functions and biological origins.
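
As an illustration of how per-residue B-Factors can be read out of PDB files, here is a minimal sketch using Biopython. It is not our exact pipeline, and the file path is just a placeholder, but it shows the core idea of collecting the C-alpha B-Factor for each residue:

    from Bio.PDB import PDBParser

    def ca_bfactors(pdb_path):
        """Return (residue names, B-Factors) for the C-alpha atoms of the first model in a PDB file."""
        parser = PDBParser(QUIET=True)
        structure = parser.get_structure("protein", pdb_path)
        model = next(structure.get_models())          # first model in the file
        names, bfactors = [], []
        for chain in model:
            for residue in chain:
                if "CA" in residue:                   # skip waters/ligands without a C-alpha atom
                    names.append(residue.get_resname())
                    bfactors.append(residue["CA"].get_bfactor())
        return names, bfactors

    # Hypothetical usage:
    # names, b = ca_bfactors("1abc.pdb")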

My role in this project was to systematically test a range of architecture choices on a subset of our data in order to inform the design of our final model. In total, I tested over 40 variations, including:

  • Linear models (linear regression, ridge regression, lasso regression; see the sketch after this list)
  • Recurrent Neural Networks & LSTMs (number of layers, optimizers)
  • Transformers (head size, learning rate)
  • Variants of B-Factor Metrics
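
To give a sense of what the simpler baselines looked like, here is a rough sketch of a windowed ridge-regression model of the kind listed above. The window size, feature encoding, and regularization strength are illustrative placeholders rather than our exact configuration:

    import numpy as np
    from sklearn.linear_model import Ridge

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def window_features(sequence, window=7):
        """One-hot encode a sliding window of residues around each position (zero-padded at the ends)."""
        half = window // 2
        rows = []
        for i in range(len(sequence)):
            vec = np.zeros(window * len(AMINO_ACIDS))
            for offset in range(-half, half + 1):
                j = i + offset
                if 0 <= j < len(sequence) and sequence[j] in AA_INDEX:
                    vec[(offset + half) * len(AMINO_ACIDS) + AA_INDEX[sequence[j]]] = 1.0
            rows.append(vec)
        return np.stack(rows)

    def fit_ridge_baseline(sequences, bfactors, alpha=1.0):
        """Fit ridge regression on all residues pooled across proteins.

        sequences: list of amino-acid strings; bfactors: list of matching per-residue arrays.
        """
        X = np.concatenate([window_features(s) for s in sequences])
        y = np.concatenate(bfactors)
        return Ridge(alpha=alpha).fit(X, y)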

The transformer model performed best in my initial testing, though the results were still quite weak:

Initial model testing results
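
For concreteness, the early transformer variants I tested were roughly along the lines of the following PyTorch sketch: a learned embedding per amino acid plus a learned positional embedding, a small transformer encoder, and a per-position linear regression head. The layer counts and sizes here are placeholders, not our final hyperparameters:

    import torch
    import torch.nn as nn

    class BFactorTransformer(nn.Module):
        """Per-residue B-Factor regressor: token + position embeddings -> transformer encoder -> linear head."""
        def __init__(self, vocab_size=21, d_model=128, nhead=4, num_layers=2, max_len=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
            self.pos = nn.Embedding(max_len, d_model)
            encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.head = nn.Linear(d_model, 1)

        def forward(self, tokens, padding_mask=None):
            # tokens: (batch, seq_len) integer-encoded amino acids, with 0 reserved for padding
            positions = torch.arange(tokens.size(1), device=tokens.device)
            x = self.embed(tokens) + self.pos(positions)
            x = self.encoder(x, src_key_padding_mask=padding_mask)
            return self.head(x).squeeze(-1)           # (batch, seq_len) predicted B-Factors

    # Hypothetical usage: predictions for a batch of 2 sequences of length 50
    # model = BFactorTransformer()
    # preds = model(torch.randint(1, 21, (2, 50)))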

I then modified the architecture to use ProBERT embeddings, produced by a pre-trained BERT model that provides high-dimensional embedding representations for each amino acid. This dramatically improved the accuracy of every model:

Improvement with ProBERT embeddings
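
For illustration, per-residue embeddings of this kind can be extracted with the Hugging Face transformers library. The sketch below assumes the publicly available Rostlab/prot_bert checkpoint (ProtBert); the exact pre-trained model we used may differ in detail:

    import re
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertModel.from_pretrained("Rostlab/prot_bert")
    model.eval()

    def residue_embeddings(sequence):
        """Return one embedding vector per amino acid in `sequence` (a plain string such as 'MKTAYI...')."""
        # ProtBert expects space-separated residues, with rare amino acids mapped to X
        spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
        inputs = tokenizer(spaced, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state   # (1, len(sequence) + 2, hidden_size)
        # Drop the [CLS] and [SEP] special tokens so rows align one-to-one with residues
        return hidden[0, 1:-1]

    # Hypothetical usage:
    # emb = residue_embeddings("MKTAYIAKQR")   # one embedding vector per residue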

Our full-scale model incorporated this design choice, achieving a Pearson correlation coefficient of 0.78 after training on the 60,000 protein sequences:

Final model results
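
As a note on evaluation, the Pearson correlation coefficient can be aggregated per protein or over all residues pooled together. A minimal SciPy sketch, assuming predictions and true B-Factors are available as parallel lists of per-residue arrays (the function and variable names here are just for illustration):

    import numpy as np
    from scipy.stats import pearsonr

    def per_protein_pcc(predictions, targets):
        """Pearson correlation for each protein, given parallel lists of per-residue arrays."""
        return [pearsonr(pred, true)[0] for pred, true in zip(predictions, targets)]

    def summarize(predictions, targets):
        """Report both the mean per-protein correlation and the correlation over all residues pooled."""
        per_protein = per_protein_pcc(predictions, targets)
        pooled = pearsonr(np.concatenate(predictions), np.concatenate(targets))[0]
        return {"mean_per_protein_pcc": float(np.mean(per_protein)), "pooled_pcc": float(pooled)}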

We also trained a full-scale LSTM model, which confirmed our hypothesis that the transformer architecture would outperform the LSTM architecture (the state of the art at the time):

LSTM vs Transformer comparison
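
For comparison, the LSTM counterpart is structurally similar; a rough PyTorch sketch follows, with illustrative sizes (the input could equally be the pre-trained embeddings described above rather than learned ones):

    import torch
    import torch.nn as nn

    class BFactorLSTM(nn.Module):
        """Per-residue B-Factor regressor: token embeddings -> bidirectional LSTM -> linear head."""
        def __init__(self, vocab_size=21, d_embed=128, d_hidden=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_embed, padding_idx=0)
            self.lstm = nn.LSTM(d_embed, d_hidden, num_layers=num_layers,
                                batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * d_hidden, 1)

        def forward(self, tokens):
            # tokens: (batch, seq_len) integer-encoded amino acids, with 0 reserved for padding
            x = self.embed(tokens)
            x, _ = self.lstm(x)
            return self.head(x).squeeze(-1)           # (batch, seq_len) predicted B-Factors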

We found that our final model's accuracy varied considerably depending on which protein it was predicting. For some proteins it made nearly perfect predictions, while for others its predictions were far off. This indicates that, while the model is capable of precise predictions, its overall accuracy is dragged down by outliers:

Outlier comparison
Outlier comparison
Outlier comparison

Special thanks to my partners, Marco Carbullido and Rafael Djamous, for their help with this project, and to Professor Jihun Hamm and Professor Ramgopal Mettu for their valuable feedback.