The Era of Machine Learning for Protein Design, Summarized in Four Key Methods
Protein design and engineering are essential goals in molecular biology, with wide-ranging applications in medicine, biotechnology, and materials science. For decades, scientists have explored various approaches to designing novel proteins and engineering existing ones to fine-tune their properties. While physics-based approaches have had some success in finding amino acid sequences that fold into a given protein structure, recent deep learning methods have shown much higher success rates and versatility. In this article, I will outline four notable machine learning (ML) tools for protein design and engineering and their significance in advancing the field.
Beyond the immediate impact of these tools on the chemical and biological sciences, the methods they introduce, and even the projects themselves, offer exciting opportunities for data scientists, machine learning practitioners, and AI researchers to come up with new ideas and ways to collaborate with chemists and biologists, applying computer science for good. Indeed, the tools I discuss below demonstrate the power of applying different kinds of deep learning algorithms to a particularly complex challenge in biotechnology. By leveraging these tools, professionals in data science, machine learning, and artificial intelligence can also contribute to advances in medicine, biotechnology, and materials science, and witness the direct impact of their expertise outside their own field!
In a nutshell, I will present the tools called ProteinMPNN, ESM-InverseFold, RoseTTAFold Diffusion, and MaSIF-seed, in order of release. Importantly, all these models rose to prominence after DeepMind disrupted the structural biology field with its AlphaFold model:
Over a year of AlphaFold 2 free to use and of the revolution it triggered in biology
ProteinMPNN
ProteinMPNN, developed by the Baker lab, is the first ML tool for protein design published together with experimentally tested designed proteins.
This model is based on an encoder-decoder neural network and generates protein sequences that were experimentally verified to fold as intended. Two papers published in Science in late 2022, "Robust deep learning–based protein sequence design using ProteinMPNN" and "Hallucinating symmetric protein assemblies," present the methodology (former paper) and the applicability of the tool to various protein design problems (latter paper).
I dedicated a whole blog post to ProteinMPNN, and this tool is already somewhat "old" (despite being published less than a year ago, which shows how fast the field evolves!). So I won't cover it further here; you can check out my previous article:
New deep-learned tool designs novel proteins with high accuracy
ESM-InverseFold
Developed by Meta, ESM-InverseFold is based on the ESMFold protein language model, but engineered to generate protein sequences from structures rather than to predict structures from sequences.
The design tool was found to produce highly diverse protein sequences, well outside the known universe of natural sequences. The preprint "Language models generalize beyond natural proteins" describes its core functioning and presents several examples of successful designs.
To learn more about ESMFold, check out my previous post:
How Huge Protein Language Models Could Disrupt Structural Biology
And here's the preprint of the protein design tool engineered from it, "ESM-InverseFold":
ESM-InverseFold is a protein design tool that uses machine learning to generate de novo proteins that have never been seen in nature. The tool is based on language models that have been trained on millions of diverse natural proteins across evolution using masked language modeling. These models generate motifs that link sequence to the design of the structure and can apply them in new sequence and structural contexts. ESM-InverseFold offers two generative protein design tasks: fixed backbone design and free generation. Fixed backbone design involves generating protein sequences by taking low temperature samples from the conditional distribution specified by the language model via Markov chain Monte Carlo with simulated annealing. Free generation removes the constraint on structure entirely and generates new proteins by sampling from the joint distribution of sequence and structure specified by the language model. ESM-InverseFold has shown high experimental success rates, producing a soluble and monomeric species by size exclusion chromatography in 67% of the evaluated proteins. As the authors show, the language model used in the tool is able to access a design space beyond that of natural proteins, generating novel solutions based on deep patterns of protein design, including structural motifs found in natural proteins.
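To make the idea of "Markov chain Monte Carlo with simulated annealing" more concrete, here is a minimal sketch of how such a sampler walks through sequence space. It is not the authors' implementation: the function names, the cooling schedule, and above all `toy_energy` are my own assumptions. In the real tool the energy would come from the language model's likelihood for a sequence given the backbone; here it is replaced by a trivial mismatch count so the example runs on its own.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy(seq, target):
    """Hypothetical stand-in for a language-model score (lower is better):
    simply the number of positions that differ from a hidden target."""
    return sum(a != b for a, b in zip(seq, target))

def anneal_sequence(energy_fn, length, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    """Metropolis sampling over sequences with a temperature that cools over time."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    energy = energy_fn(seq)
    best_seq, best_energy = list(seq), energy
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a single-residue mutation.
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        new_energy = energy_fn(seq)
        delta = new_energy - energy
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            energy = new_energy
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
        else:
            seq[pos] = old  # reject: undo the mutation
    return "".join(best_seq), best_energy

target = "MKTAYIAKQR"  # only used by the toy energy, not by the sampler itself
designed, energy = anneal_sequence(lambda s: toy_energy(s, target), len(target))
print(designed, energy)
```

At high temperature the sampler accepts many unfavorable mutations and explores broadly; as the temperature drops it behaves more and more like a greedy search, which is the "low temperature samples" regime mentioned above.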
RoseTTAFold Diffusion
RoseTTAFold Diffusion, based on diffusion models, is the latest tool from the Baker lab, also available as a preprint on bioRxiv:
Broadly applicable and accurate protein design by integrating structure prediction networks and…
According to the Baker lab's blog, this is currently the top-performing method from the Rosetta suite for protein design:
RoseTTAFold Diffusion is a generative model based on a denoising diffusion probabilistic model that uses deep learning to generate diverse, complex, and functional proteins from simple molecular specifications. It fine-tunes the RoseTTAFold structure prediction network on protein structure denoising tasks to obtain a generative model of protein backbones. During training, structures sampled from the Protein Data Bank are corrupted by simulating the noising process for a random number of steps, and the network learns to recover the original structure. At generation time, the method builds new protein structures step by step, transforming the noised coordinates from the previous step into a predicted structure, conditioned on inputs to the model, which can include partial sequence, fold information, or fixed functional motif coordinates. The method was trained using two different strategies: 1) in a manner akin to "canonical" diffusion models, with predictions at each timestep independent of predictions at previous timesteps, and 2) with self-conditioning, where the model can condition on previous predictions between timesteps. RoseTTAFold Diffusion can generate protein structures either without additional input or by conditioning on various inputs, and it can generate diverse protein structures with little overall structural similarity to any known protein structure. The method outperforms other deep learning methods for protein structure generation and has shown state-of-the-art performance across a broad set of design challenges, including protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding, and symmetric motif scaffolding for therapeutic and metal-binding protein design.
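For readers coming from the ML side, the noising and denoising loop described above can be sketched in a few lines. This is a generic, heavily simplified diffusion example and not RoseTTAFold Diffusion itself: it treats a structure as a flat list of numbers, uses an assumed linear noise schedule and a deterministic (DDIM-style) reverse step, and `denoiser` is a hypothetical callable standing in for the fine-tuned network. The `prev_x0` argument illustrates self-conditioning, where the previous prediction is fed back into the next step.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule; index 0 means no noise."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump from clean coordinates to timestep t in closed form."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

def sample(denoiser, x_t, alpha_bars, self_condition=True):
    """Reverse process: repeatedly predict the clean structure and step to t - 1.
    `denoiser(x_t, t, prev_x0)` is a hypothetical network returning predicted x0."""
    prev_x0 = None
    for t in range(len(alpha_bars) - 1, 0, -1):
        x0_pred = denoiser(x_t, t, prev_x0 if self_condition else None)
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        # Noise implied by the current prediction.
        eps = [(x - math.sqrt(a_t) * p) / math.sqrt(1.0 - a_t)
               for x, p in zip(x_t, x0_pred)]
        # Re-noise the prediction to the previous (less noisy) timestep.
        x_t = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
               for p, e in zip(x0_pred, eps)]
        prev_x0 = x0_pred  # kept for self-conditioning at the next step
    return x_t

rng = random.Random(0)
coords = [0.0, 1.5, -2.0, 3.8, 0.2, 1.1]  # toy "structure", flattened coordinates
alpha_bars = make_alpha_bars(50)

# Training-style corruption: noise a known structure for a random number of steps.
t = rng.randint(1, 50)
noisy = add_noise(coords, t, alpha_bars, rng)

# Generation with a perfect "oracle" denoiser, just to exercise the loop.
oracle = lambda x_t, t, prev_x0: coords
recovered = sample(oracle, add_noise(coords, 50, alpha_bars, rng), alpha_bars)
```

With the oracle denoiser the loop lands back on the original coordinates, which is only a sanity check; in practice the network's predictions are imperfect and conditioning inputs (motifs, partial sequence, fold information) steer what gets generated.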
MaSIF-seed
MaSIF-seed, joint work by Michael Bronstein's lab and the Correia lab for protein design at my institution (EPFL Extension School), published in Nature this month, specializes in designing protein interactions via learned protein surface fingerprints:
De novo design of protein interactions with learned surface fingerprints – Nature
This tool has demonstrated impressive performance in designing protein monomers and oligomers, including target-binding proteins and folds unseen in nature. It builds on the groups' own previous work, MaSIF, an ML tool that predicts interactions from surface features.
Taking a surface-centric approach compared to the other methods, MaSIF-seed focuses on the surface properties of proteins and the interactions between surface patches. Its neural network outputs vector fingerprint descriptors that are complementary between patches of interacting protein pairs and dissimilar between non-interacting pairs. The matched surface patches are aligned to the target site and scored with a second neural network, which outputs an interface post-alignment score to further improve the discrimination performance of the surface descriptors. MaSIF-seed has shown superior performance in discriminating true binders from decoys on the basis of rich surface features, compared to other tools. In addition, it is presumably faster and more accurate than other methods.
The paper presenting the method describes several examples where this tool was used to design de novo protein binders to engage challenging and disease-relevant protein targets. The full protein design pipeline using MaSIF-seed involves several steps: identifying target sites on the protein with a high propensity to be engaged by protein binders; searching a subset of a database of surface fingerprints derived from fragments to find binding seeds that could target the selected site; and transplanting these seeds onto protein scaffolds that are compatible with the binding modes of the seed, using specialized Rosetta protocols. Finally, the binder interface is optimized, and in practical applications the designs are screened experimentally to fine-tune the final sequences via mutagenesis libraries.
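The fingerprint matching step can be pictured as a nearest-neighbor search over descriptor vectors. The sketch below is only an illustration under my own assumptions: complementarity is expressed as a small Euclidean distance between fingerprints, the three-dimensional vectors, patch names, and the `max_distance` cutoff are all made up, and the second-stage neural network that rescores aligned patches is left out entirely.

```python
import math

def fingerprint_distance(a, b):
    """Euclidean distance between two fingerprint vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_seed_patches(target_fp, seed_fps, max_distance=1.0, top_k=3):
    """Return up to top_k seed patches whose fingerprints lie closest to the target's,
    discarding anything beyond the distance cutoff."""
    hits = []
    for name, fp in seed_fps.items():
        d = fingerprint_distance(target_fp, fp)
        if d <= max_distance:
            hits.append((d, name))
    hits.sort()  # smallest distance first
    return [(name, d) for d, name in hits[:top_k]]

# Hypothetical fingerprints for a target site and a tiny seed database.
target_site = [0.1, 0.9, -0.3]
seed_database = {
    "helix_A": [0.12, 0.85, -0.25],
    "strand_B": [-0.8, 0.1, 0.9],
    "helix_C": [0.3, 0.7, -0.1],
}

matches = rank_seed_patches(target_site, seed_database)
for name, distance in matches:
    print(f"{name}: {distance:.3f}")
```

In this toy database, helix_A and helix_C pass the cutoff while strand_B is discarded. In a real pipeline, the surviving candidates would then be aligned to the target site and rescored before any scaffold grafting.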
Designing protein sequences to fold and work as scientists need
In all four tools, the input to the model is a backbone structure, possibly with certain amino acid identities restrained, onto which the models craft the protein sequence that is expected to fold as intended. While these models can generate sequences of interacting proteins, they cannot natively consider non-protein molecules in the design process. This limitation hampers their application to designs involving binding to non-protein molecules unless the user manually fixes certain residues based on the desired function. Although somewhat inefficient because it requires knowledge of the system of interest, such a strategy has already worked in the design of an enzyme by the Baker lab, earlier in 2023:
Just like in that example, the development of these tools has opened up exciting possibilities for designing novel proteins and engineering existing ones. These tools are particularly useful in the development of therapeutics, materials science, and biotechnology, where the properties of proteins can be finely tuned to specific needs. The ability to generate protein sequences experimentally verified to fold as intended has enormous implications for the development of new treatments and therapies, particularly for complex diseases. See for example this special kind of vaccine-like preparation consisting of a mixture of protein epitopes designed on the computer, at the moment with more traditional physics-only tools.
Furthermore, these tools have the potential to significantly reduce the time and resources required for protein design and engineering, making it a more accessible field of research. They are also much easier to deploy and run, which again helps to democratize their use. Indeed, see how easily you can adapt regular ESMFold to analyze realistic protein designs that could come, for example, from ProteinMPNN running on HuggingFace, right inside your web browser:
A web app to design stable proteins via the consensus method, created with JavaScript, ESMFold…
To conclude, we can claim without hesitation that, after the hype around protein structure prediction with AlphaFold, we are now in the hype wave for protein design, with new methods showing up every month or so on average. Of these, I presented here the four that I consider most relevant at the moment, mainly because they have all been tested experimentally.
These new models for protein design are showing impressive results, and will without doubt be an essential part of protein biotechnology labs and companies in the near future. While limitations still exist, the potential applications of these tools are enormous, and they are expected to have significant implications for medicine, biotechnology, and materials science in the years to come.
Related articles
For an overview of how computer modeling, simulations, and artificial intelligence impact protein engineering, check this out:
How computer modeling, simulations, and artificial intelligence impact protein engineering in…
In this other article I explore why the problem of protein design/engineering is so difficult, even when targeting a single residue:
Paper summary: Why is it so difficult to predict the changes in stability that result when a…
You may also be interested in my article about balancing quality and quantity in machine learning for science, where I touch specifically on points related to ML models for protein design:
"ML-Everything"? Balancing Quantity and Quality in Machine Learning Methods for Science
www.lucianoabriata.com I write and photoshoot about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc.