Thursday, December 15, 2022

How artificial intelligence can help us annotate protein names

 

A conversation with research scientist Andreea Gane

At UniProt we are very interested in engaging with the machine learning community on the computational annotation of the ever-increasing volume of biological sequence data being collected. Computational annotations allow us to explore these data at scale and help us understand the complexity of the biological world.

 

We are working with the group of Lucy Colwell and Maxwell Bileschi at Google Research, who developed ProtNLM (Protein Natural Language Model), a natural language processing model that predicts protein names directly from a protein’s amino acid sequence. ProtNLM is currently annotating millions of entries that were previously named only 'Uncharacterized protein'.1

 

Today we are talking with Andreea Gane, a research scientist at Google Research and a driving force behind the ProtNLM project.

 

Language is a complex system of communication, and because it is adopted universally, it is readily open to scrutiny by the research community. Taking on the challenge of annotating millions of “uncharacterized proteins”, about which we know very little, might seem to many a rather daunting task. Could you tell us something about your background and how you got interested in such a challenge?

 

I got excited about proteins when I joined Google in 2018. Around that time, the application of NLP techniques to challenging problems in biological sequence design started to gain popularity and I thought it would be an exciting area of research. I had the opportunity to work on an algorithm that uses variational autoencoders to generate new biological sequences, and despite not having a background in biology, I thought my expertise in generative models would make me a good fit. I enjoyed the project and have worked in this research area ever since.2, 3

Regarding the ProtNLM project in particular: given the limited amount of labelled data for protein design tasks, I became interested in also modelling and leveraging the vast number of naturally occurring proteins. The UniProt database is the go-to resource for proteins and the metadata about them. What drew me to the ProtNLM project was the large amount of high-quality protein names already available and the huge potential impact of labelling uncharacterized proteins. Machine learning methods are great at leveraging patterns in vast amounts of data, so having a dataset like this made me excited about taking on the challenge.

 

Was there something in the development of machine learning methods that made you think this was a good moment to take on this challenge?

 

We were inspired by the huge success of transformer models in natural language processing and by the successful application of some of these techniques to modelling tasks in the protein space. We were particularly inspired by the natural language processing paradigm of formulating problems as text-to-text tasks that can be solved jointly by training a single sequence-to-sequence model.4,5,6 So we first set out to train models that take a domain's amino acid sequence as input and return various properties, all encoded as text. We trained models on Pfam to produce the one-line description corresponding to the Pfam family, in addition to the family accession label and the unique alphanumeric ID. Motivated by the model's ability to produce descriptions in this task, we set out to predict protein names for full amino acid sequences.
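
To make the text-to-text framing concrete, here is a minimal Python sketch of how one Pfam domain entry could be expanded into several (input, target) text pairs. The record fields are hypothetical illustrations of the paradigm, not the actual training pipeline:

```python
from collections import namedtuple

# Hypothetical Pfam-like record; the field names are illustrative only.
Family = namedtuple("Family", ["description", "accession", "identifier"])

def make_text_to_text_examples(domain_sequence, family):
    """Yield (input_text, target_text) pairs for one Pfam domain entry."""
    for target in (family.description,   # e.g. "Protein kinase domain"
                   family.accession,     # e.g. "PF00069"
                   family.identifier):   # e.g. "Pkinase"
        yield domain_sequence, target

pkinase = Family("Protein kinase domain", "PF00069", "Pkinase")
for source, target in make_text_to_text_examples("MSTNPKPQRK", pkinase):
    print(source, "->", target)
```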

 

Let’s talk about the model. How would you describe ProtNLM?

The ProtNLM model is a sequence-to-sequence model, similar to a machine translation model in that it takes as input a sequence in one language and returns a sequence in another language. The simplest version of ProtNLM treats the amino acid sequence as the source-language sentence and produces the protein name as the output.
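
As an illustration only, a generic encoder-decoder model can be queried in exactly this translation-like way. The checkpoint name below is a placeholder, not the released ProtNLM model (which is accessible through the Colabs discussed later):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint: any sequence-to-sequence model trained to map
# amino acid strings to names would be queried the same way.
tokenizer = AutoTokenizer.from_pretrained("example-org/protein-name-model")
model = AutoModelForSeq2SeqLM.from_pretrained("example-org/protein-name-model")

sequence = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"  # the "source sentence" is amino acids
inputs = tokenizer(sequence, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # predicted name
```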

 

Machine learning approaches require lots of data. What sort of data did you use for the ProtNLM training?

We train our models on sequence–name pairs from the UniProtKB database, using both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. To reduce the influence of low-quality names, we performed a name-processing step in which we removed or normalised names that we thought were unsuitable for use as training targets.
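
As a toy illustration of that kind of filtering (not UniProt's actual pipeline, and with a made-up pattern list), one could drop placeholder-style names and normalise whitespace before using a name as a training target:

```python
import re

# Hypothetical examples of patterns one might treat as uninformative names.
UNINFORMATIVE = re.compile(r"(uncharacterized|hypothetical|predicted)\s+protein",
                           re.IGNORECASE)

def clean_name(name):
    """Return a normalised name, or None if it should be excluded from training."""
    name = " ".join(name.split())   # collapse stray whitespace
    if UNINFORMATIVE.search(name):
        return None                 # unsuitable as a training target
    return name

print(clean_name("Capsid  protein"))         # -> "Capsid protein"
print(clean_name("Uncharacterized protein")) # -> None
```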

 

As of UniProt release 2022_05, ProtNLM has been updated and now also uses a Deep Ensemble, which combines the predictions of many independently trained models. Could you tell us a bit more about this change? Is the model using the same type of data to create these predictions?

 

Since our previous release (UniProt release 2022_04), we have also trained models that take in both the protein's amino acid sequence and the name of the organism in which the protein was found. The organism is typically known even for uncharacterized proteins, and it can be informative about plausible names. For instance, capsid proteins are typically associated with viruses, so using the organism name as an additional input during training and prediction can increase the likelihood of producing the name “Capsid protein” for a virus protein and decrease the chance of wrongly assigning this name to proteins that occur in other organisms. In our experiments we found that using a Deep Ensemble combining both types of models (models that use only the amino acid sequence and models that also take in the organism information) improved performance. We have renamed proteins for which the new prediction had a higher model score than our previous approach, and we have also used the new approach for new uncharacterized proteins.
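
Schematically, such an ensemble might be combined as in the sketch below. The `predict` methods and their signatures are hypothetical; each model is assumed to return candidate (name, score) pairs, and the ensemble keeps the name with the highest combined score:

```python
from collections import defaultdict

def ensemble_predict(sequence, organism, sequence_models, organism_models):
    """Combine sequence-only and sequence+organism models (hypothetical APIs)."""
    totals = defaultdict(float)
    for model in sequence_models:
        for name, score in model.predict(sequence):
            totals[name] += score
    for model in organism_models:
        for name, score in model.predict(sequence, organism):
            totals[name] += score
    n_models = len(sequence_models) + len(organism_models)
    # Average the accumulated per-model scores and keep the best-supported name.
    best_name, best_total = max(totals.items(), key=lambda kv: kv[1])
    return best_name, best_total / n_models
```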



 

Deep learning models are very powerful tools, but it is often challenging for users to understand how much trust they should place in these annotations. As part of the ProtNLM release, your team has also prepared two accompanying Colabs. Links to these free Jupyter notebook environments are available from our Help page. These Colabs enable users to further explore the model using simple Python code. How can users take advantage of these resources?

 

We provide two Colabs: one for querying the model and one for viewing some of the evidence we have extracted about the accuracy of its predictions.

The first Colab allows you to query one of the models in the ensemble with a specific protein. It provides information beyond what can be found on the UniProt page. The first piece of additional information is the model score: an estimate of the likelihood the model assigns to the predicted name given the input protein, with a number closer to one indicating higher confidence. Second, the Colab provides the top 10 predictions rather than only the top prediction, and agreement among these top predictions can indicate a stronger signal.
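
Mechanically, a model score and a top-10 list can be obtained from a beam search. Here is a hedged sketch in the same illustrative setup as above (a placeholder checkpoint, not the actual Colab code): each returned beam carries a log score that can be exponentiated into a rough confidence between zero and one:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("example-org/protein-name-model")  # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("example-org/protein-name-model")

inputs = tokenizer("MSTNPKPQRKTKRNTNRRPQDVKFPGG", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,   # top 10 candidate names, best first
    output_scores=True,
    return_dict_in_generate=True,
)
for ids, log_score in zip(outputs.sequences, outputs.sequences_scores):
    name = tokenizer.decode(ids, skip_special_tokens=True)
    print(f"{torch.exp(log_score).item():.3f}  {name}")  # closer to 1 = more confident
```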

The second Colab is for viewing evidence. While we evaluate the model carefully, machine learning methods can produce predictions that look plausible but are incorrect, so in parallel with developing models to predict names, we are also developing methods to assess the accuracy of a particular prediction. The Colab provides a way for users to start gathering evidence to support a prediction. Of course, investigating which information is correct is a challenging task in itself, especially since we are dealing with natural language outputs. As a starting point, given that the predictions are short strings, we search for the string on the UniProt page corresponding to the protein and on the UniProt pages of related proteins.
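
In the spirit of that starting point, here is a minimal sketch that uses the public UniProt REST API to check whether a predicted name string appears anywhere in an entry's record. The accession and name below are only examples, and the actual Colab does more, including looking at related proteins:

```python
import requests

def name_mentioned_in_entry(accession, predicted_name):
    """Crude evidence check: does the predicted name occur in the entry's JSON?"""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return predicted_name.lower() in response.text.lower()

# Example accession and name, for illustration only.
print(name_mentioned_in_entry("P12345", "Aspartate aminotransferase"))
```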

 

What is the most challenging part of this project?

 

I would say that by far the biggest challenge is that a protein can have multiple names that are all correct, which makes it difficult to evaluate the model when the prediction differs from the name found in UniProt. We have to do a lot of manual curation to corroborate predictions with human review, and when there is little information about a protein, it is hard even for a human expert to know whether the model made an accurate prediction, or indeed a discovery.

 

If you could go back in time knowing what you know now, what would you have done differently?

 

We now have a much better understanding of the dataset than when we started. For instance, we have a clearer idea of which names are preferable in terms of meaningfulness, and we have a deeper understanding of the various sources of the existing UniProt names. This has allowed us to develop stricter selection criteria for the names we add to the training set for future models.

We're also continuing to develop strategies that use existing information sources to automatically verify the accuracy of a name where possible, and we anticipate that using this information to provide feedback to the model could help improve performance.

 

A question about the future. These deep learning approaches appear to improve quickly and take on more and more complex tasks. What do you think is your next challenge for this project?

 

First, there is still considerable work to be done on the naming problem, and we are excited about continuing to improve on that task. Longer term, we are excited about predicting longer functional descriptions and, eventually, about applying these methods to the inverse problem of designing a new protein sequence given a textual description.

 

Thank you very much for your time.

 

Users can access a summary of the ProtNLM methodology on our Help page and explore all ProtNLM predictions here.

 

References:

 

  1. Andreea Gane, Maxwell L. Bileschi, David Dohan, Elena Speretta, Amélie Héliou, Laetitia Meng-Papaxanthos, Hermann Zellner, Eugene Brevdo, Ankur Parikh, Maria J. Martin, Sandra Orchard, UniProt Collaborators and Lucy J. Colwell, ProtNLM: Model-based Natural Language Protein Annotation, preprint
  2. Andreea Gane, David Belanger, David Dohan, Christof Angermueller, Ramya Deshpande, Suhani Vora, Olivier Chapelle, Babak Alipanahi and Lucy Colwell, A Comparison of Generative Models for Sequence Design, Machine Learning in Computational Biology Workshop (2019)
  3. Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell and D. Sculley, Population-Based Black-Box Optimization for Biological Sequence Design, ICML 2020
  4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin, Attention Is All You Need, NIPS 2017
  5. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, J. Mach. Learn. Res. 21(140) (2020): 1-67
  6. Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan and Andrea Gesmundo, Scaling Up Models and Data with t5x and seqio, https://doi.org/10.48550/arXiv.2203.17189