Friday, May 14, 2021

Prioritizing curation – how do we decide which UniProtKB entries to manually annotate?

With over 500,000 proteins already in UniProtKB/Swiss-Prot and many hundreds of new papers being published each week, we are often asked how we prioritise which entries our expert curators manually update. We consider many factors when coming to these decisions. Our focus is determined by biomedical importance, funding, size of community and biological interest; the following list will help to explain our overall strategy.

                1. We prioritise the annotation of human proteins, scanning the literature to find novel, relevant papers from which to extract key data. A study we published in 2017 showed that, for human proteins, more than 50% of the relevant papers available in PubMed each year were captured in UniProtKB/Swiss-Prot entries [1]. We update functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data, and add new sequence variants with functional impact and links to disease. At the same time, we attach Gene Ontology (GO) terms to the proteins; UniProt curators provide a large share of the human GO annotations.

                2. We annotate proteins from closely related model organisms, which enables us to capture details of processes that are currently less well studied in human systems, such as embryonic development. We also cover more distantly related model organisms that are relevant not only to human biology but also to that of pathogens: C. elegans informs us about the biology of parasitic worms, and Drosophila melanogaster about that of malaria vectors and the yellow fever mosquito. We are careful not to duplicate the efforts of the Model Organism Databases.

                3. We curate proteins from a broad range of microbial species, covering fungi, bacteria and viruses, with a focus on key microbial models and pathogens such as Cryptococcus neoformans, Escherichia coli and Bacillus subtilis. We curate complete viral proteomes, selecting representatives of important classes of viruses for manual annotation. It was the expertise of this group that enabled a rapid response to the current pandemic, with the SARS-CoV-2 proteome annotated soon after publication of the first genome sequence. We identified a number of unannotated ORFs that were subsequently proven to play a role in viral infection, and we have continued to update the SARS-CoV-2 proteome (as we do all proteomes) as new discoveries are made. We also focus on microbial metabolism, including natural product synthesis.

                4. We curate proteins of plants, including the dicotyledon Arabidopsis thaliana and monocotyledon Oryza sativa. We also capture data on processes not represented in these model organisms, for example the nitrogen fixation pathway of Medicago truncatula, and the metabolic pathways for plant natural products.

                5. We curate non-model organism proteins when papers describing key processes relevant to human health and disease are published.

The volume of scientific literature grows every year, and we use a range of approaches to find relevant papers, including advanced machine learning approaches for literature prioritization that are trained on our own corpus of curated literature [2]. We also rely on our user community to alert us when an important paper is published, and we have recently established a mechanism by which users can contribute more directly to our annotation effort by adding publications to protein entries [2,3]. We encourage you to visit https://community.uniprot.org/bbsub/bbsub.html? and contribute to our critical work.

 

References

1.

2. Allot A, Lee K, Chen Q, Luo L, Lu Z. LitSuggest: a web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res. 2021. doi:10.1093/nar/gkab326

3. Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, Breuza L, Bridge A, Cunningham F, Xenarios I, Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol. 2018 Aug;14(8):e1006390. doi:10.1371/journal.pcbi.1006390