Monday, September 7, 2020

Association-Rule-Based Annotator (ARBA) in UniProt

UniProt has developed an automatic annotation system that enriches unreviewed TrEMBL entries in the UniProt Knowledgebase (UniProtKB) with automatically predicted annotations. In release 2020_04 of August 2020, a powerful new automated system called ARBA replaced the previous Statistical Automatic Annotation System (SAAS). ARBA is a multiclass learning system trained on expertly annotated entries in UniProtKB/Swiss-Prot. It uses rule mining techniques to generate concise annotation models with the highest representativeness and coverage for annotation, based on the properties of InterPro group membership and taxonomy.





ARBA currently generates around 23,000 models, resulting in annotations for more than 85 million proteins, including 35 million that previously lacked any annotation. As a result, automatic annotation coverage in UniProtKB increased from 35% to 50%. All ARBA rules can be accessed here, and relevant rules are also tagged as evidence for annotations in UniProtKB entries.

ARBA-based evidence for UniProtKB annotation

When an annotation is added to an entry by an ARBA rule, the evidence tag indicates this and links to the rule itself. For example, the protein entry Q4SML2 derives annotation from ARBA rule ARBA00000621.






Browsing ARBA rules

To browse the dataset for rules of interest, click on the dropdown next to the search box on the UniProt website and select ‘ARBA’. Then enter a query and click the search button.


Exploring ARBA rule pages

Conditions are listed on the left-hand side of the rule page and annotations are on the right-hand side. If a condition holds true, then the corresponding annotation is applied.
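
The condition-then-annotation logic described above can be illustrated with a minimal sketch. It is not ARBA's actual implementation or data model: the class names, field names, and all identifiers (`RULE_A`, `SIG_1`, `P_EXAMPLE`, and so on) are hypothetical placeholders, and the sketch assumes that a rule's conditions consist only of a set of required signature memberships plus one required taxon.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for an annotation rule."""
    rule_id: str
    required_signatures: frozenset  # every one of these must be present
    required_taxon: str             # must appear in the protein's lineage
    annotation: str                 # applied only if all conditions hold


@dataclass
class Protein:
    """Hypothetical, simplified stand-in for an unreviewed entry."""
    accession: str
    signatures: set
    lineage: list
    annotations: list = field(default_factory=list)


def conditions_hold(rule, protein):
    """Return True if the protein satisfies every condition of the rule."""
    has_signatures = rule.required_signatures <= protein.signatures
    in_taxon = rule.required_taxon in protein.lineage
    return has_signatures and in_taxon


def apply_rules(rules, protein):
    """Attach each matching rule's annotation, tagged with the rule ID as evidence."""
    for rule in rules:
        if conditions_hold(rule, protein):
            protein.annotations.append((rule.annotation, rule.rule_id))
    return protein.annotations


# Placeholder data, invented purely for illustration.
example_rules = [
    Rule("RULE_A", frozenset({"SIG_1", "SIG_2"}), "Bacteria",
         "Catalytic activity: example reaction"),
    Rule("RULE_B", frozenset({"SIG_3"}), "Eukaryota",
         "Function: example function"),
]
example_protein = Protein(
    accession="P_EXAMPLE",
    signatures={"SIG_1", "SIG_2", "SIG_9"},
    lineage=["cellular organisms", "Bacteria"],
)
example_result = apply_rules(example_rules, example_protein)
```

Here only `RULE_A` matches, so the protein receives a single annotation together with the ID of the rule that produced it, mirroring how a rule is recorded as evidence on an entry.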

ARBA annotation data is recalculated for every UniProt release to ensure that the annotations are accurate and up-to-date.



Tuesday, August 4, 2020

UniProt COVID-19 portal: Supporting research during the pandemic

 

Responding to the urgency of the pandemic, UniProt created and continues to develop a dedicated portal that provides access to the latest pre-release annotations and sequences for proteins related to COVID-19. The portal is released independently of UniProt’s eight-weekly release schedule. It can be accessed via https://covid-19.uniprot.org/ and all sequences can also be downloaded directly from our FTP site ftp://ftp.uniprot.org/pub/databases/uniprot/pre_release/.




An integrated source of sequence, function and links to specialist resources

 

The portal provides annotated SARS-CoV-2 protein sequences, the closest sequences from the 2003 SARS-CoV virus, and human sequences relevant to the biology of viral infection. The SARS-CoV-2 proteome is annotated based on expert curation of literature and the knowledge extracted from the well-studied SARS-CoV virus. Rule-based automatic annotation also allows us to add information from a broader taxonomic range of viruses. Links to structures, drugs, interactions, molecular pathways and many other resources provide integrated information to help understand the biology and investigate routes to treatment.




 

The annotated UniProtKB entries include functional and positional annotations. These records also document microbial infection information and the positions and structures essential for viral infection. Each protein entry provides annotations such as catalytic activity and function, Gene Ontology terms, 3D structures and interactions, external links to resources such as IntAct, ChEMBL, DrugBank and PDBe-KB, and the ProtVista visualisation of positional annotations along the sequence. Within entries, the mature products that result from proteolytic cleavage of precursor proteins can be identified by their UniProt product identifiers.


Contribute and explore literature about COVID-19




The portal provides access to the latest literature on the virus and host proteins through a link to LitCovid and a link to UniProt’s community literature submissions. Users can also contribute relevant publications through the ‘Add a publication’ link present in each entry.

 

Tuesday, April 14, 2020

Scientists at home, UniProt to the rescue!

Many of you who work in the lab have switched to working remotely. Though your daily routine and the continuity of your research might have been impacted, your contribution to knowledge can continue in new ways.
Are you at home itching to contribute to science? UniProt to the rescue!
Improve our resource for the community and receive credit for it.
We have the proteins and you have the expertise. You can now use that expertise by adding publications to protein entries.

What you need:

1. An ORCID iD, your personal researcher ID (used for validation and for credit)
2. A protein of interest
3. A publication with a PubMed ID (PMID) about the protein of interest. You don’t have to be the author of the publication

What to do (Figure 1):

1. Identify the protein of interest in UniProt (note that this also includes proteins from the special UniProt COVID-19 website, which can be found at https://covid-19.uniprot.org/uniprotkb?query=*)
2. Select the “Add a publication” link in the top menu of the entry page
3. Log in with ORCID
4. Fill in the submission form
   a. Enter the PubMed ID (PMID) to retrieve the publication
   b. Confirm that the publication is correct and that it is about the protein of interest
   c. Select the topics the paper covers
   d. Add short statements about protein name, function, disease, or other, as described in the publication
   e. Submit
5. Reply to review questions, if any
6. After review, check your publication on the website in the next release
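
Before starting a submission, it can help to check that the two identifiers you need are well formed. The sketch below is not part of UniProt's submission system and does not reflect how UniProt validates input. It applies the publicly documented ORCID check-digit calculation (ISO 7064 MOD 11-2) and a simple "digits only" test for a PMID. The function names are hypothetical, and the identifier tested at the end is the sample iD used in ORCID's own documentation.

```python
import re

# An ORCID iD is 16 characters in four hyphen-separated groups;
# the final character is a check digit that may be 'X'.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the format and the check digit of an ORCID iD string."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def is_plausible_pmid(pmid):
    """Check that a PMID string is a positive integer without leading zeros.

    This only tests the shape of the identifier; it does not confirm
    that the publication exists in PubMed.
    """
    return pmid.isascii() and pmid.isdigit() and not pmid.startswith("0")


sample_orcid_ok = is_valid_orcid("0000-0002-1825-0097")
```

A check like this only catches typing mistakes. Whether the publication is correct and about the protein of interest is still confirmed by you in the submission form and by the subsequent review.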




Figure 1: From publication to UniProtKB entry.

A sample blank submission form can be found here:

Your publication will be displayed on the publications page of the UniProt entry, under Community (for example, https://www.uniprot.org/uniprot/O58649/publications?query=&fil=Community), with your ORCID as the contributing source for the publication and information.

Publications submitted can be tracked here

Follow the growth of contributions:

Learn more here:


Friday, March 13, 2020

To be or not to be an enzyme: pseudoenzymes in UniProt


Enzymes are essential for many biological processes. Without them, common tasks such as digesting food or replicating DNA would not be possible.
In recent years, partly as a result of the expanding analysis and annotation of complete genomes, it has become apparent that several enzyme families in a wide range of species contain members that look like enzymes but fail to behave like enzymes. For example, in humans, between 5 and 10% of the members of several of these families are such enzyme-like proteins. Whilst these proteins have sequences and 3D structure features similar to active enzymes, they tend to lack essential amino acid residues, such as those involved in catalytic reactions and/or binding substrates, making them incapable of catalysing chemical reactions. Based on these characteristics, scientists decided to call them pseudoenzymes.
Why are genes coding for pseudoenzymes maintained in the genome? It turns out that, despite their lack of enzymatic activity, this group of proteins carries out essential functions in cells. For example, they help assemble signalling cascades by acting as scaffolds, regulate the activity of other enzymes, and ensure that proteins are localized to the right cellular compartment. Consequently, they have become potential targets for the design of therapeutic treatments.
To support the growing interest in pseudoenzyme biology, UniProt recently revisited this important group of proteins. In collaboration with the pseudoenzyme community, we implemented changes to enhance their identification and discoverability. The outcome of this project was published in two articles, in Science Signaling and The FEBS Journal.

Ultimately, this effort will provide the scientific community with a comprehensive resource for pseudoenzymes, which in turn will lead to a better understanding of the evolution of these molecules and their active counterparts and the aetiology of related diseases. It will also support the ongoing quest to target pseudoenzymes for therapeutic treatments and offer some insight into the expanding field of enzyme engineering.