Monday, October 10, 2016

Automatic learning-based annotation in UniProt

Have you ever wondered how data mining and machine learning techniques might help in knowledge curation? Let us introduce you to the Statistical Automatic Annotation System (SAAS) in UniProt!

UniProt has an automatic annotation project that enhances unreviewed TrEMBL entries in the UniProt Knowledgebase (UniProtKB) by enriching them with automatically predicted annotations. SAAS is one of the systems that contribute to this project.

SAAS is an automatic system with quality validation input from curators, such as the exclusion of data types that are not appropriate for propagation. It learns from the properties present in the reviewed UniProtKB (Swiss-Prot) entries and uses the following attribute types to define the learning entries: InterPro protein family, taxonomy and sequence length. This combination allows SAAS to generate rules that annotate protein properties such as function, catalytic activity, pathway membership, subcellular location, protein names and feature predictions.

SAAS-based evidence for UniProtKB annotation
When an annotation is added to an entry by a SAAS rule, the evidence tag indicates this and links to the rule itself.


Browsing SAAS rules
To browse the dataset for rules of interest, click the dropdown next to the search box on the UniProt website and select ‘SAAS’. Then enter a query and click the search button.



Exploring SAAS rule pages
Conditions are listed on the left-hand side of the rule page and annotations on the right-hand side. If a condition holds true, the corresponding annotation is applied.
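
The condition-to-annotation logic can be illustrated with a minimal Python sketch. It is not the SAAS implementation: the class names, field names, identifiers and the assumption that a condition is a conjunction of an InterPro family, a taxon and a sequence length range are all hypothetical, chosen only to mirror the attribute types described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    # Hypothetical, simplified view of an unreviewed entry.
    accession: str
    interpro_ids: frozenset   # InterPro families matched by the sequence
    taxon_lineage: tuple      # taxonomic lineage, root first
    length: int               # sequence length in residues


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule layout: one condition, one or more annotations.
    rule_id: str
    required_interpro: frozenset
    required_taxon: str
    min_length: int
    max_length: int
    annotations: tuple

    def matches(self, protein):
        # Assumption: every part of the condition must hold.
        return (
            self.required_interpro <= protein.interpro_ids
            and self.required_taxon in protein.taxon_lineage
            and self.min_length <= protein.length <= self.max_length
        )


def annotate(protein, rules):
    """Return (annotation, rule_id) pairs for every rule whose condition holds."""
    return [
        (annotation, rule.rule_id)
        for rule in rules
        if rule.matches(protein)
        for annotation in rule.annotations
    ]


# Placeholder identifiers only; none of these are real UniProt or InterPro IDs.
example_rule = Rule(
    rule_id="RULE_EXAMPLE",
    required_interpro=frozenset({"IPR_A"}),
    required_taxon="Bacteria",
    min_length=200,
    max_length=300,
    annotations=("Pathway: example pathway",),
)
example_protein = Protein(
    accession="ENTRY_EXAMPLE",
    interpro_ids=frozenset({"IPR_A", "IPR_B"}),
    taxon_lineage=("Bacteria", "Proteobacteria"),
    length=250,
)
predicted = annotate(example_protein, [example_rule])
```

Each predicted annotation is paired with the identifier of the rule that produced it, in the same spirit as the evidence tag that links an annotation back to its SAAS rule.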


SAAS annotation data is recalculated for every UniProt release to ensure that the annotations are accurate and up-to-date.