Monday, September 7, 2020

Association-Rule-Based Annotator (ARBA) in UniProt

UniProt has developed an automatic annotation system that enriches unreviewed TrEMBL entries in the UniProt Knowledgebase (UniProtKB) with automatically predicted annotations. In release 2020_04 of August 2020, a powerful new automated system called ARBA replaced the previous Statistical Automatic Annotation System (SAAS). ARBA is a multiclass learning system trained on expertly annotated entries in UniProtKB/Swiss-Prot. It uses rule mining techniques to generate concise annotation models with the highest representativeness and coverage, based on InterPro group membership and taxonomy.
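To make the idea of scoring a candidate rule more concrete, here is a minimal sketch in Python. It uses the generic association-rule notions of confidence (how often the annotation is present when the conditions match) and coverage (how many annotated entries the rule reaches). These definitions, the function name `rule_metrics`, and the toy feature labels are assumptions for illustration only; the exact measures ARBA uses are not described in this post.

```python
def rule_metrics(entries, condition, annotation):
    """Score a candidate rule 'condition => annotation' on a training set.

    entries:    list of (features, annotations) pairs, both sets of strings
    condition:  set of features that must all be present (hypothetical labels)
    annotation: the annotation the rule would assign
    Returns (confidence, coverage) as floats in [0, 1].
    """
    # Entries whose features satisfy every condition of the rule
    matched = [e for e in entries if condition <= e[0]]
    # Entries that carry the annotation in the training data
    annotated = [e for e in entries if annotation in e[1]]
    # Entries where the rule fires and the annotation is indeed present
    correct = [e for e in matched if annotation in e[1]]

    confidence = len(correct) / len(matched) if matched else 0.0
    coverage = len(correct) / len(annotated) if annotated else 0.0
    return confidence, coverage


# Toy training set with made-up feature and annotation labels
training = [
    ({"IPR_A", "Bacteria"}, {"kinase"}),
    ({"IPR_A", "Bacteria"}, {"kinase"}),
    ({"IPR_A", "Bacteria"}, set()),
    ({"IPR_B", "Bacteria"}, {"kinase"}),
]

confidence, coverage = rule_metrics(training, {"IPR_A", "Bacteria"}, "kinase")
print(f"confidence={confidence:.2f} coverage={coverage:.2f}")
```

In this toy set the rule fires on three entries and is right for two of them, and it reaches two of the three entries annotated as "kinase". A rule miner would typically keep only candidates that score well on measures like these.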

ARBA currently generates around 23,000 models, resulting in annotations for more than 85 million proteins, including 35 million that previously lacked any annotation. As a result, automatic annotation coverage in UniProtKB increased from 35% to 50%. All ARBA rules can be browsed on the UniProt website, and relevant rules are also tagged as evidence for annotations in UniProtKB entries.

ARBA-based evidence for UniProtKB annotation

When an annotation is added to an entry by an ARBA rule, the evidence tag indicates this and links to the rule itself. For example, the protein entry Q4SML2 derives annotation from ARBA rule ARBA00000621.

Browsing ARBA rules

To browse the rules of interest, click the dropdown next to the search box on the UniProt website and select ‘ARBA’. Then enter a query and click the search button.

Exploring ARBA rule pages

Conditions are listed on the left-hand side of the rule page and annotations are on the right-hand side. If a condition holds true, then the corresponding annotation is applied.
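The "if the conditions hold, apply the annotation" logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual ARBA implementation: the `Protein` and `Rule` classes, their fields, and all identifiers in the example data are hypothetical, and conditions are modelled simply as required InterPro groups and required taxa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    accession: str
    interpro_ids: frozenset  # InterPro groups the protein belongs to
    lineage: frozenset       # taxon names in the protein's lineage


@dataclass(frozen=True)
class Rule:
    rule_id: str
    required_interpro: frozenset  # condition: InterPro group membership
    required_taxa: frozenset      # condition: taxonomy
    annotation: str               # annotation applied when conditions hold

    def matches(self, protein):
        # The rule fires only if every condition is satisfied
        return (self.required_interpro <= protein.interpro_ids
                and self.required_taxa <= protein.lineage)


def annotate(protein, rules):
    """Return (annotation, rule_id) pairs, keeping the rule as evidence."""
    return [(r.annotation, r.rule_id) for r in rules if r.matches(protein)]


# Hypothetical example data, not real UniProt or InterPro identifiers
protein = Protein(
    accession="P_TEST",
    interpro_ids=frozenset({"IPR_A", "IPR_B"}),
    lineage=frozenset({"Eukaryota", "Metazoa"}),
)
rules = [
    Rule("RULE_1", frozenset({"IPR_A"}), frozenset({"Eukaryota"}),
         "hypothetical annotation 1"),
    Rule("RULE_2", frozenset({"IPR_C"}), frozenset(),
         "hypothetical annotation 2"),
]

results = annotate(protein, rules)
print(results)
```

Returning the rule identifier alongside each annotation mirrors how UniProtKB entries cite the ARBA rule as evidence for an annotation.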

ARBA annotation data is recalculated for every UniProt release to ensure that the annotations are accurate and up-to-date.