Inside UniProt: The UniProt Metal Binding Site Machine Learning Challenge

We would like to invite the machine learning community to help UniProt by creating

computational methods to predict metal binding sites across the whole of UniProtKB.

At present around 17% of curated proteins have annotated metal binding site residues,

which our curators have carefully identified from the literature or known structures

from PDB. UniProt identifies the specific amino acid residues that participate in metal

binding sites and also which metal is bound. For example, the Neurospora crassa

metallothionein protein (shown below) contains 7 cysteine residues involved in binding

6 copper ions.

When we look at the uncurated TrEMBL section, which contains the large majority of known

protein sequences, we see that just 3% of proteins have an annotated metal binding site.

These annotations are created by the variety of automated annotation methods currently

in use. The difference in coverage between the reviewed (Swiss-Prot) and unreviewed

(TrEMBL) sections suggests that there are many millions of missing metal binding site annotations

in the 225 million TrEMBL sequences.
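
To give a rough sense of scale, the coverage figures quoted above can be turned into a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that TrEMBL proteins would carry metal binding site annotations at the same rate as Swiss-Prot if they were fully annotated; that assumption may not hold, since the reviewed section is not a random sample of proteins. The function name and constants are illustrative and not part of any UniProt tool.

```python
# Figures quoted in the text, expressed as whole percentages to keep the
# arithmetic exact.
TREMBL_SEQUENCES = 225_000_000   # TrEMBL sequences
SWISSPROT_COVERAGE_PCT = 17      # curated proteins with annotated sites
TREMBL_COVERAGE_PCT = 3          # TrEMBL proteins with annotated sites


def estimate_missing_proteins(total, reference_pct, observed_pct):
    """Estimate how many proteins would gain an annotation if coverage
    rose from observed_pct to reference_pct (an illustrative assumption)."""
    return total * (reference_pct - observed_pct) // 100


missing = estimate_missing_proteins(
    TREMBL_SEQUENCES, SWISSPROT_COVERAGE_PCT, TREMBL_COVERAGE_PCT
)
print(f"Estimated TrEMBL proteins lacking an annotation: {missing:,}")
```

Note that this counts proteins rather than individual binding site residues, so the number of missing residue-level annotations would be larger under the same assumption.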

We would like to invite interested researchers to take part in a challenge to create new

methods to rapidly predict metal binding site annotations that can be deployed by UniProt

as part of its automatic annotation pipeline. These methods could be completely based on

sequence data or perhaps incorporate information from known and/or predicted structures.

Although we don’t want to prejudge which methodology may work, we are particularly keen

that methods be both accurate and fast enough to scale. All data and software must be

open and not under restrictive licensing terms.

If you would like to take part in this initiative, please register your interest by filling out

this Google form by 18th February 2022. We will then hold a planning meeting with

the participants to discuss timelines and evaluation of the methods.

Thursday, February 3, 2022