Capturing the Diversity of Life - Reorganizing the Protein Space in UniProtKB
Advances in genome sequencing technology mean that large-scale efforts such as
the Earth BioGenome Project and the Darwin Tree of Life are aiming to produce
high-quality reference genomes for individual species, to capture the
biodiversity of our planet. Each of these genomes will translate to a complete
set of proteins for that species. To enable UniProt to present this wealth of
data to our users, we are making significant improvements to our data content.
Upcoming changes in our selection of Reference Proteomes
In release 2025_04 (currently scheduled for 27th August 2025), we will deploy a
new Reference Proteome selection pipeline to improve the representation of
species biodiversity in the UniProt Knowledgebase (UniProtKB). From release
2026_01 onwards (currently scheduled for 25th February 2026), we will restrict
the protein space in UniProtKB to three groups of sequences: those that are
part of a Reference Proteome, those in the expert-reviewed UniProtKB/Swiss-Prot
section, and unreviewed entries associated with experimental Gene Ontology
annotations or additional biologically important data such as a 3D structure.
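The retention rule described above can be read as a simple "any of these conditions" check. The following is only an illustrative sketch: the `Entry` class, its field names and the example accessions are hypothetical and do not correspond to any UniProt data model or API, and the 3D-structure flag stands in for the broader category of "additional biologically important data".

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical, simplified view of a UniProtKB entry (illustration only)."""
    accession: str
    reviewed: bool = False               # True for UniProtKB/Swiss-Prot entries
    in_reference_proteome: bool = False  # sequence belongs to a Reference Proteome
    has_experimental_go: bool = False    # experimental Gene Ontology annotation
    has_3d_structure: bool = False       # one example of "biologically important data"


def is_retained(entry: Entry) -> bool:
    """Return True if the entry meets at least one of the stated criteria."""
    return (
        entry.in_reference_proteome
        or entry.reviewed
        or entry.has_experimental_go
        or entry.has_3d_structure
    )


# Made-up example entries, not real accessions.
entries = [
    Entry("EX0001", in_reference_proteome=True),
    Entry("EX0002", has_3d_structure=True),
    Entry("EX0003"),  # unreviewed, non-reference, no extra data
]
kept = [e.accession for e in entries if is_retained(e)]
print(kept)
```

In this sketch, an entry that fails every condition would no longer be part of UniProtKB from release 2026_01.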
What is a Reference Proteome?
A proteome is the set of all translated proteins from a
genome assembly. For each species, we generally use an automatic pipeline to select
one proteome as the Reference Proteome - the proteome that we believe is the
best representative of the proteins encoded by that species (for example, the
Reference Proteome for Drosophila simulans is UP000000304 as it provides the
best coverage of the protein space for the species). In addition, proteomes of
well-studied model organisms and other proteomes of interest for biomedical and
biotechnological research may be selected as Reference Proteomes by UniProt
curators.
Is every sequenced proteome currently in UniProtKB?
No. While we currently include many proteomes that are not
Reference Proteomes in UniProtKB, we already exclude many others, for example
those which are of poor quality or where the species is already
over-represented in UniProtKB.
Why are we reorganizing the protein space in UniProtKB?
Submissions of genomes to hubs such as the International
Nucleotide Sequence Database Collaboration (INSDC) have grown due to the rapid
increase in sequencing capabilities, resulting in a large influx of proteomes
into UniProtKB. Our new pipeline will provide a much better representation of
the biodiversity of life and coverage of the sequence space, and will improve
the user experience when searching for and selecting the best proteome for a
research project. It will also allow us to improve our functional annotation of
the proteins in these proteomes, providing an enhanced understanding of these
species.
Figure 1 - Growth of UniProtKB entries throughout the years
What is going to happen over the next few months?
In release 2025_04, we will deploy the new pipeline that
will improve our selection of Reference Proteomes for cellular organisms (more
details below), and we will align our viral Reference Proteomes to the set of
exemplar genomes from the International Committee on Taxonomy of Viruses (ICTV).
Additionally, we will start the process of removing proteins from non-Reference
Proteomes from UniProtKB. In the first stage (2025_04), this includes proteins
from taxonomically unclassified organisms. In release 2026_01, we will remove
the remaining proteins from non-Reference Proteomes from UniProtKB.
How does the new pipeline work?
The new pipeline has been designed to select the proteome, or a small number of
proteomes, that best represents the protein space of each species, using a
clustering system based on MMseqs2 [1]. It only analyzes high-quality proteomes
from species with a recognized taxonomy and a formal scientific name.
Problematic proteomes, such as those from contaminated genomes, are excluded
from the analysis.
MMseqs2 is used to cluster proteins first, then proteomes:
1. Protein clusters: first, it clusters similar proteins from different
proteomes of the same species. The more proteins a proteome has in different
clusters, the more likely it is to be selected as a reference.
2. Proteome clusters: second, it calculates how similar the proteomes of the
same species are, based on the number of protein clusters shared between any
two proteomes. Proteomes are clustered together if they share 50% or more
protein clusters.
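To make the second step more concrete, here is a minimal sketch of how proteomes of one species could be grouped once each protein has already been assigned to a protein cluster (for example by MMseqs2). It is not the production pipeline. Several details are assumptions made for illustration: the shared fraction is measured relative to the smaller of the two proteomes, groups are formed by single linkage, and the representative of each group is simply the proteome covering the most distinct protein clusters. The proteome names and cluster IDs are made up.

```python
from __future__ import annotations

from itertools import combinations


def shared_fraction(a: set[int], b: set[int]) -> float:
    """Fraction of protein clusters shared by two proteomes.

    Assumption: the denominator is the smaller of the two cluster sets.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_proteomes(
    clusters_by_proteome: dict[str, set[int]], threshold: float = 0.5
) -> list[set[str]]:
    """Group proteomes whose shared fraction meets the threshold (single linkage)."""
    parent = {name: name for name in clusters_by_proteome}

    def find(name: str) -> str:
        # Follow parent links to the group root, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for p, q in combinations(clusters_by_proteome, 2):
        if shared_fraction(clusters_by_proteome[p], clusters_by_proteome[q]) >= threshold:
            parent[find(p)] = find(q)

    groups: dict[str, set[str]] = {}
    for name in clusters_by_proteome:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def pick_representatives(
    clusters_by_proteome: dict[str, set[int]], groups: list[set[str]]
) -> list[str]:
    """Pick, per group, the proteome covering the most distinct protein clusters."""
    return sorted(
        max(group, key=lambda name: (len(clusters_by_proteome[name]), name))
        for group in groups
    )


# Made-up input: proteome name -> IDs of the protein clusters it contributes to.
example = {
    "proteome_A": {1, 2, 3, 4, 5},
    "proteome_B": {4, 5, 6, 7},
    "proteome_C": {8, 9, 10},
}
groups = cluster_proteomes(example)
representatives = pick_representatives(example, groups)
print(groups, representatives)
```

Here proteome_A and proteome_B share two of proteome_B's four protein clusters (50%), so they fall into the same group, while proteome_C shares nothing and forms its own group.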
Figure 2 - Diagram of the Reference Proteome selection pipeline
This new pipeline will select at least one Reference Proteome for every
sequenced species for inclusion in UniProtKB, ensuring a broad representation
of the Tree of Life. It may select more than one Reference Proteome for species
with high genomic diversity. To keep the set of Reference Proteomes stable, but
not static, between UniProt releases, the pipeline preferentially selects the
most complete proteome (as determined by BUSCO [2]) and will only replace an
existing Reference Proteome when a new one of significantly higher quality is
published.
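The "stable, but not static" behaviour can be pictured as a replacement rule with a margin. The sketch below is an assumption-laden illustration, not the actual selection logic: it uses a single BUSCO completeness percentage as the only quality measure, and the `min_gain` margin of 5 percentage points is a hypothetical value chosen for the example, since no threshold for "significantly higher quality" is stated.

```python
def choose_reference(
    current_busco: float, candidate_busco: float, min_gain: float = 5.0
) -> str:
    """Decide whether a newly published proteome should replace the current reference.

    Both scores are BUSCO completeness percentages (0-100). The candidate only
    wins if it improves on the current reference by at least `min_gain`
    percentage points; `min_gain` is a hypothetical margin for illustration.
    """
    if candidate_busco - current_busco >= min_gain:
        return "candidate"
    return "current"


# A marginal improvement keeps the existing Reference Proteome in place...
small_gain = choose_reference(current_busco=95.0, candidate_busco=96.0)
# ...while a large improvement triggers a replacement.
large_gain = choose_reference(current_busco=80.0, candidate_busco=95.0)
print(small_gain, large_gain)
```

A margin of this kind avoids swapping Reference Proteomes between releases for negligible gains, which is the stability the text describes.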
How big a change will this make to UniProtKB?
The changes in UniProt between releases 2025_04 and 2026_01 will alter both the
content and the size of the database. The number of Reference Proteomes will
increase by 36% (reflecting a 34% increase in species covered), while the
number of proteins in UniProtKB could decrease by 43%.
What do I do if the proteome I am working on is no longer in UniProtKB?
All proteomes that are not selected as Reference Proteomes by the new pipeline
will be available through UniParc. When you search for such proteomes in the
Proteomes portal, you will be directed to UniParc to access your protein set.
The UniParc FASTA header format for proteomes will be improved to show the
protein and gene names, and the database identifiers, of the underlying genome
records from EMBL, Ensembl or RefSeq.
If the annotation provided by UniProtKB is particularly important to your work,
or your organism is actively studied by a research community but has not been
selected as a Reference Proteome, please contact us and we will consider
promoting it to Reference Proteome status.
The list of proteomes that will be deprecated by release 2026_01 is available
on our FTP site.
Proteome help page: https://www.uniprot.org/help/proteome
MMseqs2 GitHub: https://github.com/soedinglab/MMseqs2