Tuesday, November 18, 2025

Germ Warfare: Arsenal of Antimicrobial Resistance Proteins in UniProt

Antimicrobial resistance (AMR) contributes to almost 5 million deaths annually, worldwide (source: WHO). To mark World AMR Awareness Week (18th-24th November), we are highlighting how UniProt supports global AMR research efforts.

The language of germ warfare may be emotive, but fighting AMR is a bit like an arms race: every time we develop new antibiotics, microbes acquire ways to resist them. Only a few years after the discovery of penicillin, a natural beta-lactam drug, bacteria were isolated that could inactivate this antibiotic because they made penicillin-degrading enzymes, called beta-lactamases. Chemists then developed further beta-lactam antibiotics, including a class known as carbapenems, to deal with the problem of AMR. Carbapenems were effective for some time, as many beta-lactamases, e.g. GES-1, could not degrade this new class of drugs. However, beta-lactamases acquired modifications, sometimes of just a single amino acid, that confer carbapenem resistance, e.g. in GES-5. And so the fight continues.

Another important class of proteins involved in AMR confers resistance not by degrading antibiotics, but instead by blocking antibiotic binding through modification of the bacterial cell wall. Examples include an enzyme called Mcr-2 which catalyzes the addition of a phosphoethanolamine moiety to lipid A in the cell wall, thereby conferring resistance to colistins, an antibiotic class which is often a treatment of last resort.

Alliances supporting the fight against AMR

UniProt has a long history of curating microbial proteins, including AMR targets; recently, we have focused on curation of AMR-associated proteins with high clinical priority. We initially concentrated on beta-lactamases belonging to clinically important subfamilies and then diversified to proteins involved in resistance to antibiotics which are often considered last-resort choices, such as vancomycin and the colistins. More recently, curation targets included proteins indirectly involved in AMR, such as those associated with biofilm formation, e.g. BrlR, because biofilms act as a permeability barrier and the bacteria inside are protected from antibiotics.

A wide range of experimental data is added to each entry including information related to the function of the protein as shown in the section from the BrlR record below.

One of the great strengths of UniProt is its role as a central data hub, linking resources and encouraging alliances. We link to general microbial biological resources such as EnsemblBacteria and EcoCyc, and have recently added links to a key resource directly related to AMR: the Comprehensive Antibiotic Resistance Database (CARD), as exemplified in the section from the MdtF record below. These links connect users to information in specialised resources, allowing easy access to other collections.

Community feedback is enormously helpful, so please help us in the ongoing battle against AMR by suggesting annotation priorities using our contact form - which AMR-related proteins should we be curating next?

Investigate our links to microbiological databases:

See all UniProtKB entries with links to CARD
See all UniProtKB entries with links to EcoCyc
See all UniProtKB entries with links to EnsemblBacteria

Monday, September 15, 2025

Using UniProtKB to navigate large and complex structures

With advances in structural biology, protein structures are becoming larger and more complex than ever. How do we navigate these complex structures?

PDB:8F2U captures the human COMMD–CCDC22–CCDC93 (CCC) complex, which is part of the Commander complex that plays a vital role in sorting transmembrane cargo for endosomal recycling. The CCC complex has 12 different protein components, but how do we identify the functional role of each protein?

A handy tool is UniProt's advanced search function, where you can find all the protein components using the PDB entry ID (xref:pdb-8F2U). This directs you to the individual entry pages, where you can learn about the biological role of each protein.
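The same search can also be scripted. The following is a minimal sketch, assuming the UniProt REST search endpoint at rest.uniprot.org accepts the same query string as the website; the helper name `pdb_components_url` and the chosen return fields are illustrative, not part of the original post.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB REST search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def pdb_components_url(pdb_id, fields=("accession", "gene_names", "protein_name"), fmt="tsv"):
    """Build a search URL for all UniProtKB entries cross-referenced to a PDB ID.

    Hypothetical helper: it only assembles the URL and does not contact the server.
    """
    params = {
        "query": f"xref:pdb-{pdb_id}",   # same query as in the advanced search
        "fields": ",".join(fields),      # columns to return
        "format": fmt,                   # tab-separated output
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


url = pdb_components_url("8F2U")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return one row per protein component of the complex.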

You can also use the ‘Customize columns’ option to include 3D structures as an additional column. This provides a glimpse of the available structures for each protein.

Which structure is the ‘best’ structure?

There is often more than one experimentally determined structure available for a protein and they might provide different information. In the UniProt entry page, you can go to the Structure section for a list of all the available structures. Let’s take a look at COMMD7 (AC:Q86VX2).

Four structures are available for this protein, solved using different methods and at varying resolutions. X-ray crystallography often provides high-resolution structures for smaller proteins, while cryo-EM can capture the conformation of larger and more dynamic protein complexes. Understanding how the structures were solved can help us select the most informative structure for our study.

Another useful piece of information is the ‘POSITIONS’ column. The positions here indicate the construct used to determine the structure and they may not cover the full-length protein. In this example, if you are interested in the N-terminal 130 residues of COMMD7, they are found in PDB:8F2R, 8F2U, 8P0W, but not in PDB:8ESD. In some cases, even if a full-length construct is used, some regions may not be observable in the structure. This is usually due to technical limitations in resolving highly dynamic regions in a protein.
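If you have many structures to compare, the coverage check can be automated. The sketch below assumes that positions are written as simple "start-end" ranges, optionally comma-separated; the structure names and ranges are hypothetical placeholders, not the real values for COMMD7.

```python
def parse_positions(positions):
    """Turn a string such as "1-200" or "1-50, 80-120" into (start, end) tuples."""
    ranges = []
    for part in positions.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges


def covers(positions, first, last=None):
    """Return True if residues first..last fall entirely within one listed range."""
    last = first if last is None else last
    return any(start <= first and last <= end
               for start, end in parse_positions(positions))


# Hypothetical structures and positions, for illustration only.
structures = {"STRUCT_A": "1-200", "STRUCT_B": "135-200"}

# Which structures include residue 90?
usable = [name for name, pos in structures.items() if covers(pos, 90)]
print(usable)
```

Note that this only checks the construct boundaries; as described above, residues inside the construct may still be unresolved in the structure itself.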

If you are interested in the conformation of the full-length protein, you can refer to the AlphaFold prediction. UniProt provides an AlphaFold model for most protein entries, predicted based on the canonical sequence. These full-length models can provide insights for regions that are absent in experimentally determined structures.

Where is my residue of interest?

A high-resolution protein structure can provide residue-level information about a protein. For example, post-translational modification (PTM) analysis by the PTMeXchange project showed that residue Lys90 in human COMMD7 is ubiquitinated. But where is it in the structures?

UniProt and PDB entries provide a consistent residue-level mapping through the SIFTS project. This means for all the PDB structures mapped to the same UniProt entry, you can expect to see the same Lys90 in the structures covering this region of the protein.

Using the UniProt Feature viewer, you can identify where Lys90 is.

Clicking on Lys90 will highlight this residue across all the feature viewer tracks. It will also zoom in and highlight Lys90 in the structure viewer. But don't forget to select a structure that covers residue 90! (Hint: not PDB:8ESD)

How to use this information for further studies?

If you want to study these structures in detail, you can directly download them from the UniProt entry page.

PyMOL and UCSF Chimera are among the most common molecular visualisation tools in bioinformatics and computational chemistry. You can use them for more complex analysis, such as measuring bond lengths, docking small molecules and simulating conformational changes.

We can use the information from UniProt to get us started. For example, knowing that chain G in PDB:8F2U is COMMD7, you can highlight COMMD7 in the structure of the 12-subunit CCC complex. You can also identify Lys90 and check if it is accessible for ubiquitination in the complex.

Try these commands in PyMOL:

fetch 8f2u

select COMMD7, 8f2u and chain G

util.cbay COMMD7

show sticks, (COMMD7 and resi 90 and not name N+C+O)

label n. CA and i. 90 and COMMD7, '%s%s' % (resn, resi)

Have fun navigating the world of protein structures!

Learn more about structural annotations in UniProt: https://www.uniprot.org/help/structure_section

Learn more about multiple structures for the same protein:
https://www.uniprot.org/help/multiple_pdb_xrefs

Learn more about sequence coverage in structures:
https://www.uniprot.org/help/structure_subseq

Wednesday, June 25, 2025

UniProt - the ultimate colleague on your biological research team!

How many members do you have on your team and have you ever considered UniProt as one of them?

UniProt is a suite of open access protein databases, accessed by 9 million unique visitors a year, but how much money does it save you?

Our contribution to the scientific community and wider economy has now been analysed in a case study by CSIL, as part of the EU-funded project PathOS. This study investigated the cost-benefit of open data resources, and an analysis of UniProt’s impact between the years of 2017 and 2023 has been published.

What are the main costs of maintaining UniProt?
Across our three consortium sites (EMBL-EBI, SIB and PIR), we have office space, equipment, consumables and publication costs. But our main outgoings (70%) are salaries for our teams of expert biocurators and software developers. The cost to a user is measured in time spent providing valuable voluntary knowledge contributions. 49% of our users visit weekly and 26% daily.

What are the benefits of using UniProt?
The immediate benefit for users is a significant saving of time. UniProt integrates data from over 180 cross-referenced databases, combines it with expert manually curated sequence and protein data, then puts it all in one place, in a standardized format. This saves users time because they don’t have to navigate between multiple resources or download data in different formats. Additionally, users do not have to sift through numerous scientific publications to understand the current cutting-edge research available for a given protein. It also helps streamline research and analysis pipelines by providing centralized, ready-to-use data.

Long-term benefits are measured in publications and patents that mention UniProt, indicating the wider impact on scientific knowledge advancement and technological development. Over a period of 7 years, more than 15,200 publications and more than 183,000 patents cited or referenced UniProt. Many of these patents go on to be referenced by subsequent patents spanning a number of biotechnology and health innovation fields. Economic benefits have also been seen in the number of start-up companies that rely on UniProt data for their business model.

Demonstrating the value of UniProt

Each user gains a net benefit from UniProt of up to €5,475 and saves 219 hours a year.
Users say that our main strength ‘is the ability to integrate protein sequences identified in the literature with an extensive body of functional information’. This is facilitated by incredibly passionate teams of expert biocurators and software developers that ensure data is curated into the database reliably, and presented in a consistent and easily accessible manner.
Users agree ‘that there is no alternative offering the same breadth of knowledge, quality and level of integration as UniProt’.
Overall users say: ‘UniProt plays a crucial role in accelerating scientific research and innovation in various forms, thereby facilitating the creation of new knowledge.’
In total, UniProt provides a benefit of between €373 million and €565 million per year to its community of scientific users.

Learn more about the details of how UniProt supports research and innovation by reading the full report: Measuring the value and impact of open science

Links to associated articles

ELIXIR - https://elixir-europe.org/news/UniProt-CBA

SIB - https://www.sib.swiss/news/uniprot-user-benefits-up-to-39-times-higher-than-operational-costs

PathOS - https://pathos-project.eu/open-science-value-costs-and-benefits-for-whom-how-to-support-informed-investment-decisions

The UniProt team would like to take the opportunity to thank our funders, collaborators and users for their time, support and contributions to the database. UniProt is a key part of the EMBL-EBI, SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR) and ELIXIR infrastructure. Our main funders are EMBL, The State Secretariat for Education, Research and Innovation SERI (Switzerland), and NIH (USA). We are one of ELIXIR’s Core Data Resources, and two of our three partners, EMBL-EBI and SIB, are ELIXIR Nodes. The findings of this study will support efforts to advocate for long-term funding for critical biodata resources.

Wednesday, June 18, 2025

Capturing the Diversity of Life - Reorganizing the Protein Space in UniProtKB

Advances in genome sequencing technology mean that large-scale efforts such as the Earth BioGenome Project and the Darwin Tree of Life are aiming to produce high-quality reference genomes for individual species, to capture the biodiversity of our planet. Each of these genomes will translate to a complete set of proteins for that species. To enable UniProt to present this wealth of data to our users, we are making significant improvements to our data content.

Upcoming changes in our selection of Reference Proteomes

In release 2025_04 (currently scheduled for October 2025), we will deploy a new Reference Proteome selection pipeline to improve the representation of species biodiversity in the UniProt Knowledgebase (UniProtKB). From release 2026_02 onwards, we will restrict the protein space in UniProtKB to sequences that are part of a Reference Proteome, the expert-reviewed UniProtKB/Swiss-Prot section, and entries containing additional biologically important data such as a 3D structure.

What is a Reference Proteome?

A proteome is the set of all translated proteins from a genome assembly. For each species, we generally use an automatic pipeline to select one proteome as the Reference Proteome - the proteome that we believe is the best representative of the proteins encoded by that species (for example, the Reference Proteome for Drosophila simulans is UP000000304 as it provides the best coverage of the protein space for the species). In addition, proteomes of well-studied model organisms and other proteomes of interest for biomedical and biotechnological research may be selected as Reference Proteomes by UniProt curators.

Is every sequenced proteome currently in UniProtKB?

No. While we currently include many proteomes that are not Reference Proteomes in UniProtKB, we already exclude many others, for example those which are of poor quality or where the species is already over-represented in UniProtKB.

Why are we reorganizing the protein space in UniProtKB?

Submissions of genomes to hubs such as the International Nucleotide Sequence Database Collaboration (INSDC) have grown due to the rapid increase in sequencing capabilities, resulting in a large influx of proteomes into UniProtKB. Our new pipeline will provide a much better representation of the biodiversity of life and coverage of the sequence space and improve user experience when searching and selecting the best proteome for their research work. It will also allow us to improve our functional annotation of the proteins in the novel reference proteomes newly imported into UniProtKB in order to provide an enhanced understanding of the biology of these species.

Figure 1 - Growth of UniProtKB entries throughout the years

What is going to happen over the next few months?

In release 2025_04, we will deploy the new pipeline that will improve our selection of Reference Proteomes for cellular organisms (more details below), and we will align our viral Reference Proteomes to the set of exemplar genomes from the International Committee on Taxonomy of Viruses (ICTV). Additionally, we will start the process of removing proteins from non-Reference Proteomes from UniProtKB. In the first stage (2025_04), this includes proteins from taxonomically unclassified proteomes. In release 2026_02 we will remove the remainder of proteins from non-Reference Proteomes from UniProtKB.

How does the new pipeline work?

The new pipeline is designed to select the proteome that best represents the protein space of a species. It uses a clustering system based on MMseqs2 [1] to select one, or a few, proteomes for each species. It only analyzes high-quality proteomes from species with a recognized taxonomy and a formal scientific name. Problematic proteomes, such as those from contaminated genomes, are excluded from the analysis.

MMseqs2 is used to cluster first proteins, then proteomes:

1. Protein clusters: firstly, it clusters similar proteins from different proteomes of the same species. The more proteins a proteome has in different clusters, the more likely it is to be selected as a reference.

2. Proteome clusters: secondly, it calculates how similar the proteomes of the same species are, based on the number of protein clusters shared between any two proteomes. Proteomes are clustered together if they share 50% or more protein clusters.
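To make the second step concrete, here is a toy sketch of the proteome clustering idea. It is not the production pipeline: it assumes the protein clusters have already been computed (for example with MMseqs2), it assumes the shared fraction is measured relative to the smaller proteome, and it assumes proteomes are grouped by single linkage. All names and data are hypothetical.

```python
from itertools import combinations


def shared_fraction(a, b):
    """Fraction of protein clusters shared by two proteomes.

    Assumption: measured relative to the smaller proteome.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_proteomes(proteomes, threshold=0.5):
    """Group proteomes that share at least `threshold` of their protein clusters."""
    parent = {name: name for name in proteomes}

    def find(x):
        # Follow parent links to the group representative (union-find).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(proteomes, 2):
        if shared_fraction(proteomes[p], proteomes[q]) >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for name in proteomes:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


def pick_reference(group, proteomes):
    """Prefer the proteome with proteins in the most clusters (see step 1)."""
    return max(sorted(group), key=lambda name: len(proteomes[name]))


# Hypothetical proteomes of one species, each a set of protein cluster IDs.
toy = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 3, 5, 9}, "P3": {6, 7, 8}}
groups = cluster_proteomes(toy)
references = [pick_reference(group, toy) for group in groups]
print(groups, references)
```

In this toy data set, P1 and P2 share three protein clusters and end up in one group, while P3 shares none and forms a group of its own, so the species would be represented by two proteomes.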

Figure 2 - Diagram of the Reference Proteome selection pipeline

This new pipeline will select at least one Reference Proteome for every sequenced species for inclusion in UniProtKB, ensuring a broad representation of the Tree of Life. It may select more than one Reference Proteome for species with high genomic diversity. To keep the set of Reference Proteomes stable, but not static, between UniProt releases, the pipeline preferentially selects the most complete proteome (as determined by BUSCO [2]) and will only replace an existing Reference Proteome when a new one of significantly higher quality is published.

How big a change will this make to UniProtKB?

The changes in UniProt between releases 2025_04 and 2026_02 will alter the content and size of the database. The number of Reference Proteomes will increase by 36% (reflecting a 34% increase in species covered), while the number of proteins in UniProtKB may decrease by 43%.

What do I do if the proteome I am working on is no longer in UniProtKB?

All proteomes that are not selected as Reference Proteomes by the new pipeline will be available through UniParc. When you search for such proteomes in the Proteomes portal, you will be directed to UniParc to access your protein set. The UniParc FASTA header format for proteomes will be improved to show the protein and gene names, and database identifiers, of the underlying genome records from EMBL, Ensembl or RefSeq.

If the annotation provided by UniProtKB is particularly important to your work, or your organism is actively worked on by a research community but has not been selected as a Reference Proteome, please contact us and we will consider promoting it to Reference Proteome status. The list of proteomes that will be deprecated by release 2026_02, and also the proteins associated with them, is available on our FTP site. The same link will allow you to access an additional document which lists all the proteins we will be retaining in UniProtKB.

Please note that, in response to requests from our user community, we have delayed removal of the majority of non-reference proteomes from release 2026_01 to 2026_02, to give users more time to update their downstream pipelines.

[1] MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets | Nature Biotechnology

[2] BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes | Molecular Biology and Evolution | Oxford Academic

Proteome help page: https://www.uniprot.org/help/proteome

MMseqs2 GitHub: https://github.com/soedinglab/MMseqs2

Tuesday, May 6, 2025

Rich Epitope Information Comes to UniProt

Mammalian immune responses are mediated by interactions between antigens and immune system components such as antibodies, B cells, and T cells. However, antibodies and immune cells do not bind to entire antigens, which are usually proteins or large polysaccharides; instead, they recognize one or more small regions within the antigen called epitopes. Characterizing epitopes gives us insight into infectious diseases, autoimmune diseases, and cancer, and leads to therapeutic innovations such as the development of more effective vaccines.

UniProt curators have traditionally included information about protein epitopes from the literature as part of the process of manually annotating protein entries. However, epitope information in UniProt recently got a big boost from a collaboration with the Immune Epitope Database (IEDB). The IEDB is a freely available, manually curated resource that catalogs experimental data on antibody and T cell epitopes in humans and other animal species in the context of a variety of diseases and conditions.

Thanks to the UniProt-IEDB collaboration, epitopes curated by the IEDB can be viewed in a track in the UniProt Feature Viewer with links back to the IEDB. In addition, publications with epitope information identified by the IEDB are now accessible on UniProt Publications pages, and IEDB epitopes are searchable using the Proteins API.

The collaboration has enhanced UniProt with information about more than 700,000 naturally occurring, linear peptide epitopes in over 57,000 proteins, citing over 7,000 papers that describe their experimental characterization.

For example, consider the protein O-phosphoseryl-tRNA(Sec) selenium transferase (SEPSECS; UniProt ID: Q9HD40). SEPSECS (aka SLA/LP autoantigen), which normally functions as an enzyme in selenoprotein biosynthesis, is an autoantigen in autoimmune hepatitis (AIH), a chronic inflammatory disease of the liver. Patients with AIH have circulating antibodies against SEPSECS as well as lymphocyte infiltrations in the liver. The UniProt entry page identifies the region from amino acids 474-493 as an SLA/LP epitope based on UniProt curation of a publication characterizing autoantibodies in AIH (PMID:11826415; top panel of figure). The Epitopes track of the Feature Viewer (middle panel of figure) displays the epitopes of SEPSECS that have been curated by the IEDB, aligned with their positions in the protein sequence. Clicking on an epitope brings up a box with the epitope sequence, the experiments in which the epitope was studied, and a link to the epitope’s page in IEDB.

Finally, the Publications page (bottom panel of figure) lists a paper with epitope information that was cited by IEDB (PMID:18773898). According to the accompanying annotation, SEPSECS is a target of T cells in patients with AIH. A review of the abstract reveals that the paper describes the identification of multiple epitopes in SEPSECS that are recognized by autoreactive CD4+ T cells in AIH.

The inclusion of epitope and related immune information from IEDB in UniProt is an exciting development that will hopefully prove valuable to immunologists and others interested in the role of the immune system in health and disease.