Inside UniProt

Thursday, November 29, 2018

Using UniProtKB to explore the world of protein structure

Protein structures are used to understand the architecture of a protein, to explain how a protein interacts with its ligands or cofactors and to study the composition of protein complexes. They help us to identify the position and nature of post-translational modifications and, as 3D structure is more evolutionarily conserved than primary sequence, can also be used to predict protein function. Identifying proteins that share a conserved protein fold may also help to ascertain a molecular function that is common to them all. Understanding how topology affects the active sites of enzymes or identifying sequence-conserved regions, such as binding sites or areas of electrostatic potential, on the surface of a protein can also give valuable clues to the role a protein plays in a cell.

Annotation of proteins based on structure-based analyses is an integral part of the work of the UniProt Knowledgebase (UniProtKB). UniProt works closely with the Protein Data Bank in Europe (PDBe) to map 3D structural entries (~100,000) to the appropriate UniProtKB entries at the individual residue level [1]. It then becomes possible to use the UniProtKB advanced search functionality to ask questions such as ‘How many proteins in the human proteome have at least a partial 3D structure?’
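A question like this can also be asked programmatically. The following is a minimal sketch that only builds the search URL and does not contact any server. The endpoint (`rest.uniprot.org/uniprotkb/search`) and the query fields (`organism_id`, `database`, `reviewed`) are assumptions about UniProt's programmatic interface and are not taken from this post, so check them against the current UniProt API documentation before relying on them. Restricting by organism (taxon 9606 for human) is also only an approximation of "the human proteome".

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB search endpoint (not stated in this post).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_structure_query(taxon_id, reviewed_only=False):
    """Build a search URL for entries of one organism that have a PDB cross-reference.

    The field names used here (organism_id, database, reviewed) are assumptions.
    """
    clauses = [f"(organism_id:{taxon_id})", "(database:pdb)"]
    if reviewed_only:
        clauses.append("(reviewed:true)")
    # urlencode percent-encodes the brackets and colons and turns spaces into "+".
    return f"{BASE_URL}?{urlencode({'query': ' AND '.join(clauses)})}"


# 9606 is the NCBI taxonomy identifier for Homo sapiens.
url = build_structure_query(9606)
print(url)
```

Opening the printed URL in a browser, or fetching it with an HTTP client, should return the matching entries if the assumed endpoint and field names are correct.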

Searching for structural data in UniProtKB

Once you have found the protein you are interested in, use our navigation tool in the entry to move to the Structure section, where you may either find more information in the table view or visualise a 3D image. The table view lists all the structures available for that molecule, gives details of the method by which each structure has been determined (e.g. X-ray, NMR, Electron Microscopy) and provides an accurate residue-level mapping to the region of amino acid sequence covered by each structure. Links to a number of external data repositories and resources enable you to access more detailed information. To help our users visualize the structure, we have recently incorporated the LiteMol Viewer, an HTML5 web application that not only provides cartoon, surface and ball-and-stick visualizations but also links you to the PDBe database, allowing you to view and explore validation and annotation data.
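Because each structure is mapped to a residue range, the table view can be used to work out how much of a sequence is covered by at least one structure. The sketch below illustrates the idea by merging overlapping ranges. The structure names, residue ranges and sequence length are hypothetical values chosen for illustration, not data from any UniProtKB entry.

```python
def covered_residues(ranges):
    """Count residues covered by at least one (start, end) range.

    Ranges are 1-based and inclusive; overlapping or adjacent ranges are merged.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end + 1:
            # A gap: close the previous merged range and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged range.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# Hypothetical residue-level mappings for three structures of one protein.
structures = {
    "STRUCT_A": (10, 120),
    "STRUCT_B": (100, 250),
    "STRUCT_C": (400, 450),
}
sequence_length = 500  # hypothetical sequence length

covered = covered_residues(structures.values())
fraction = covered / sequence_length
print(f"{covered} of {sequence_length} residues covered ({fraction:.1%})")
# prints: 292 of 500 residues covered (58.4%)
```

A protein with a fraction above zero but below one has "at least a partial 3D structure" in the sense used above.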

Visualising Bloom's syndrome helicase (P54132) in complex with ADP and duplex DNA.

Hovering over the structure will show you the amino-acid residue-level mappings. With a single click, you can zoom in to a more detailed view, for example to visualize the details of cofactor binding.

Zooming in on Bloom's syndrome helicase to show ADP binding

Knowing the shape of a protein can give you valuable clues to the function of that molecule. Use UniProtKB to explore the links between sequence, structure and function and understand how molecule topology can drive cellular phenotype.

Want to learn more?

Go to our pre-recorded webinars to learn more about the annotation of structural data in UniProtKB:

https://www.ebi.ac.uk/training/online/course/protein-structures-and-their-features-uniprotkb

https://www.ebi.ac.uk/training/online/course/uniprot-navigating-between-genes-amino-acid-sequences-and-3d-structures

Monday, August 13, 2018

Cogs of data in UniProt

UniProt's mission is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

Have you ever wondered how long it takes for a new protein sequence to reach you? Or how long it would take for any feedback you send about an entry to become incorporated (at the earliest)? Let’s follow the journey of an imaginary Protein X!

1) Protein X begins its UniProt life cycle when the sequence is imported into the database. As shown in the image below, our protein begins its UniProt life in the blue phase of 'Live Data', named thus because the UniProt production team is actively working on this data. Information can only be merged into protein entries in this blue phase. This phase runs for 4 weeks. For a newly imported sequence like Protein X, this phase consists of:

Importing new/updated proteins from INSDC, Ensembl, RefSeq, PDBe, direct submissions, etc.
Creating a new UniProt entry for Protein X (or merging with an existing entry if identical)
Adding cross-links to taxonomy information and the source of sequence

2) Protein X now enters the yellow phase of 'Frozen data' for 4 weeks. The UniProt production team freezes the new data and makes it available to some internal/collaborating groups to access it and work on it as follows:

InterPro: to assign Protein X into protein families, identify domains and functional sites
Gene Ontology group: to classify functions into the gene ontology
UniProt curators: to potentially review the protein and annotate data
UniProt automatic annotation: to run rule-based and data-mining pipelines to add annotations

While these internal/collaborating groups work on the entry in this phase, this information can't be merged into the entry yet (and the entry remains frozen). The new information received in this phase goes for post-processing in the next blue phase, as shown by the grey parts of the image below.

At the end of this 4-week yellow phase, Protein X is released to the public! Typically, as no curated information has been merged in yet, the entry is released as an unreviewed (TrEMBL) entry.

This is the first version of the protein entry that you see, 8 weeks after the sequence was imported into the UniProt databases. It only contains the basic information added in the first blue phase, before the freeze.

So you see it takes a new sequence at least 8 weeks to reach you (depending on when it was submitted), even though there is a UniProt release every 4 weeks. Let's call this first version of Protein X released here as Protein X.1.

3) Protein X.1 now sits in the pink 'Public data' phase as part of the release for 4 weeks, which allows everyone to access it, including UniProt users and other external groups. They can send feedback, requests, additional papers, etc. about Protein X.1 to UniProt. This might trigger more expert curation by UniProt or other improvements to the entry for the future. The release is visible for 4 weeks and then archived when a new release is made.

The grey lines in the figure below show what happens in parallel to the first run of yellow and pink phases. While Protein X.1 is in the pink 'Public data' phase for 4 weeks, the next version of Protein X, let's call it Protein X.2, is flowing through its second blue phase over 4 weeks, where information received during or after the previous yellow phase is being merged into it. If information from expert curation is merged into the entry here, the entry becomes a Reviewed (Swiss-Prot) entry.

Following this blue phase, Protein X.2 will be frozen and spend 4 weeks in the yellow phase and then be released to the public as the second version of the entry with more information added to it, i.e. Protein X.2!

The cogs of the blue, yellow and pink phases are always turning. So, after the pink 'Public data' release phase, any improvements sent by users are fed into a future version of Protein X, say Protein X.Future, which will flow through a future cycle of the blue phase and then the yellow phase before becoming available in a public release again in the pink phase.

To summarise, if a new protein sequence is imported into UniProt in January, it would become available to you via a UniProt release in March. If you then sent us feedback about this entry, e.g. another paper about it, this information would become available in UniProt at the earliest in May (or later depending on the complexity of the suggestions).
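The arithmetic behind this summary can be sketched with a few lines of date calculation. This is only an illustration under stated assumptions: the import date is hypothetical and is taken to fall at the very start of a blue phase, and feedback sent during the pink phase is assumed to be merged in the blue phase that starts once that pink phase ends. Real release dates depend on the actual UniProt schedule.

```python
from datetime import date, timedelta

PHASE = timedelta(weeks=4)  # each of the blue, yellow and pink phases lasts 4 weeks


def first_public_release(import_date):
    """Blue phase (4 weeks) plus yellow phase (4 weeks) before the first release."""
    return import_date + 2 * PHASE


def earliest_feedback_release(release_date):
    """Pink phase in which feedback is sent, then a further blue and yellow phase."""
    return release_date + 3 * PHASE


imported = date(2018, 1, 8)  # hypothetical import date, assumed to start a blue phase
released = first_public_release(imported)
updated = earliest_feedback_release(released)

print(released.isoformat())  # 2018-03-05
print(updated.isoformat())   # 2018-05-28
```

Under these assumptions, a January import appears in a March release, and feedback on that release appears in May at the earliest, matching the summary above.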

There is a UniProt release every 4 weeks but behind the scenes, the blue, yellow and pink phases are always running in parallel. It's not only new protein sequences like Protein X that go through these phases but also existing UniProt entries so they can get new information added too. While you only see a release every 4 weeks, the cogs of data are always turning!

Tuesday, July 3, 2018

New guidelines to help with protein naming

Why is consistent protein naming important?

For many proteins, a variety of different names are used across the scientific literature and public biological databases, which makes effective organization and exchange of biological information a difficult task. Consistent protein nomenclature is indispensable for communication, literature searching and retrieval of database records.

New protein nomenclature guidelines

To address this issue and provide some help in protein naming, a set of protein nomenclature guidelines has been produced jointly by the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). UniProt has been heavily involved in this work along with other groups from the four institutes. These efforts have built on existing guidelines that were already in use by groups such as UniProt and RefSeq, expanding and consolidating them into a single shared document that provides a comprehensive set of recommendations.

What makes a good protein name?

A good protein name is one that is unique, unambiguous, can be attributed to orthologs from other species and follows official gene nomenclature where applicable. The guidelines help to achieve this goal by covering all aspects of protein naming, from advice on expert sources of protein names and how to name novel proteins of unknown function to more detailed advice such as terms to avoid in a protein name and acceptable abbreviations.

Who are the guidelines intended for?

The guidelines are intended for use by anyone who wants to name a protein. Groups who will find these guidelines helpful include:

Biocurators who want to assign a protein name as part of a database record
Bioinformaticians who intend to assign protein names as part of gene annotation pipelines prior to submission to public archives
Researchers who isolate a new protein and want to name it prior to publication

Where are the guidelines?

The guidelines can be found on the UniProt website at http://www.uniprot.org/docs/International_Protein_Nomenclature_Guidelines.pdf and on the NCBI website at https://www.ncbi.nlm.nih.gov/genome/doc/internatprot_nomenguide/.

Can I give feedback?

We welcome feedback on these guidelines and are happy to receive any comments at help@uniprot.org. We will share them with the other groups involved in producing the document and use them to improve the guidelines in future updates.