Inside UniProt

Monday, May 4, 2015

UniProt Knowledgebase just got smaller!

UniProt release 2015_04 at the beginning of April 2015 saw the number of protein entries in the UniProtKB go from 92,672,207 to 47,262,724. Wondering what happened?

Prior to release 2015_04, UniProtKB had doubled in size over the previous year to more than 90 million entries, with a high level of redundancy. This was especially true for bacterial species where many genomes of the same bacterium have been sequenced and submitted (e.g. 4,080 proteomes for Staphylococcus aureus comprising 10.88 million entries).

To deal with this redundancy, we developed a procedure to identify highly redundant proteomes within species groups. This procedure was implemented for bacterial species and the sequences corresponding to redundant proteomes (approximately 47 million entries) were deprecated. All of these protein entries belonged to the unreviewed TrEMBL part of the UniProt Knowledgebase. Reviewed Swiss-Prot protein entries will remain unaffected by the procedure.

So how does this procedure actually work? Here we break it down into 4 steps.

Step 1: Group proteomes by taxonomy level

Proteomes can only be redundant to other proteomes of the same taxonomy branch at species level or below (sub-species, strains, etc.).
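As a rough illustration of this grouping step, the sketch below buckets proteomes by the species they fall under. The proteome identifiers, taxon numbers and record layout are hypothetical and are not taken from UniProt's pipeline.

```python
from collections import defaultdict

# Hypothetical records: (proteome id, taxon id of the sequenced organism,
# taxon id of the species that organism belongs to).
proteomes = [
    ("P1", 1001, 100),  # a strain of species 100
    ("P2", 1002, 100),  # another strain of species 100
    ("P3", 100, 100),   # annotated at species level
    ("P4", 2001, 200),  # a strain of a different species
]

def group_by_species(records):
    """Group proteome ids by their species-level taxon id."""
    groups = defaultdict(list)
    for proteome_id, _taxon_id, species_taxon_id in records:
        groups[species_taxon_id].append(proteome_id)
    return dict(groups)

groups = group_by_species(proteomes)
```

Only proteomes that end up in the same group are compared with each other in the next step.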

Step 2: Pairwise comparison of proteomes within each group

We use the CD-Hit 2D program for pairwise comparison of proteomes within each group. Based on the results, we calculate the level of similarity between pairs of proteomes within the groups.
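The post does not spell out how the similarity level is derived from the CD-Hit 2D output. The sketch below shows one plausible reading, assuming that the similarity of a query proteome to a target proteome is the fraction of the query's sequences that have a match in the target. Exact set intersection stands in for the real sequence comparison, and all names and data are hypothetical.

```python
from itertools import permutations

# Hypothetical toy proteomes: each is a set of (very short) protein sequences.
group = {
    "P1": {"MKT", "MVL", "MSA", "MQR"},
    "P2": {"MKT", "MVL", "MSA"},
    "P3": {"MAA", "MKT"},
}

def similarity(query, target):
    """Fraction of the query proteome's sequences also found in the target.

    Assumption: exact matching stands in for the CD-Hit 2D comparison.
    """
    if not query:
        return 0.0
    return len(query & target) / len(query)

# The measure is directional, so compute it for every ordered pair.
pairwise = {
    (a, b): similarity(group[a], group[b])
    for a, b in permutations(group, 2)
}
```

In this toy data P2 is fully contained in P1, so the similarity of P2 to P1 is 1.0, while the similarity of P1 to P2 is only 0.75.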

Step 3: Graph analysis

We now select just the proteome pairs with similarity higher than 90%. With these proteomes as nodes, we create a directed weighted graph in which each edge is weighted by the level of similarity between the pair. To identify the most redundant proteomes, we rank all proteomes in this graph.

Ranking is by Proteome(Indegree, Outdegree) where for a proteome A:

• Indegree (the higher the better): Number of proteomes that are redundant to proteome A.

• Outdegree (the lower the better): Number of proteomes to which proteome A is redundant.

So, for example:

A(5,1) is better than B(1,1)
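The ranking can be sketched in a few lines. This is an illustration only: the similarity values are invented, and the way the two degrees are combined (indegree first, outdegree as tie-breaker) is an assumption that goes beyond what the post states.

```python
from collections import Counter

THRESHOLD = 0.90  # similarity cut-off stated in the post

# Hypothetical similarities: (X, Y) -> how similar X is to Y, read as
# "X is redundant to Y" when the value is above the threshold.
similarities = {
    ("B", "A"): 0.97,
    ("C", "A"): 0.95,
    ("A", "D"): 0.92,
    ("B", "D"): 0.93,
    ("C", "B"): 0.50,  # below the threshold, so no edge
}

def rank_proteomes(sims, threshold=THRESHOLD):
    """Return (proteome, indegree, outdegree) tuples, best-ranked first."""
    edges = [pair for pair, score in sims.items() if score > threshold]
    outdegree = Counter(src for src, _ in edges)
    indegree = Counter(dst for _, dst in edges)
    nodes = {n for edge in edges for n in edge}
    # Assumed ordering: indegree descending, then outdegree ascending,
    # then name so that the result is deterministic.
    return sorted(
        ((n, indegree[n], outdegree[n]) for n in nodes),
        key=lambda item: (-item[1], item[2], item[0]),
    )

ranking = rank_proteomes(similarities)
# ranking == [("D", 2, 0), ("A", 2, 1), ("C", 0, 1), ("B", 0, 2)]
```

Here D ranks best because two proteomes are redundant to it and it is redundant to none, while B ranks worst.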

Step 4: Elimination of redundant proteomes

Proteomes that rank lowest are the most redundant. These are marked as ‘redundant’ on the UniProt proteomes web portal and protein entries belonging to these redundant proteomes are removed from UniProtKB TrEMBL.

This process is run iteratively to identify all redundant proteomes. All proteomes remain searchable through UniProt’s Proteomes interface (http://www.uniprot.org/proteomes/) and redundant proteome sets are now available for download from the UniProt Archive UniParc.
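One way to picture the iterative elimination is the loop below. It is a sketch under stated assumptions, not UniProt's implementation: it assumes that a single lowest-ranked proteome is marked per round and that rounds continue until no redundancy edges remain. The edge data is hypothetical.

```python
def find_redundant(edges):
    """Iteratively mark the lowest-ranked proteome as redundant.

    `edges` is a set of (X, Y) pairs meaning "X is redundant to Y".
    Assumption: one proteome is removed per round, and rounds continue
    until no redundancy edges are left.
    """
    remaining = set(edges)
    redundant = []
    while remaining:
        nodes = {n for edge in remaining for n in edge}
        indegree = {n: sum(1 for _, dst in remaining if dst == n) for n in nodes}
        outdegree = {n: sum(1 for src, _ in remaining if src == n) for n in nodes}
        # Lowest rank: fewest incoming edges, then most outgoing edges.
        worst = min(nodes, key=lambda n: (indegree[n], -outdegree[n], n))
        redundant.append(worst)
        # Drop every edge that touches the proteome just marked.
        remaining = {e for e in remaining if worst not in e}
    return redundant

edges = {("B", "A"), ("C", "A"), ("A", "D"), ("B", "D")}
redundant = find_redundant(edges)
# redundant == ["B", "C", "A"]; D is the proteome that is kept
```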

Tuesday, March 31, 2015

Customise and share your search results

Have you ever wanted to add more information to a UniProt results table to see, for example, which entries have a 3D structure available for them or a known disease correlation? Not only can you customise your table to add or remove columns, you can also now share a custom URL including the columns in your view! This URL can also be used in your programmatic access scripts. Just click on the 'Share' button above the results table to get your custom URL.

To customise your results table, you can reorder, add or remove columns easily using the 'Columns' button or by simply clicking on the right-most column in the table. Clicking on either of these will take you to the 'Customise results table' page.

The 'Customise results table' page allows you to re-order, add and remove columns. You can add or remove columns by selecting or unselecting the checkbox next to the column names listed under the 'Add more columns' heading. Then simply drag and drop them into the order you want under 'Columns to be displayed'! You can also remove columns by clicking on the 'x' on the labels under 'Columns to be displayed'.

There are various categories of columns available to be selected, each category heading containing several columns under it.

The quickest way to find out whether there is a column for the data type you're interested in is to start typing its name into the 'Search' field. It auto-completes your query and suggests all possible matches.

Once you're happy with your selections, click 'Save' to go to your customised results view. Then use the new 'Share' button to get the specific URL for the results table with your columns. You can then share this with colleagues to show them your exact view of the data, or use it in your scripts to download results with all the data types you are interested in.
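For scripted access, a shared URL is just a base address plus query parameters, so it can be assembled in code. The sketch below is hedged: the `/uniprot/` path, the parameter names `query`, `columns` and `format`, and the column identifiers are assumptions for illustration. The reliable way to get the exact URL is to copy it from the 'Share' button.

```python
from urllib.parse import urlencode

# Assumed address of the UniProtKB results page.
BASE_URL = "http://www.uniprot.org/uniprot/"

def build_results_url(query, columns, fmt="tab"):
    """Build a results URL that carries a custom column selection."""
    params = {
        "query": query,
        "columns": ",".join(columns),  # assumed: comma-separated column ids
        "format": fmt,                 # assumed: tab-separated download
    }
    return BASE_URL + "?" + urlencode(params)

# Hypothetical column identifiers.
url = build_results_url("insulin", ["id", "genes", "3d"])
```

The resulting string can be passed to any HTTP client to download the table with the selected columns.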

We hope you'll find this improvement useful! If you have any feedback or suggestions for the future, feel free to comment below or write to us at help@uniprot.org.

Tuesday, January 27, 2015

Online training for UniProt

Would you like to learn more about using UniProt but aren't able to access any training courses locally? Perhaps you would like a course that you can read and watch in your own time? Or even something to share with your students to introduce them to UniProt? We provide several online learning resources that can help! Here's a summary of everything you need to know to get started with using the UniProt website.

For an introduction to using pages on the UniProt website, the UniProt YouTube Channel has a series of short videos that each demonstrate how to use a UniProt page in one to two minutes.

If you are looking for a basic overview with further links to detailed help pages, the EBI Train Online portal provides a quick tour of UniProt.

We also provide a full-length course through the EBI training portal titled 'UniProt: Exploring protein sequence and functional information'. This interactive and in-depth course covers many areas, including:

An introduction to what UniProt is and when to use it
Where the data and annotation comes from
How to track provenance of information
How to use UniProt datasets and tools
How to download data and submit your own data
Guided examples
Exercises
A quiz to check your learning

We are planning to launch a series of webinars in 2015 to provide more interactive online training. Are there any topics you would like us to cover in particular? Write in and let us know!

Thursday, December 18, 2014

An insight into expert annotation with RS3_HUMAN

UniProt's expert curation consists of manual annotation based on literature and curation tools. You may know that previously unreviewed and automatically generated entries (from TrEMBL) go through expert curation to become reviewed entries in Swiss-Prot. However, the curator's role doesn't end at annotating an unreviewed protein entry and making it part of Swiss-Prot. Did you know that reviewed protein entries also undergo revisions by curators? Even well characterised proteins with reviewed Swiss-Prot entries are considered for revision to include information from the latest publications. These revisions can be very valuable, as a lot can be learnt about well characterised proteins through important updates such as newly identified enzymatic activities. One such example is the ribosomal S3 protein found in the UniProt entry RS3_HUMAN.

RS3_HUMAN was picked up for revision as it was originally missing a Function annotation. This was the beginning of a comprehensive review during which a UniProt curator read through a number of new publications about this protein. To provide high-quality in-depth experimental annotation, the choice of publications to use is critical. We prioritise publications with (i) a high impact in the scientific community that contain functional data for previously uncharacterized proteins, (ii) new 3D-structural information, (iii) enzymatic reactions that may complete the annotation of known metabolic pathways or networks, (iv) PTMs and their consequences, (v) novel splice variants and (vi) disease-causing variants as well as polymorphisms.

This review resulted in the addition of 26 new publications as sources for the RS3_HUMAN entry. Updates included annotations about function, enzyme activity, interactions, subcellular locations and post-translational modifications, among others. You can see the latest function annotation in the entry, now tagged with 16 publications.

To see the full list of updates on any entry, you can use the 'History' button displayed towards the top of the page. In the case of RS3_HUMAN, the History button shows that it was last updated on the 26th of November 2014. To view the exact changes, simply click on the 'Previous versions' link in the dropdown as shown below:

This will bring you to the full history page, where you can see the version numbers and dates of all changes. To view the exact updates, select the versions that you want to compare. For example, select version 170 with one radio button and version 168 with another and click 'Compare'.

You will then see all the changes, with removed information coloured in red and added information coloured in green. RS3_HUMAN here has a long scrolling page full of green additions, including the Function annotation where it all began!

Wednesday, November 12, 2014

Saving proteins with the UniProt basket

Have you ever browsed through different UniProt proteins, wishing you could save them somewhere for later? That's exactly what the UniProt basket allows you to do. It remembers your saved proteins so you can build your selection over time or simply come back to a saved protein later on. Here's a quick guide to your UniProt basket.

UniProt provides several tools and action buttons you can use directly on the search results page (i.e. Align, Blast, Download and Add to Basket). The basket provides you with the same tools for your saved proteins. You can select proteins within the basket to align them, run a Blast search or download them in various formats. You can delete the entries one by one from the 'Remove' column or use the 'Clear' button to delete all saved entries from the tab you're in (UniProtKB, UniRef or UniParc). You can also use the 'Full View' button to transfer the entries to a full results page for additional functionality such as filters and customisable columns.

You can add proteins to your basket from search results pages for the UniProtKB, UniRef or UniParc datasets. Simply select them in the results table and click on the 'Add to basket' button.

Once you add entries to your basket, you will see a small basket icon appear under the entry ID. You will also see the number on the basket changing to show that new proteins have been added. Clicking on the basket button will show you your saved proteins.

You can also add proteins to your basket from the protein entry pages, using the 'Add to basket' button towards the top of the page.

There is a limit of 400 entries in the basket, so be careful when trying to add large datasets. The contents of the basket will remain there until you delete your browser's cookies or clear the basket yourself.

Try the UniProt basket next time you're looking to save your proteins somewhere. Is there more functionality you would like to see in the basket? Write in and let us know!

Wednesday, October 15, 2014

Introducing Annotation Scores in UniProt

We are pleased to introduce you to annotation scores on the UniProt website! We have recently started providing annotation scores for all UniProtKB entries. An annotation score is a heuristic score on a five-point scale. A score of 5 points is associated with the best-annotated entries, and a score of 1 point denotes an entry with rather basic annotation. A 5-point annotation score would look like:

Annotation scores can help you quickly gauge the annotation content in a protein entry. For example, you could see which is the best-annotated protein in a family. We hope the scores will be useful in helping you narrow down to your entries of interest.

You can view annotation scores in the ‘Status’ line on all UniProtKB protein entry pages, as shown below.

You can also add annotation scores to your search results table through the ‘Columns’ button.

How are they used?

There are several contexts in which annotation scores can be used:

UniProtKB
The annotation scores can help you to get a quick idea of the relative level of annotation of the entries in your search results. Please note that search results are not ranked by the annotation score but by a query score. The query score considers not only the annotation scores of the entries that match your query, but also how often (and where) your query term(s) appear in a matching entry and across the whole database, and the importance of a term according to the total number of terms. For this reason, the best-ranked entries are not necessarily those with the highest annotation scores.

UniRef
We will be using annotation scores to select the representative member of a UniRef cluster.

Reference proteomes
We are using annotation scores to assist the selection of reference proteomes.

How are they computed?

Different UniProtKB annotation types (e.g. protein names, gene names, functional annotations (comments), sequence annotations (features), GO annotations and cross-references) are scored either by presence or by number of occurrences. Annotations with experimental evidence score higher than equivalent predicted/inferred annotations, thereby favoring expert literature-based curation over automatic annotation.

The score of an individual entry is the sum of the scores of its annotations.

The score of a proteome is the sum of the scores of the entries that are part of the proteome.
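The additive scheme described above can be sketched as follows. All weights, the evidence multiplier and the annotation type names are hypothetical placeholders, since the post does not give UniProt's actual values, and the conversion of a raw total into the five-point scale is left out.

```python
# Hypothetical weights; UniProt's real values are not given in this post.
PRESENCE_WEIGHTS = {"protein_name": 1.0, "gene_name": 1.0}   # scored once
COUNT_WEIGHTS = {"function_comment": 2.0, "go_term": 0.5}    # scored per occurrence
EXPERIMENTAL_BONUS = 2.0  # assumed multiplier for experimental evidence

def entry_score(annotations):
    """Sum the scores of an entry's annotations.

    `annotations` is a list of (annotation_type, is_experimental) tuples.
    """
    total = 0.0
    seen_presence = set()
    for kind, is_experimental in annotations:
        if kind in PRESENCE_WEIGHTS:
            if kind in seen_presence:
                continue  # presence-scored types count only once
            seen_presence.add(kind)
            weight = PRESENCE_WEIGHTS[kind]
        else:
            weight = COUNT_WEIGHTS.get(kind, 0.0)
        total += weight * (EXPERIMENTAL_BONUS if is_experimental else 1.0)
    return total

def proteome_score(entries):
    """Sum the scores of all entries that belong to a proteome."""
    return sum(entry_score(entry) for entry in entries)

entry_a = [
    ("protein_name", False),
    ("protein_name", False),     # ignored: already counted by presence
    ("function_comment", True),  # experimental, so weighted higher
    ("go_term", False),
    ("go_term", False),
]
entry_b = [("protein_name", False), ("gene_name", False)]

score_a = entry_score(entry_a)                 # 1.0 + 4.0 + 0.5 + 0.5 = 6.0
score_b = entry_score(entry_b)                 # 1.0 + 1.0 = 2.0
total = proteome_score([entry_a, entry_b])     # 8.0
```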

Next time you’re looking at a UniProt protein, look out for annotation scores. We welcome your feedback. Would you apply these scores in your work? Would you like to see them in your UniProtKB search results by default? Write in and let us know!

Monday, August 18, 2014

Have you tried UniProt RDF?

RDF is a core technology for the World Wide Web Consortium’s Semantic Web activities (http://www.w3.org/2001/sw/) and is well suited to work in a distributed and decentralized environment. The RDF data model represents arbitrary information as a set of simple statements of the form subject-predicate-object.
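To make the subject-predicate-object idea concrete, the sketch below stores a few statements as plain tuples and follows a link between them. The identifiers and predicate names are made up for illustration and are not real UniProt IRIs.

```python
# Hypothetical statements, each of the form (subject, predicate, object).
triples = [
    ("protein:P1", "organism", "taxon:T1"),
    ("protein:P1", "name", "Example protein"),
    ("taxon:T1", "scientificName", "Example species"),
]

def objects(statements, subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]

# Follow a link from the protein to its organism, then read the organism's name.
taxon = objects(triples, "protein:P1", "organism")[0]
species = objects(triples, taxon, "scientificName")[0]
```

Because every fact has the same three-part shape, statements from different sources can be pooled into one list and queried in the same way.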

Why RDF?

UniProt collects information from the scientific literature and other databases and provides links to over one hundred and fifty biological resources. Such links between different databases are an important basis for data integration, but the lack of a common standard to represent and link information makes data integration an expensive business. One way to tackle this problem at UniProt is by using the Resource Description Framework (http://www.w3.org/RDF/) to represent our data.

Using the UniProt RDF

The UniProt SPARQL endpoint is available in its beta form at http://beta.sparql.uniprot.org/. This SPARQL endpoint contains all UniProt data and is freely accessible. RDF provides the foundation for publishing Linked Data and the UniProt Consortium has been publishing its data in RDF since 2008, both on its web and FTP sites. Since 2013, the EMBL - European Bioinformatics Institute RDF platform also links to the UniProt RDF (http://www.ebi.ac.uk/rdf/).
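A SPARQL endpoint is normally queried by sending the query text in a `query` parameter. The sketch below only builds such a request URL for the beta endpoint mentioned above and does not send it. The query itself is an assumption: it presumes that the `up:` prefix and a `up:Taxon` class exist in the UniProt core vocabulary, which should be checked against the documentation.

```python
from urllib.parse import urlencode

ENDPOINT = "http://beta.sparql.uniprot.org/"  # beta endpoint named in the post

# Assumed query: list a handful of taxonomy resources.
QUERY = """PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?taxon WHERE { ?taxon a up:Taxon } LIMIT 5"""

def build_sparql_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL using the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})

url = build_sparql_url(ENDPOINT, QUERY)
```

The URL can then be fetched with any HTTP client; asking for a specific result format is typically done through the `Accept` header.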

Information about the UniProt data concepts and relationships in our RDF is available at http://beta.uniprot.org/core/. Additionally, we use some general purpose relationships such as those provided by SKOS (http://www.w3.org/2004/02/skos/), OWL (http://www.w3.org/TR/owl-ref/) and RDFS (http://www.w3.org/TR/rdf-schema/). As an example of data concepts and relationships, the following figure shows the UniProt taxonomy data as linked in our RDF.