Inside UniProt

Friday, December 11, 2015

View protein sequence annotations as genome browser tracks

With the latest UniProt release 2015_12, we are introducing new genome annotation track files in both BED and bigBed formats that will allow you to view human UniProtKB sequence feature annotations such as domains, sites and post-translational modifications as genome browser tracks! This initial beta release of the UniProt genome annotation tracks resource contains sequence annotations for human only but it will be followed by additional species in the future.

In addition to providing standard tracks, the UCSC and Ensembl genome browsers both allow users to upload additional tracks that annotate the genome further to help understand its architecture. Genome browser tracks also allow users to analyze their own sequencing data against the reference genome data and genome annotations. You can now upload files from UniProt to genome browsers to easily compare UniProtKB protein features with other genomic information, and with your own sequencing data if available, bridging the protein and gene visually.

Each species represented (currently only human) within the genome annotation tracks resource will have its sequence annotations defined with the BED and bigBed formats.

For example, the human active site BED file is called UP000005640_9606_act_site.bed. BigBed formatted files have a .bb extension. You will see two directories on the FTP site for each species (currently only human): one directory for the BED files and a track hub directory that can be used to add all UniProtKB sequence annotations for a species to a genome browser.
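If you would like to inspect a BED file before uploading it, a minimal Python sketch along the following lines may help. It assumes the standard tab-separated BED layout (chromosome, 0-based start, exclusive end, then optional name, score and strand) and ignores any further columns. The example line is made up for illustration and is not taken from a UniProt file.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int             # 0-based, inclusive
    end: int               # exclusive
    name: Optional[str]
    score: Optional[int]
    strand: Optional[str]


def parse_bed_line(line: str) -> Optional[BedFeature]:
    """Parse one BED line; return None for blank, comment or header lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(("#", "track", "browser")):
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    return BedFeature(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3] if len(fields) > 3 else None,
        # Assumes an integer score column, as in standard BED.
        score=int(fields[4]) if len(fields) > 4 else None,
        strand=fields[5] if len(fields) > 5 else None,
    )


# Hypothetical feature line, for illustration only.
example = "chr10\t1000\t1003\texample_active_site\t0\t-"
feature = parse_bed_line(example)
print(feature.chrom, feature.name, feature.end - feature.start)
```

Because BED starts are 0-based and ends are exclusive, the feature length is simply end minus start.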

All UniProt annotation tracks can be added in one single step by adding the UniProt species track hub. Simply copy the URL for the species hub.txt file and follow the genome browser instructions on how to add a track hub.

Adding a UniProt species track hub to the Ensembl genome browser.

UniProt FGFR2 features uploaded using a track hub visualized in the Ensembl genome browser.

Adding a track hub in the UCSC genome browser.

UniProt FGFR2 features uploaded using a track hub visualized in the UCSC genome browser.

To add specific feature annotation tracks to a genome browser, simply copy the URL of the file and follow the instructions on how to add custom tracks in the Ensembl or UCSC genome browser. Individual bigBed files can be added as tracks to a genome browser by using the track definitions provided in the species tracks.txt (UP000005640_9606_tracks.txt) file.
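To give a feel for what a track definition looks like, here is a small Python sketch that builds a one-line definition in the UCSC custom track syntax for a remote bigBed file. The URL, the .bb file name and the track labels are placeholders we made up, and the definitions in the UniProt tracks.txt file may carry different or additional settings, so treat this as an illustration rather than a copy of that file.

```python
def bigbed_track_line(name: str, description: str, big_data_url: str) -> str:
    """Build a one-line custom track definition pointing at a remote bigBed file."""
    for label in (name, description):
        if '"' in label:
            raise ValueError("track labels must not contain double quotes")
    return (
        f'track type=bigBed name="{name}" '
        f'description="{description}" bigDataUrl={big_data_url}'
    )


# Hypothetical URL: substitute the real location of the .bb file you copied.
example_url = "https://example.org/path/to/UP000005640_9606_act_site.bb"
line = bigbed_track_line(
    "UniProt active sites",
    "UniProtKB active site annotations",
    example_url,
)
print(line)
```

The resulting line can be pasted into the custom track text box of the browser in place of a file upload.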

Adding custom tracks to the Ensembl genome browser

Adding custom tracks or track definitions to the UCSC genome browser.

UniProt active site annotation track in the Ensembl genome browser.

We welcome your feedback on this new resource!

Tuesday, November 24, 2015

UniRule automatic annotation system in UniProt

UniProt has developed two prediction systems, UniRule (Unified Rule system) and the Statistical Automatic Annotation System (SAAS), to automatically annotate unreviewed UniProtKB/TrEMBL entries in an efficient and scalable manner.

UniRule is a rule-based automatic annotation system that consists of rules devised and tested by experienced curators using experimental data from expertly annotated entries. It automatically annotates entries with a high degree of accuracy. This helps leverage curators' knowledge and expertise to add annotation to a much larger set of protein entries than could be annotated through expert curation alone.

UniRule has been developed by merging existing curated rule-based systems (HAMAP, PIR name and site rules, and RuleBase rules) into one system which stores, applies, and evaluates all rules.

What is a rule and how does it work to annotate proteins?

Let us look at a fictitious rule to see how this concept works for a basic rule.

Could you make this rule even more granular and specific by adding more conditions?

In this example, the main conditions delineate the space that can be annotated as a 'purple quadrilateral' and the further conditions help add more specific annotation of being a 'square' to a subset.

This is essentially how rules are created with main conditions and additional conditions to identify sequence matches for which certain annotation can be applied with confidence. The quality of the rules is maintained thanks to the expert curators creating and checking rules before application.
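For readers who think in code, the idea of main and additional conditions can be sketched in a few lines of Python. This is only a toy model of the fictitious shape rule, not how UniRule is implemented: the class, the field names and the exact conditions (purple, four sides, equal sides, right angles) are our own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Entry = Dict[str, object]
Condition = Callable[[Entry], bool]


@dataclass
class Rule:
    main_conditions: List[Condition]
    main_annotation: str
    extra_conditions: List[Condition]
    extra_annotation: str

    def annotate(self, entry: Entry) -> List[str]:
        """Return the annotations this rule would apply to an entry."""
        # The main conditions delineate the space the rule may annotate.
        if not all(cond(entry) for cond in self.main_conditions):
            return []
        annotations = [self.main_annotation]
        # Additional conditions add more specific annotation to a subset.
        if all(cond(entry) for cond in self.extra_conditions):
            annotations.append(self.extra_annotation)
        return annotations


# Hypothetical conditions for the fictitious shape rule.
rule = Rule(
    main_conditions=[
        lambda e: e.get("colour") == "purple",
        lambda e: e.get("sides") == 4,
    ],
    main_annotation="purple quadrilateral",
    extra_conditions=[
        lambda e: bool(e.get("equal_sides")),
        lambda e: bool(e.get("right_angles")),
    ],
    extra_annotation="square",
)

square = {"colour": "purple", "sides": 4, "equal_sides": True, "right_angles": True}
kite = {"colour": "purple", "sides": 4, "equal_sides": False, "right_angles": False}
triangle = {"colour": "purple", "sides": 3}

for shape in (square, kite, triangle):
    print(rule.annotate(shape))
```

An entry that fails any main condition receives nothing from the rule, so the more specific annotation can never be applied outside the space the main conditions define.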

UniRule annotation in protein entries

If a protein entry contains annotations from the UniRule system, this is indicated in the entry, as seen below.

Clicking on the evidence will take you to the rule that is the source of that annotation. Here you can click on the annotations you're interested in and see how they are applied through the rule or click on the conditions you're interested in and see which annotations they would apply.

If you would like to explore the rules for proteins, taxonomic groups, etc. that interest you, you can also search the UniRule set directly. Just click on the dropdown to the left of the search box to change the focus from 'UniProtKB' to 'UniRule' and enter your query.

So now you can explore the rules that UniProt has built to annotate the sequence space you are interested in! We always love to hear feedback, so please let us know how you plan to use this functionality and whether there is any additional functionality you would find useful. You can also email us at help@uniprot.org with queries and feedback.

Thursday, September 10, 2015

Linking proteins via pathways

Proteins in UniProt are now linked and connected by pathways! When looking at your protein of interest, you will now be able to see if it is involved in any known pathway and then be able to follow links to other proteins involved at different stages of the pathway hierarchy. This allows you to traverse the world of proteins through the pathways that connect them!

Let's follow the example of protein 3-hydroxyanthranilate 3,4-dioxygenase in Baker's yeast. This protein catalyses the oxidative ring opening of 3-hydroxyanthranilate. When looking at the protein in UniProt http://www.uniprot.org/uniprot/P47096, I see the 'Pathway' comment in the 'Function' section. Let's look at this comment more closely.

I can see the main pathway title that my protein is involved in, in this case NAD(+) biosynthesis. I see exactly which step of which subpathway my protein is involved in. I then see all steps of the subpathway listed out. My protein is involved in Step 3 but I can also see links to the proteins that are involved in the first two steps.

The subpathway, its parent pathway and superpathway are all linked to UniPathway for more information. The final line in the 'Pathway' comment provides links to all proteins involved in the same subpathway (from the same organism) as my protein, its parent pathway and even the superpathway another level up from the parent pathway. In this example, if I follow the link to Cofactor biosynthesis, I see all 63 proteins involved in this pathway listed out.
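If you work with pathway comments programmatically, the hierarchy described above can be pulled apart with a short Python sketch like the one below. It assumes the comment is available as a single line of text with levels separated by semicolons and ending in ": step N/M.". Both that layout and the example string (in particular the subpathway name and the total number of steps) are assumptions made for illustration, so please check them against the entry itself.

```python
import re
from typing import List, NamedTuple, Optional


class PathwayComment(NamedTuple):
    levels: List[str]            # from the broadest level down to the subpathway
    step: Optional[int]
    total_steps: Optional[int]


STEP_PATTERN = re.compile(r":\s*step\s+(\d+)/(\d+)\s*$")


def parse_pathway_comment(text: str) -> PathwayComment:
    """Split a pathway comment into its hierarchy levels and step information."""
    text = text.strip().rstrip(".")
    step = total = None
    match = STEP_PATTERN.search(text)
    if match:
        step, total = int(match.group(1)), int(match.group(2))
        text = text[: match.start()]
    levels = [part.strip() for part in text.split(";") if part.strip()]
    return PathwayComment(levels, step, total)


# Assumed text layout; the subpathway name and step count are illustrative.
example = "Cofactor biosynthesis; NAD(+) biosynthesis; quinolinate from L-kynurenine: step 3/3."
comment = parse_pathway_comment(example)
print(comment.levels)
print(comment.step, comment.total_steps)
```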

Try this out and let us know what you think! Your feedback and suggestions are always welcome.

Wednesday, September 2, 2015

Have you tried our new Beta UniProtJAPI?

A new Beta version of the UniProtJAPI is now available! It aims to improve several issues encountered by the current version such as frequent library updates, retrieval speeds and server availability. We invite you to try the Beta JAPI and share your feedback via a short survey so we can provide the best service for your needs.

Please try the Beta UniProtJAPI for your tasks at http://wwwdev.ebi.ac.uk/uniprot/remotingAPI/index.html and then fill out this short 5-minute survey: https://www.surveymonkey.com/r/2DYCMQM.

Here are some examples of tasks you could try on the JAPI (or simply use your own tasks):

Create a UniProt, UniRef and UniParc service, and use each of these to find out the number of entries in the release.
Retrieve entries containing keywords “Kinase” or “Amyloidosis”.
Retrieve entries that have been updated in the last six months, and then filter only those that are reviewed (Swiss-Prot).
Given the PFAM signature “PM00228”, find the associated reviewed (Swiss-Prot) and non-reviewed (TrEMBL) entries.

We hope you will find our new JAPI useful!

Thursday, July 9, 2015

See you at ISMB-ECCB 2015!

The 23rd Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2015) hosted by the International Society for Computational Biology (ISCB) will be held together with the 14th Annual European Conference on Computational Biology (ECCB 2015), jointly as ISMB/ECCB 2015. This year the conference is being held in lovely Dublin and a bunch of us from UniProt will be in attendance!

We will be presenting talks and posters at the Special Interest Groups being held before the conference and also at the main conference itself. UniProt's PI, Alex Bateman, will also be co-chairing the conference this year! Would you like to hear about the latest developments in UniProt and interesting new projects that we've been working on? Do you have some questions about UniProt that you'd like to discuss? Or maybe you'd just like to see some friendly faces in Dublin while you're there? Come find one of us! Below is a list of who we are, where we'll be and what we'll be presenting. Tweet us or message us on Facebook. We're looking forward to seeing you there!

Saturday, 11 July 2015
09.40 to 10.00	Claire O’Donovan - Building a community for genome and proteome annotation	Talk in the AFP (Automated Function Prediction) SIG
11.10 to 11.30	Tunca Dogan - UniGOPred and ECPred : Automated Function Prediction Tools Based on A Combination of Different Classifiers	Talk in the AFP (Automated Function Prediction) SIG
15.05 to 16.10	Tunca Dogan - Protein Function Prediction in UniProt with the Comparison of Structural Domain Arrangements	Poster in the AFP (Automated Function Prediction) SIG
Sunday, 12 July 2015
17.45 to 19.00	Andrew Nightingale - Bringing protein functional positional annotation knowledge into reference genomes	Poster (odd) A37
17.45 to 19.00	Tunca Dogan - ECPred: Enzyme Prediction Using a Combination of Classifiers	Poster (odd) L63
Monday, 13 July 2015
14.00 to 14.20	Benoit Bely - The Universal Protein Resource (UniProt): New Development on Proteomes, Variation and Proteomics	Tech Track talk
17.45 to 19.00	Michele Magrane - Recent developments in UniProt: improving access to protein knowledge	Poster (even) P2
17.45 to 19.00	Tunca Dogan - Computational drug target prediction and validation in PI3K/AKT pathway	Poster (even) A18
17.45 to 19.00	Rabie Saidi - Association Rule Mining for Metabolic Pathway Prediction	Poster (even) L04
17.45 to 19.00	Maria Martin - Improving protein knowledge by Integrating proteomics data with UniProt Reference protein sets	Poster (even) M12

Wednesday, July 1, 2015

Do you have a protein to share with the world?

Did you know that you can submit protein sequences directly to UniProt to be included in our database? We accept protein sequences that have been determined at protein level (via Edman degradation or manual interpretation of tandem-mass spectrometry data) and have clear species identification.

The UniProt home page contains a link in the 'UniProt data' section that says 'Submit your data'. This takes you to a UniProt help page about data submission. Just follow the link for submitting protein sequences to get started! Submission is done using SPIN, a web-based tool where you can enter your protein sequence and any associated biological information. All of the information required to create a database entry will be collected during the submission process.

You can submit large batches of multiple sequences as well as individual sequences. A UniProt curator will then review your submission and create a reviewed (UniProtKB/Swiss-Prot) entry for each submitted sequence. Once your protein is in UniProtKB, it will be assigned a stable accession number which can be used in publications. Data can be kept confidential until publication.

Note that translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources are automatically transferred to the TrEMBL section of UniProtKB and do not need to be submitted to UniProtKB separately: http://www.uniprot.org/help/sequence_origin.

The direct submissions we get often include interesting proteins like toxins or anti-freeze proteins and sometimes even proteins from extinct organisms! For example, here's a fragment of a Neanderthal bone protein that we received as a direct submission http://www.uniprot.org/uniprot/P84351.

So if you would like to make your sequences accessible to other researchers, even if they are fragments, just send them in with their species and identification method. If you have any questions, please feel free to write to us at help@uniprot.org.

Monday, May 4, 2015

UniProt Knowledgebase just got smaller!

UniProt release 2015_04 at the beginning of April 2015 saw the number of protein entries in the UniProtKB go from 92,672,207 to 47,262,724. Wondering what happened?

Prior to release 2015_04, UniProtKB had doubled in size over the previous year to more than 90 million entries, with a high level of redundancy. This was especially true for bacterial species where different genomes of the same bacterium have been sequenced and submitted (e.g. 4,080 proteomes for Staphylococcus aureus comprising 10.88 million entries).

To deal with this redundancy, we developed a procedure to identify highly redundant proteomes within species groups. This procedure was implemented for bacterial species and the sequences corresponding to redundant proteomes (approximately 47 million entries) were deprecated. All of these protein entries belonged to the unreviewed TrEMBL part of the UniProt Knowledgebase. Reviewed Swiss-Prot protein entries are unaffected by the procedure.

So how does this procedure actually work? Here we break it down into four steps.

Step 1: Group proteomes by taxonomy level

Proteomes can only be redundant to other proteomes of the same taxonomy branch at species level or below (sub-species, strains, etc.).

Step 2: Pairwise comparison of proteomes within each group

We use the CD-HIT-2D program for pairwise comparison of proteomes within each group. Based on the results, we calculate the level of similarity between pairs of proteomes within the groups.

Step 3: Graph analysis

We now select just the proteome pairs with similarity higher than 90%. With these proteomes as nodes, we create a directed weighted graph in which the edge weights are the levels of similarity. To identify the most redundant proteomes, we rank all proteomes in this graph.

Ranking is by Proteome(Indegree, Outdegree) where for a proteome A:

• Indegree (the higher the better): Number of proteomes that are redundant to proteome A.

• Outdegree (the lower the better): Number of proteomes to which proteome A is redundant.

So, for example:

A(5,1) is better than B(1,1)
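To make the ranking concrete, here is a minimal Python sketch of the graph analysis, not our production code. It assumes that an edge from X to A means "X is redundant to A", that indegree is compared first with outdegree as the tie-breaker, and that remaining ties are broken by name. The proteome names and similarity values are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SIMILARITY_THRESHOLD = 0.90  # keep only pairs with similarity higher than 90%

Edge = Tuple[str, str, float]  # (redundant proteome, proteome it is redundant to, similarity)


def degree_table(edges: Iterable[Edge]) -> Dict[str, Tuple[int, int]]:
    """Return {proteome: (indegree, outdegree)} for edges above the threshold."""
    degrees = defaultdict(lambda: [0, 0])
    for source, target, similarity in edges:
        if similarity > SIMILARITY_THRESHOLD:
            degrees[source][1] += 1  # source is redundant to one more proteome
            degrees[target][0] += 1  # one more proteome is redundant to target
    return {name: (indeg, outdeg) for name, (indeg, outdeg) in degrees.items()}


def rank_proteomes(degrees: Dict[str, Tuple[int, int]]) -> List[str]:
    """Best first: higher indegree, then lower outdegree, then name."""
    return sorted(degrees, key=lambda p: (-degrees[p][0], degrees[p][1], p))


# Invented similarities, for illustration only.
edges = [
    ("B", "A", 0.95),
    ("C", "A", 0.97),
    ("A", "B", 0.92),
    ("C", "B", 0.85),  # below the threshold, so ignored
]
degrees = degree_table(edges)
ranking = rank_proteomes(degrees)
print(degrees)   # A(2,1), B(1,1), C(0,1)
print(ranking)   # ['A', 'B', 'C']
```

In this toy example, C ends up at the bottom of the ranking, which makes it the first candidate for elimination.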

Step 4: Elimination of redundant proteomes

Proteomes that rank lowest are the most redundant. These are marked as ‘redundant’ on the UniProt proteomes web portal and protein entries belonging to these redundant proteomes are removed from UniProtKB/TrEMBL.

This process is run iteratively to identify all redundant proteomes. All proteomes remain searchable through UniProt’s Proteomes interface (http://www.uniprot.org/proteomes/) and redundant proteome sets are now available for download from the UniProt Archive UniParc.