Tuesday, December 5, 2017

How can UniProtKB help the gene regulation community?

This question was asked at a recent meeting of a group discussing the availability of information about the regulation of gene expression (http://greekc.org/). 

The first thing most researchers in this field ask for is simply a list of known transcriptional regulators. These can easily be retrieved from, for example, the human proteome by using the Advanced Search to specify the Keyword as “Transcription regulation” and Organism as “Homo sapiens”.

Adding a second keyword, “DNA-binding”, limits the search to entries annotated as DNA-binding transcription factors. To complete your search, select ‘Reviewed’ in the filters on the left-hand sidebar, which restricts the results to entries in UniProtKB/Swiss-Prot.

If you are just interested in the list of UniProtKB accessions or protein names, you can export it using the download functionality and selecting your favourite format (select “List” to get just the accession numbers).
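If you want to work with an exported list in a script, a minimal sketch could look like the following. It assumes the “List” download is a plain-text file with one accession per line, and it checks each line against the accession pattern given in the UniProt help pages. The function name `parse_accession_list` is purely illustrative.

```python
import re

# Accession pattern as documented in the UniProt help pages
# (assumed here to be current; check the help page before relying on it).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def parse_accession_list(text):
    """Return the unique, well-formed accessions in first-seen order."""
    accessions = []
    for line in text.splitlines():
        acc = line.strip()
        # Skip blank lines, malformed lines and duplicates.
        if acc and ACCESSION_RE.fullmatch(acc) and acc not in accessions:
            accessions.append(acc)
    return accessions


# Stand-in for the contents of a downloaded "List" file.
example = "Q9H3D4\n\nQ9H3D4\nnot-an-accession\n"
print(parse_accession_list(example))  # ['Q9H3D4']
```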

However, if you want to review information about any of these entries, for example human TP63 (UniProt accession Q9H3D4), clicking on the accession number gives you access to a wealth of protein information. For example, you may wish to identify the DNA-binding region of the protein. The “Display” menu on the left-hand side of the UniProtKB entry offers options to see the protein sequence features in a tabular view via the Feature table, or in a graphical view with the ProtVista feature viewer (accessible via the ‘Feature viewer’ link).

In the feature viewer, the “Variants” track can be expanded to show the individual single nucleotide polymorphisms and the diseases they are associated with (see figure below). It can be seen that many single amino-acid variants fall into the DNA-binding region. The data can be filtered to reveal only the disease-related variants and/or those reviewed by UniProtKB. Clicking on a given variant position will show the annotation available for the variant.
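The idea of spotting variants that fall inside a region can also be expressed in a few lines of code. This is only a sketch: the `Feature` type and `variants_in_region` helper are hypothetical, and the coordinates are placeholders, not values from Q9H3D4 or any other real entry.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str       # feature type, e.g. "DNA binding" or "Variant"
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    note: str = ""


def variants_in_region(features, region_kind):
    """Return the variant features that overlap any feature of region_kind."""
    regions = [f for f in features if f.kind == region_kind]
    return [
        v for v in features
        if v.kind == "Variant"
        and any(v.start <= r.end and v.end >= r.start for r in regions)
    ]


# Placeholder coordinates, not taken from any real entry.
features = [
    Feature("DNA binding", 100, 300),
    Feature("Variant", 150, 150, "disease-associated"),
    Feature("Variant", 20, 20, "polymorphism"),
]
hits = variants_in_region(features, "DNA binding")
print([(v.start, v.note) for v in hits])  # [(150, 'disease-associated')]
```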


This information is also detailed in the entry in the Pathology & Biotech section, sorted by disease type.

The feature table summarizes all the sequence features that have been annotated. There you can find the position of a transcription activation domain in this protein, and see that the activity of this domain has been confirmed by mutagenesis studies.

The same information is available in the Feature viewer: focusing on the transcription activation domain, you can review the mutagenesis track.

More information on the regulation of this particular transcription factor, and on the cellular processes it regulates, can be found in the Function section of the entry, both in free-text form and as more structured Gene Ontology annotations.

Thursday, November 9, 2017

Visualising protein interactions in UniProt

The UniProtKB entries include an Interaction section, which details the protein’s binary interactions with other proteins, using a high-quality dataset supplied by the IMEx Consortium.

You can now view the binary interactions in a graph that shows the interaction partners of your protein and also shows which of those partners interact with each other. For example, here is the interaction matrix for the human E3 ubiquitin-protein ligase parkin protein.

Dots dots dots

Each interaction is represented by a dot whose intensity reflects the number of experiments supporting the interaction. Hovering over a dot highlights both partners.
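To make the idea concrete, here is a small sketch of how such a matrix could be built from a list of binary interactions. The partner names are toy labels, and the linear scaling from experiment count to intensity is an assumption for illustration, not a description of how the UniProt website computes it.

```python
from collections import defaultdict


def build_matrix(interactions):
    """Map each unordered pair of partners to its total experiment count."""
    matrix = defaultdict(int)
    for a, b, experiments in interactions:
        # frozenset makes (A, B) and (B, A) the same cell of the matrix.
        matrix[frozenset((a, b))] += experiments
    return dict(matrix)


def intensity(count, max_count):
    """Scale a count linearly into the range 0..1 (assumed scaling)."""
    return count / max_count if max_count else 0.0


# Toy data: (partner, partner, number of supporting experiments).
interactions = [("A", "B", 4), ("A", "C", 1), ("B", "C", 2)]
matrix = build_matrix(interactions)
top = max(matrix.values())
for pair, n in matrix.items():
    print(sorted(pair), n, intensity(n, top))
```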

Information on click

Clicking on an interaction dot brings up a popup window with details about the interaction.

This window contains more information about the interacting partners:

  • Names
  • Identifiers and link to UniProt entry
  • List of diseases, and link to the relevant section of the UniProt entry
  • Subcellular location
  • Number of experiments, and link to IntAct

Filtering the display

We currently have two filters which allow users to filter out data from the graph. They apply if any of the partners in the interaction satisfy the selected criteria.

The two filters are:

  • Subcellular location: this is a tree-based selection menu which allows users to filter proteins based on their location within the cell
  • Disease: only show proteins which are involved in the specified disease(s)

We are working on enhancing this view further. Are there any more filters or other improvements that you would like to suggest? Let us know!
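For readers who think in code, the "any partner" rule behind the filters can be sketched as follows. The data layout and the function names `matches` and `apply_filter` are hypothetical, and the sketch assumes that an active filter keeps the matching interactions in the view.

```python
def matches(interaction, field, selected):
    """True if any partner has at least one of the selected values in `field`."""
    return any(selected & set(partner.get(field, ())) for partner in interaction)


def apply_filter(interactions, field, selected):
    """Keep interactions where any partner matches; an empty selection keeps all."""
    selected = set(selected)
    if not selected:
        return list(interactions)
    return [i for i in interactions if matches(i, field, selected)]


# Toy partners with placeholder annotations.
a = {"name": "A", "location": ["Nucleus"], "disease": []}
b = {"name": "B", "location": ["Cytoplasm"], "disease": ["Disease X"]}
c = {"name": "C", "location": ["Cytoplasm"], "disease": []}
interactions = [(a, b), (a, c), (b, c)]

kept = apply_filter(interactions, "location", {"Nucleus"})
print([(p["name"], q["name"]) for p, q in kept])  # [('A', 'B'), ('A', 'C')]
```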

Wednesday, November 1, 2017

Visualising sub-cellular locations in UniProt

The UniProt Knowledgebase provides protein entries covering key aspects of protein biology divided into sections that group related information.

One of the sections on the protein entry pages is Subcellular Location. This section provides information on the location and the topology of the mature protein in the cell. You can now visually explore the subcellular location in UniProtKB entries. The visualisation presents image templates from COMPARTMENTS (https://compartments.jensenlab.org/) combined with protein location data from UniProt (expert annotation, rule-based automatic annotation) and imported from Gene Ontology (GO) annotation. The figure below shows the subcellular location view for the human Copper-transporting ATPase 2 protein.

Colour-coded by evidence

The subcellular locations in which the protein is found are shown using colours and titles for the compartments. A compartment is coloured gold for 'Manual annotation' or blue for 'Automatic computational assertion'. These colours are also reflected in the clickable evidence tags on the right-hand side, in the tabs showing the text annotation.

Source-based annotation tabs

There are two tabs based on the sources of annotation, one for UniProt annotation and one for GO (Gene Ontology) annotation. You can click on the tabs to view the specific annotation from that source. The image on the left hand side will update to reflect the annotation tab that you are on.

Click to highlight

You can also click on a coloured subcellular location compartment to quickly highlight the corresponding annotation on the right hand side. 

Try it out and let us know what you think! What else would you like to see visually in the UniProtKB entry?

Thursday, July 13, 2017

Search and You Shall Find

Have you ever searched for your protein in UniProt and found too many results? There are a few strategies that can help you narrow them down to the right result.

Free-text searching

The most common way to begin a search is to type your search terms directly into the main search bar. Because the UniProtKB search algorithm ranks results on the basis of properties like relevance, annotation score and entry status (reviewed or unreviewed), a free-text search will often return the most relevant hits at the top of your results.

For example, let's say you're searching for proteins belonging to gene SEP1 from the species C. elegans. Let's try searching free-text for 'sep1 c.elegans'. The results bring back 4 proteins and the C. elegans protein for the gene SEP1 is right on top (as of release 2017_07).

You'll notice that the results encompass genes other than SEP1. This is because entering SEP1 as a free-text search will bring back all entries that mention SEP1 anywhere in the text, including unexpected hits for protein entries that are not SEP1 but mention SEP1 as an interactor for example. Nevertheless, your protein of interest appears at the top of the results set.

However, when you're searching for something that is likely to bring back a lot more results (like 'Kinase'), you might not be as lucky with free-text searching.

Filtering your search results

One way of narrowing down your results set is by using the Filters on the UniProtKB results page. The results page provides a Reviewed/Unreviewed status filter, an Organism filter and a special Search terms filter. You can use the Popular organisms filter to quickly select your desired organism if your query has found results in more than one organism. The 'Search terms' filter lets you select a category for each of your search terms. For example, here we can mark the search term "sep1" as 'gene name' to indicate what type of term this is. This ensures that the website search interprets the search terms in the desired way.

Auto-completion for free-text searching

UniProt offers another solution to help you define your search. When you type a search term into the UniProt search box, the site presents an autocompletion suggestion that offers a category to define your term. For example, when you type in 'absorption', you see the autocompletion suggestion 'annotation: absorption'. When you type in 'Ensembl', you see the autocompletion suggestion 'Database:Ensembl' amongst other databases that match your term.

If the suggestion matches the type of category you had in mind, selecting it and then launching the search will help find better matches.

Advanced search for better targeted results

The most powerful way of searching UniProtKB is to use the Advanced search option. It allows you to search for entries by restricting your search to specific categories of data. To bring up the advanced search query builder just click on the 'Advanced' link in the search bar. 

Now you can select the categories from a dropdown (the default is 'All' categories) and then specify a related search term. You can enter any number of fields like this and also change the Boolean relationships between them (AND, OR, NOT). For example, you could select 'Gene name' and enter SEP1, then in the next row select 'Organism' and enter Caenorhabditis elegans. You will see that the autocompletion brings up a number of suggestions matching your term. Select “Caenorhabditis elegans [6239]”, where 6239 is the taxonomy identifier for Caenorhabditis elegans. Submitting this will return just your one specific search result (the same as when using filters as described above), as opposed to the 4 in the free-text search.

The advanced search category dropdown provides a huge number of options in a nested tree structure. For example, if you select 'Function' you will see that there are a number of sub-topics available within it such as 'Active site'. We recommend clicking on the 'help' link in the advanced search widget which takes you to a page with the entire tree structure listed out. This will help you find your topic of choice within the advanced search category dropdown. Note that the search topics (sub-topics) in the advanced search match those in the UniProtKB entry, so it is recommended to familiarize yourself with the entry content to better exploit the search capabilities.

You will note that both approaches – using filters and advanced search – lead you to the same query and URL, namely
gene:sep1 AND organism:"Caenorhabditis elegans [6239]". This means that you can easily combine both approaches, and you can open any query, whether you built it by typing terms or by using filters, with the advanced search to include additional criteria.
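If you build such queries in a script, the same string can be assembled from field/value pairs and then URL-encoded. The sketch below only reproduces the query text shown above; the helper names `term` and `build_query` are illustrative, and no particular endpoint is assumed.

```python
from urllib.parse import quote


def term(field, value):
    """Format one field:value clause, quoting values that contain spaces."""
    return f'{field}:"{value}"' if " " in value else f"{field}:{value}"


def build_query(*clauses, operator="AND"):
    """Join clauses with a Boolean operator such as AND or OR."""
    return f" {operator} ".join(clauses)


query = build_query(
    term("gene", "sep1"),
    term("organism", "Caenorhabditis elegans [6239]"),
)
print(query)         # gene:sep1 AND organism:"Caenorhabditis elegans [6239]"
print(quote(query))  # percent-encoded form, suitable for a URL query parameter
```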

Using category selections to find a set of proteins 

For most search categories, entering text in the term box is optional. If you select 'Active site' and choose to leave the term box blank, you will simply get back all proteins with an 'Active site' associated with them. Thus if you wanted to find all human proteins annotated to be related to a disease, go into the advanced search and select the category 'Organism' and enter human, then select 'Pathology and Biotech' -> 'Disease' and leave the term empty. Hit enter and you will find the results set of all human proteins associated with a disease.  

To quickly confirm this, you can click on the 'Columns' button and add the 'Involvement in disease' column to your results page. You can now see all the information about diseases within your results table.

Tuesday, April 11, 2017

Curating the C. elegans kinome in UniProt

One of the key strengths of the UniProt Knowledgebase is the expert curation that goes into the entries in the reviewed Swiss-Prot section. Here we give you an insight into a recent curation project to review and annotate the kinome of the nematode worm Caenorhabditis elegans. The image below shows an overview of the project, including the proportion of Swiss-Prot entries for C. elegans kinases at the start and end of the project, the breakdown of the kinome into kinase families and a word cloud of the most prevalent GO terms found in the C. elegans kinome. 

This project builds on previous work in the group to curate the human and mouse kinomes, which was completed and published in 2008. In addition to the ongoing update of the human and mouse kinomes as new information becomes available, we decided to extend curation efforts to the C. elegans kinome. C. elegans contains 438 kinases and almost half have been functionally characterized, highlighting that C. elegans is a valuable model organism for understanding the role of kinases in biological processes. In addition, studies in C. elegans can shed light on human biology and disease. For example, genetic studies of C. elegans lrk-1, a homolog of the human kinase LRRK2, which is involved in Parkinson’s disease, have helped to shed light on its role in the development of the nervous system and provided some clues to help understand the progressive neurodegeneration caused by LRRK2 mutations.

Some key characteristics of the C. elegans kinome are:

  • It contains the same proportion of kinases as the human proteome (approximately 2% of both proteomes)
  • It contains members from all 10 kinase groups
  • Kinase domains are found not only in cytoplasmic proteins but also in transmembrane proteins. One C. elegans kinase, H03A11.1, is thought to be secreted, based on similarity to human FAM20C, which has been shown experimentally to be secreted
  • Pseudokinases represent 9% of the C. elegans kinome
  • C. elegans has many unique kinases, including members of the CK1 group, which have not been studied experimentally. Understanding the function of these kinases could provide valuable information for developing strategies to eliminate parasitic worms

We recently published an article describing this effort in the Biochemical Journal. You can read the results of the project here: http://www.biochemj.org/content/474/4/493

Thursday, November 17, 2016

Being FAIR at UniProt

We are living in the times of Big Data, with high-throughput genomics leading to massive biological data sets. While this data presents opportunities for innovation and discovery, it also creates immense challenges for open access, data handling, processing and analysis. One of the ways to ensure that the scientific community can get the most out of the data available is to ensure our data is FAIR.

What is FAIR?

Good data management is essential to facilitate knowledge discovery, innovation, integration and reuse by the community after the data publication process. The FAIR Data Principles present a guideline to standardise and improve data management with four foundational principles: Findability, Accessibility, Interoperability, and Reusability. The FAIR Guiding Principles were originally described in full in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/. The FAIR guidelines have been developed with the requirements of both human readers and machine access in mind.

A FAIR UniProt

As one of the world's largest freely available biological data resources, providing key life science data in the most open and accessible manner to the scientific community is at the heart of our mission. Good data management is essential for us to continue to support cutting-edge research in a sustainable and reliable manner. We see first hand the challenges of data management and dissemination and welcome the FAIR guiding principles for data resources. 

UniProt was one of the case studies presented in the original FAIR publication. What makes UniProt FAIR?

All entries are uniquely identified by a stable URL that provides access to the entry in a variety of formats including a web page, XML, plain-text, RDF and REST services (‘F’ and ‘A’).
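As an illustration of what one stable identifier with several representations means for a script, here is a minimal sketch that builds per-format URLs for an entry. The base URL and the file extensions are assumptions based on the current UniProt REST service rather than anything stated in this post, so check the API documentation before relying on them.

```python
# Assumed base URL of the current UniProt REST service (not taken from this post).
BASE = "https://rest.uniprot.org/uniprotkb"

# Assumed format extensions; the post itself names XML, plain text and RDF.
FORMATS = {"xml", "txt", "rdf"}


def entry_url(accession, fmt):
    """Build the URL of one representation of a UniProtKB entry."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{BASE}/{accession}.{fmt}"


for fmt in sorted(FORMATS):
    print(entry_url("Q9H3D4", fmt))
```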

Interlinking with more than 150 different databases, every UniProt entry has extensive links into, for example, PubMed, enabling rich citation. These links are key to our user experience in human and machine readable formats ('I').

The entries contain rich metadata (‘F’) that is both human-readable (HTML, text format) and machine-readable (XML and RDF). All our representations use shared vocabularies and ontologies such as GO and ECO (‘I’). Our RDF representation additionally uses the UniProt RDF Schema Ontology and FALDO ('I'&'R').

UniProt strives for interoperability by representing data that is common with another database in exactly the same way (‘R’). For example, the information about GO terms in UniProt RDF is a pure subset of the information about GO terms in the GO consortium's database. This kind of common data representation allows FAIR RDF databases to fit together as puzzle pieces in the larger life science data world.

Being FAIR

We're not the only fans of FAIR data management. FAIR principles have been adopted as a touchstone for funders and policy groups including the NIH Data Commons, the G20 Hangzhou Consensus, the Amsterdam Call for Action on Open Science and the European Open Science Cloud.

Challenges ahead

Being FAIR is not without its challenges. Not all formats of data might be FAIR for humans and machine readers alike. At UniProt, we handle this issue by ensuring that we also provide all our data in formats that are FAIR to complement any that might not be. This is an on-going effort with every new data type and data service we provide but is essential to make sure our data is valuable and actionable so that the community can make the most of it. 

Monday, October 10, 2016

Automatic learning based annotation in UniProt

Have you ever wondered how data mining and machine learning techniques might help in knowledge curation? Let us introduce you to the Statistical Automatic Annotation System (SAAS) in UniProt!

UniProt has an automatic annotation project that enhances unreviewed TrEMBL entries in the UniProt Knowledgebase (UniProtKB) by enriching them with automatically predicted annotations. SAAS is one of the systems that contribute to this project.

SAAS is an automatic system with quality validation input from curators, such as the exclusion of some data types that are not appropriate for propagation. It learns from the properties present in the reviewed UniProtKB (Swiss-Prot) entries and uses the following attribute types to define the learning entries: InterPro protein family, taxonomy and sequence length. This combination allows SAAS to generate rules to annotate protein properties such as function, catalytic activity, pathway membership, subcellular location, protein names and feature predictions.

SAAS based evidence for UniProtKB annotation

When an annotation is added to an entry based on an automatic annotation from a SAAS rule, the evidence tag indicates this along with a link to the rule itself.

Browsing SAAS rules

In order to browse the dataset to view rules of interest, click on the dropdown next to the search box in the UniProt website and select ‘SAAS’. Now enter a query and hit the search button.

Exploring SAAS rule pages

Conditions are listed on the left-hand side of the rule page and annotations are on the right-hand side. If a condition holds true then the corresponding annotation is applied.
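The condition-then-annotation idea can be sketched in code. This is a simplified illustration only: the `Rule` class, its fields and the placeholder InterPro identifier are hypothetical, and real SAAS rules may combine conditions in ways this sketch does not capture.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    interpro: str        # required InterPro family (placeholder ID below)
    taxon: str           # taxon that must appear in the entry's lineage
    min_length: int      # inclusive sequence length bounds
    max_length: int
    annotations: dict    # annotations to apply when all conditions hold

    def applies_to(self, entry):
        """True if the entry satisfies every condition of this rule."""
        return (
            self.interpro in entry["interpro"]
            and self.taxon in entry["lineage"]
            and self.min_length <= entry["length"] <= self.max_length
        )


def annotate(entry, rules):
    """Collect the annotations predicted by every rule that applies."""
    predicted = {}
    for rule in rules:
        if rule.applies_to(entry):
            predicted.update(rule.annotations)
    return predicted


# Placeholder rule and entry; none of these values come from a real SAAS rule.
rule = Rule("IPR_PLACEHOLDER", "Bacteria", 200, 400,
            {"subcellular location": "Cytoplasm"})
entry = {"interpro": {"IPR_PLACEHOLDER"},
         "lineage": ["Bacteria", "Proteobacteria"],
         "length": 310}
print(annotate(entry, [rule]))  # {'subcellular location': 'Cytoplasm'}
```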

SAAS annotation data is recalculated for every UniProt release to ensure that the annotations are accurate and up-to-date.