Inside UniProt: February 2022

UniProt is brought to you by a large team of dedicated scientists who have worked for over 20 years to produce a comprehensive view of protein sequence and biology. This work would not be possible without the generous support of the National Institutes of Health (NIH), the European Molecular Biology Laboratory (EMBL), and the Swiss Secretariat for Education Research and Innovation (SERI). Throughout this period we have been asked regularly by our funders to describe the impact that their support has in both scientific and economic terms. To do this we have developed a number of scientific impact metrics and indicators that are normally only ever seen by a small number of people; we will share some of these scientific impact metrics with you in this blog, while those interested in learning more about the economic impact of resources like UniProt may wish to consult the 2021 report on “The value and impact of EMBL-EBI managed data resources” by Charles Beagrie and associates. We hope that you will be impressed by the numbers, that they will encourage you to continue to support UniProt, and that they may inspire you to apply some of these tools and techniques we describe to study the impact and benefits of your own work.

All the tools that we use to calculate impact metrics are freely available, and this work should be relatively easy to replicate for your own data. We would like to give a special thanks to the teams at PubMed and Europe Pubmed Central (Europe PMC) for all their hard work in developing a series of brilliant tools to organize and index the scientific literature - thank you! Many of the impact metrics described here are widely used by other biomedical data repositories and knowledgebases (according to a recent report from the NIH Office of Data Science Strategy). In what follows we provide a gentle introduction to how we measure them.

Website metrics

UniProt’s website acts as its shop window, the main point of access to the knowledge that we have gathered and organised over the years. We have aimed to make the website simple to use and yet powerful, as well as making all data freely available for download. The key tools for monitoring our website are Google Analytics and web server log analysis.

Figure 1. Google analytics analysis of the UniProt website.

Web statistics are quite challenging to interpret. In classical web server log analysis, the statistics on unique users are based on the number of unique IP addresses counted, but a single person may have multiple devices, each with a unique IP address, while institutions with thousands of users may show only a single IP address. Google Analytics and similar services use cookies to count interactive users and therefore overestimate actual users by the number of devices they use. The increasing usage of cloud computing also leads to an overestimation of user numbers. So these figures should only be interpreted as indicative of the scale of the number of users and general trends in usage over time. All web analytics software also change how certain numbers are calculated from time to time, which can lead to increases or decreases in user numbers that are related to methodology rather than actual usage. We note that at this time there are growing legal issues with using Google Analytics in Europe and we may have to reevaluate the tools we use.

Downloads

In addition to web statistics we also monitor downloads from our ftp sites. Our users typically download in excess of 750,000 Gb of data from UniProt each year - the equivalent of over 83,000 human proteomes every single day of the year! This data may be consumed directly by our users or incorporated into other resources, whose users consume UniProt data indirectly. We talk more about this ‘indirect’ usage of UniProt in the final section of the blog.

Citations

Citations of scientific literature are seen as a primary measure of impact within the sciences. Thus we often report the number of citations to the papers that we have written about UniProt such as our biennial papers in the Nucleic Acids Research Database Issues. But, how to choose our source of citations? Although in the past SCOPUS and ISI were our main sources, we, like most people, tend to use Google Scholar when reporting citations because it is freely available to all and due to its indexing strategy, tends to give the highest number of citations for any particular paper. Table 1 below shows the ten most cited papers describing UniProt. All were published in the annual Nucleic Acids Research Database Issue, which has played a key role in promoting the citation of data and knowledgebases in research - thank you! Taken together, these and other UniProt papers have amassed over 27,000 citations. This is a very large number, but even this does not represent the true impact of UniProt, as many users do not cite our papers. In the next section we delve into how to use full text articles to find this additional, uncited usage of UniProt.

Table 1. Citations of papers describing UniProt ordered by number of citations. All citation numbers are taken from Google Scholar in October 2021. Only the top ten citations are shown for the sake of brevity.

Journal	Year	Citations	Title
Nucleic Acids Research	2015	3753	UniProt: a hub for protein information
Nucleic Acids Research	2019	3332	UniProt: a worldwide hub of protein knowledge
Nucleic Acids Research	2017	3286	UniProt: the universal protein knowledgebase
Nucleic Acids Research	2004	3119	UniProt: the universal protein knowledgebase
Nucleic Acids Research	2005	2002	The universal protein resource (UniProt)
Nucleic Acids Research	2011	1525	UniProt Knowledgebase: a hub of integrated protein data
Nucleic Acids Research	2007	1417	The universal protein resource (UniProt)
Nucleic Acids Research	2012	1292	Reorganizing the protein space at the Universal Protein Resource (UniProt)
Nucleic Acids Research	2014	1250	Activities at the universal protein resource (UniProt)
Nucleic Acids Research	2006	1219	The Universal Protein Resource (UniProt): an expanding universe of protein information
		5682	All other UniProt papers

	Total citations	27709

Using full text of articles

We love it when researchers find UniProt useful for their research and write about it in their papers, but they don’t always cite our papers. Sometimes journals impose limits on the number of references that can be included and sometimes researchers don’t think that data resources need citing. UniProt, like many other well established resources including PubMed, the GeneOntology, or the Model Organism Databases under the Alliance of GenomeResources, has simply become part of the scientific “furniture”, and many researchers use it without any further thought of attribution. Nevertheless, with access to the full text of papers it is possible to go looking for UniProt use without citation of UniProt papers.

Europe PMC provides access to the full text of 7 million papers, and if we hunt for UniProt or one of its synonyms in this set we find over 100,000 mentions of UniProt. The graph below shows the growth of mentions of UniProt over the last two decades. This growth looks quite spectacular, and actually matches quite closely the rapid growth of available full text literature. UniProt is mentioned in about 1.7% of all full text papers available in PubMed, and in our previous work we estimated that about 10% of this body of literature is related to protein biology. So we think this shows UniProt is very widely used and mentioned in the relevant literature.

Figure 2. Graph showing the number of articles with mentions of UniProt or a synonym per year across the Europe PubmedCentral corpus.

The advanced search capabilities at Europe PMC allow us to search in specific sections of papers and identify where UniProt is mentioned (Figure 3). UniProt is most frequently mentioned in the Methods and Results sections of papers highlighting its key role in the design, execution, and interpretation of experiments. The pattern of UniProt mentions is similar to that for popular experimental techniques such as ‘proteomics’ and ‘CRISPR/Cas9’, although these tend to be mentioned more frequently in the ‘Introduction’ and ‘Discussion’ sections, perhaps as they are themselves the subject of much study and debate. Despite these small differences, UniProt is clearly a very important tool that research scientists mention in their publications.

Figure 3. Graph showing the number of mentions of the search terms ‘UniProt’, ‘proteomics’, and ‘CRISPR/Cas9’ across sections of papers.

We also track the mentions of UniProt accession numbers (our stable identifiers for proteins) within the scientific literature. Europe PMC currently supports searches for identifiers from 46 molecular biology databases through its advanced search. In 2020 over 5,000 papers mentioned a UniProt accession number, which helps to connect the literature to not just UniProt, but also the 150 molecular biology data- and knowledgebases that UniProt cross links to. We strongly encourage researchers to cite UniProt accession numbers when publishing studies on proteins and their biology.

Figure 4. Mentions of UniProt accession numbers (stable identifiers) in the literature per year.

Acknowledgements of grant funding

A question we get from funders is whether UniProt is relevant to the particular researchers they support. Researchers are required to acknowledge funding sources in their publications, and these acknowledgements are increasingly provided as structured data that is indexed by PubMed. We can therefore use PubMed to investigate how often UniProt is mentioned in publications that acknowledge our funders, such as the various NIH Institutes that support UniProt. UniProt is widely mentioned across NIH sponsored research, with the most publications from NIGMS researchers (Figure 5). However, if we normalise by the number of publications acknowledging each NIH Institute, we see a different pattern emerging, with publications by NHGRI funded researchers being the most enriched in mentions of UniProt.

Figure 5. Mentions of UniProt in papers acknowledging funding across NIH Institutes (left panel) and normalised to the mean fraction of UniProt mentions across all NIH funding acknowledgements (right panel), where the mean is set to 1.

VOSviewer and authorship networks

The metrics that we have described so far have been great for summarising the magnitude of use of UniProt in the literature. However, they don’t easily convey a sense of who is using UniProt. We have found the fantastic VOSviewer tool developed by Nees Jan van Eck and Ludo Waltman at Leiden University's Centre for Science and Technology Studies (CWTS) to be an excellent tool to summarise this complex data. After downloading VOSviewer you can search Europe PMC to get a selection of relevant papers you want to analyse; for example this query finds papers that mention UniProt and acknowledge funding of NINDS at ePMC.

At the time of writing there were over 2000 search results which we exported in RIS format. We can use this data in VOSviewer to create a co-authorship network shown in Figure 6 below from this data.

Figure 6. Co-authorship network of researchers who mention UniProt and acknowledge NINDS funding in their papers.

It is now much easier to investigate the research papers of the co-author clusters to get an understanding of how the NINDS funded community is using UniProt. For example, the large central green cluster comes from Nick Seyfried and colleague’s work on using proteomics, systems biology, and molecular biology to tackle fundamental questions related to the pathogenesis of Alzheimer’s Disease (AD) and other neurodegenerative disorders. They use UniProt as a source of protein sequences for their peptide identification pipeline. VOSviewer has a wide range of functionalities and we encourage you to explore them.

UniProt in patents

It can be quite difficult to understand how UniProt is used in industry, at least through literature based approaches, with far fewer academic papers published by our industrial colleagues. However, the patent literature gives us a window into the use of UniProt for biotechnology and biomedical applications. We have used the SureCHEMBL resource (https://www.surechembl.org/search/) to identify relevant patents. We can find more than 35,000 patents that mention UniProt or one of its synonyms.

Figure 7. Graph showing growth of UniProt mentions across the patent literature per year.

Mentions of UniProt in the patent literature have been steadily increasing over time, with nearly 3,000 mentions per year in 2020. Patents mention specific proteins and cite UniProt accession numbers as a clear way to identify the specific materials of the patent.

Indirect usage of UniProt

Web analytics and citations of UniProt in literature and patents provide a good picture of the impact of a resource like UniProt, by measuring direct usage of UniProt data, as do downloads. However we know that UniProt data is also consumed indirectly too, by the user communities of many other resources and tools - like the Gene Ontology, Reactome, IMEx/IntAct, STRING, Ensembl, PDBe, InterPro, CATH, ChEMBL, Open Targets, and PHAROS, to name but a few. Each of these resources supports its own large and diverse user community, who also rely indirectly on UniProt data, sometimes perhaps without even realizing it. UniProt data is also used to train hundreds of computational tools to study and predict many aspects of protein biology, such as their functions (as in the Critical Assessment of Functional Annotation) and structures (like AlphaFold from DeepMind). Understanding the complex patterns of data use and reuse that are a feature of resource and tool development will be key to creating a more complete picture of the impact of a resource like UniProt, which, as we have seen in this blog, is considerable.

Thursday, February 10, 2022

Metrics, metrics, metrics: Measuring the Impact of UniProt