Thursday, November 17, 2016

Being FAIR at UniProt

We live in the age of Big Data, with high-throughput genomics producing massive biological data sets. While this data presents opportunities for innovation and discovery, it also creates immense challenges for open access, data handling, processing and analysis. One way to help the scientific community get the most out of the available data is to make sure that our data is FAIR.

What is FAIR?


Good data management is essential to facilitate knowledge discovery, innovation, integration and reuse by the community after the data publication process. The FAIR Data Principles offer guidelines to standardise and improve data management, built on four foundational principles: Findability, Accessibility, Interoperability, and Reusability. The FAIR Guiding Principles were originally described in full in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/. The guidelines were developed with the requirements of both human readers and machine access in mind.

A FAIR UniProt


UniProt is one of the world's largest freely available biological data resources, and providing key life science data to the scientific community in the most open and accessible manner is at the heart of our mission. Good data management is essential for us to continue to support cutting-edge research in a sustainable and reliable manner. We see first-hand the challenges of data management and dissemination, and we welcome the FAIR guiding principles for data resources.


UniProt was one of the case studies presented in the original FAIR publication. What makes UniProt FAIR?




All entries are uniquely identified by a stable URL that provides access to the entry in a variety of formats including a web page, XML, plain-text, RDF and REST services (‘F’ and ‘A’).

UniProt is interlinked with more than 150 different databases, and every entry has extensive links into, for example, PubMed, enabling rich citation. These links are key to our user experience in both human-readable and machine-readable formats ('I').

The entries contain rich metadata (‘F’) that is both human-readable (HTML, text format) and machine-readable (XML and RDF). All our representations use shared vocabularies and ontologies such as GO and ECO (‘I’). Our RDF representation additionally uses the UniProt RDF Schema Ontology and FALDO ('I'&'R').

UniProt strives for interoperability by representing data that is common with another database in exactly the same way (‘R’). For example, the information about GO terms in UniProt RDF is a pure subset of the information about GO terms in the GO consortium's database. This kind of common data representation allows FAIR RDF databases to fit together like puzzle pieces in the larger life science data world.
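
As a rough illustration of the first point above, the sketch below shows how one stable entry address can fan out into several formats. The base URL, the extension mapping and the function names are assumptions made for this example rather than a description of our actual services, and P05067 is used only as a sample accession. The code builds addresses and makes no network requests.

```python
from urllib.parse import quote

# Assumption: this base URL and the "identifier + extension" pattern are
# illustrative only; consult the current UniProt documentation for the
# addresses that are actually served.
ASSUMED_BASE_URL = "https://www.uniprot.org/uniprot"

# Formats named in the text, mapped to hypothetical file extensions.
FORMAT_EXTENSIONS = {
    "xml": "xml",
    "text": "txt",
    "rdf": "rdf",
}


def entry_url(accession, fmt=None, base_url=ASSUMED_BASE_URL):
    """Build the address of one entry, optionally for a specific format."""
    stable = f"{base_url.rstrip('/')}/{quote(accession, safe='')}"
    if fmt is None:
        # No format requested: return the plain, stable identifier.
        return stable
    if fmt not in FORMAT_EXTENSIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{stable}.{FORMAT_EXTENSIONS[fmt]}"


def all_representations(accession, base_url=ASSUMED_BASE_URL):
    """Map each known format name to the address built for that entry."""
    return {fmt: entry_url(accession, fmt, base_url) for fmt in FORMAT_EXTENSIONS}


if __name__ == "__main__":
    # Print the stable identifier followed by one address per format.
    print(entry_url("P05067"))
    for name, url in all_representations("P05067").items():
        print(name, url)
```

The design point is that every format-specific address is derived from the same stable identifier, so a human following a link and a script requesting RDF both start from one citable URL.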


Being FAIR


We're not the only fans of FAIR data management. The FAIR principles have been adopted as a touchstone by funders and policy groups including the NIH Data Commons, the G20 Hangzhou Consensus, the Amsterdam Call for Action on Open Science and the European Open Science Cloud.

Challenges ahead


Being FAIR is not without its challenges. Not every data format is necessarily FAIR for human and machine readers alike. At UniProt, we handle this by ensuring that we also provide all our data in formats that are FAIR, to complement any that might not be. This is an ongoing effort with every new data type and data service we provide, but it is essential to make sure our data is valuable and actionable so that the community can make the most of it.