Understanding how genetic variation relates to its functional consequences in proteins is essential for interpreting genomic data and identifying causal variants in disease. This requires integrating genomic annotation with known protein function. UniProt is mapping genomic and protein data to build a better understanding of the functional effects of variants. UniProt’s recent publication, ‘UniProt genomic mapping for deciphering functional effects of missense variants’, describes this work of mapping UniProtKB human sequences and positional annotations, such as active sites, binding sites, and variants, to the human genome (GRCh38). This mapping allowed the creation of public genome track hubs for viewing the location of variants within protein functional domains on genome browsers (Fig 1), and it also allows data integration and comparison with other resources that map their data to the genome.
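To make the idea of mapping a protein position to the genome concrete, here is a minimal sketch. It is not UniProt's actual pipeline: the function name `codon_genomic_positions`, the exon coordinates, and the simplification that the coding sequence starts at the first listed base are all assumptions made for illustration. It shows how one amino acid position corresponds to three genomic bases, which may fall on either side of an intron.

```python
def codon_genomic_positions(aa_pos, cds_exons, strand="+"):
    """Return the three genomic coordinates that encode amino acid `aa_pos`.

    cds_exons: hypothetical list of (start, end) coding segments, 1-based,
    inclusive, sorted by ascending genomic coordinate.
    """
    if strand == "+":
        # Coding bases in transcript order on the forward strand.
        coords = [p for start, end in cds_exons for p in range(start, end + 1)]
    else:
        # On the reverse strand the transcript runs from high to low coordinates.
        coords = [p for start, end in reversed(cds_exons)
                  for p in range(end, start - 1, -1)]
    first = 3 * (aa_pos - 1)  # index of the codon's first base in the CDS
    if aa_pos < 1 or first + 3 > len(coords):
        raise ValueError("amino acid position lies outside the coding sequence")
    return coords[first:first + 3]


# Toy gene with two coding segments (12 coding bases, 4 codons).
toy_exons = [(100, 104), (200, 206)]
codon_2_plus = codon_genomic_positions(2, toy_exons, "+")
codon_3_minus = codon_genomic_positions(3, toy_exons, "-")
```

In this toy example the second codon on the forward strand spans the intron, so a positional annotation at that residue maps to bases in two different exons.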
The genome track hubs and related UniProtKB files are downloadable from the UniProt FTP site and discoverable as public track hubs at the UCSC and Ensembl genome browsers.
Fig 1: Protein functional domains in genome browsers
The paper compared ClinVar’s clinically annotated single nucleotide polymorphism (SNP) data with UniProt feature and variant annotation. To get an overview of variants in different functional features, we examined ClinVar gold-star-rated SNPs that overlap selected protein features. As illustrated in Fig 2, a missense variant in a key functional feature may alter a protein’s structure and function and, if the effect is severe enough, might be classified as harmful. This suggests that functional features could be a useful attribute to include in variant prediction algorithms, including machine‐learning approaches.
Fig 2: Percentage of ClinVar SNPs in each annotation category that exist in each feature type. The underlying data table is in the supplemental methods.
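The kind of tally summarized in Fig 2 can be sketched as follows. This is only an illustration on invented toy data, not the paper's method or results: the coordinates, feature types, categories, and the helper names `overlapping_feature_types` and `percent_by_feature` are all hypothetical. For each ClinVar category, it computes the percentage of SNPs that fall inside each feature type once both are expressed in genomic coordinates.

```python
from collections import Counter, defaultdict

# Hypothetical features: (chromosome, start, end, feature_type), 1-based inclusive.
features = [
    ("chr1", 100, 102, "active site"),
    ("chr1", 200, 260, "domain"),
    ("chr2", 50, 52, "binding site"),
]

# Hypothetical SNPs: (chromosome, position, clinvar_category).
snps = [
    ("chr1", 101, "pathogenic"),
    ("chr1", 230, "benign"),
    ("chr1", 250, "pathogenic"),
    ("chr2", 51, "pathogenic"),
    ("chr2", 500, "benign"),
]


def overlapping_feature_types(chrom, pos, features):
    """Feature types whose genomic interval contains the given position."""
    return {ftype for fchrom, start, end, ftype in features
            if fchrom == chrom and start <= pos <= end}


def percent_by_feature(snps, features):
    """Per category, the percentage of SNPs located in each feature type."""
    totals = Counter(category for _, _, category in snps)
    hits = defaultdict(Counter)
    for chrom, pos, category in snps:
        for ftype in overlapping_feature_types(chrom, pos, features):
            hits[category][ftype] += 1
    return {category: {ftype: 100.0 * n / totals[category]
                       for ftype, n in by_type.items()}
            for category, by_type in hits.items()}


result = percent_by_feature(snps, features)
```

A linear scan over the features is enough for a toy example; at genome scale an interval index or a dedicated interval-overlap tool would be the more practical choice.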
The paper also presents a direct comparison of ClinVar SNP annotation with UniProtKB natural variant annotation affecting the same amino acid. In general, the annotations agree, with ~86% of co-located UniProtKB disease-associated variants mapping to 'pathogenic' ClinVar SNPs. The reasons for disagreements were examined and discussed.
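A concordance figure of this kind can be computed as in the following sketch. The paired labels are invented toy data and the function name `concordance` is hypothetical; the output is not meant to reproduce the paper's number. Each pair holds the UniProtKB and ClinVar annotations for variants that affect the same amino acid.

```python
def concordance(pairs, uniprot_label="disease", clinvar_label="pathogenic"):
    """Fraction of co-located variants with `uniprot_label` in UniProtKB
    whose ClinVar annotation is `clinvar_label`.

    Returns None when no variant carries the UniProtKB label.
    """
    selected = [clinvar for uniprot, clinvar in pairs if uniprot == uniprot_label]
    if not selected:
        return None
    return sum(clinvar == clinvar_label for clinvar in selected) / len(selected)


# Hypothetical (UniProtKB annotation, ClinVar annotation) pairs.
toy_pairs = [
    ("disease", "pathogenic"),
    ("disease", "pathogenic"),
    ("disease", "benign"),
    ("disease", "pathogenic"),
    ("polymorphism", "benign"),
]

agreement = concordance(toy_pairs)
```

Variants on the disagreeing side of this ratio, such as the third toy pair, are the ones that would need manual review.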
A related publication by UniProtKB/Swiss-Prot curators looked in more detail at the concordance of variant interpretations from UniProtKB/Swiss-Prot with those of ClinVar. This publication also looked at the effect of re-curating UniProtKB/Swiss-Prot variants using guidelines of the American College of Medical Genetics and Genomics (ACMG) and tools from ClinGen. See “An enhanced workflow for variant interpretation in UniProtKB/Swiss-Prot improves consistency and reuse in ClinVar”.
The work described in these papers provides a basis for better integration and standardization of UniProtKB annotation with ClinVar and ClinGen.
UniProt hopes to investigate these and related topics in the future, and as a publicly funded resource, UniProt encourages others to further analyze the data as well.