Wednesday, March 2, 2016

Pan proteomes in UniProt

A proteome is the set of proteins thought to be expressed by an organism and is typically obtained from the translation of a fully sequenced, annotated genome. The last few years have seen a vast increase in the submission of multiple genomes for the same or closely related organisms. To help users find the most relevant and best-annotated set of sequences for each taxon, we now have the twin concepts of Reference proteomes and the newly introduced Pan proteomes.

Reference proteomes are chosen to provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProtKB. These proteomes - both community selected and computationally determined - include model organisms and other proteomes of interest to biomedical and biotechnological research.

A pan proteome is the full set of proteins thought to be expressed by a group of highly related organisms (e.g. multiple strains of the same bacterial species). Pan proteomes provide a representative set of all the sequences within a taxonomic group and capture unique sequences not found in the group’s reference proteome. UniProtKB pan proteomes encompass all non-redundant proteomes and are aimed at users interested in phylogenetic comparisons and the study of genome evolution and gene diversity.

When a proteome has proteins that are part of a larger pan proteome, this is indicated on the proteome page in the 'Pan proteome' row, together with a link to download the full FASTA sequence set.

You can also download pan proteome sets from the UniProt FTP site through the 'Pan proteomes' subdirectory.

For each reference proteome cluster, also known as a representative proteome group (RPG) (Chen et al., 2011), the pan proteome consists of all the sequences in the reference proteome plus the unique protein sequences that are found in other species or strains of the cluster but not in the reference proteome. Click here to find out more about how we compute pan proteome sets.
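
As a rough illustration of the definition above, the following Python sketch builds a pan proteome from one reference proteome and the other proteomes in its cluster. It is not the UniProt pipeline: the function name `build_pan_proteome`, the accessions, and the sequences are all hypothetical, and "unique" is assumed here to mean "not an exact string match to a sequence already collected", which is almost certainly simpler than the criterion used in production.

```python
from typing import Dict, Iterable


def build_pan_proteome(
    reference: Dict[str, str],
    others: Iterable[Dict[str, str]],
) -> Dict[str, str]:
    """Return reference sequences plus sequences unique to other proteomes.

    Each proteome is a mapping of accession -> protein sequence.
    Assumption: accessions are unique across proteomes, and uniqueness
    is decided by exact sequence identity (a simplification).
    """
    # Start with every sequence in the reference proteome.
    pan = dict(reference)
    seen = set(reference.values())

    # Add sequences from the other cluster members only if not yet seen.
    for proteome in others:
        for accession, sequence in proteome.items():
            if sequence not in seen:
                seen.add(sequence)
                pan[accession] = sequence
    return pan


# Toy data: hypothetical accessions and made-up sequences.
reference = {"REF1": "MKT", "REF2": "MAA"}
strain_a = {"A1": "MKT", "A2": "MGG"}  # A1 duplicates REF1
strain_b = {"B1": "MGG", "B2": "MLL"}  # B1 duplicates A2

pan = build_pan_proteome(reference, [strain_a, strain_b])
print(sorted(pan))
```

In this toy case the pan proteome keeps both reference entries and adds only A2 and B2, the sequences that the reference proteome does not already contain.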