Browsing entries in the InterPro website¶
You can get to entry pages in InterPro in lots of different ways. Commonly this will involve clicking on a link to an entry from one of the search methods. This section describes the different types of entries and what you will find for each of their pages.
There are 7 categories of entry pages in InterPro: InterPro entry, Member database, Protein, Structure, Taxonomy, Proteome and Set.
Entry data tabs are available when appropriate. We describe each in detail on the first entry page in which it appears. Most entry data tabs are described within the InterPro entry page.
InterPro entry page¶
An InterPro entry represents a unique protein homologous superfamily, family, domain, repeat or important site based on one or more signatures provided by the InterPro member databases.
InterPro entry pages give the name, unique InterPro identifier and a brief description of the entry. The InterPro entry type (homologous superfamily, family, domain, repeat or site) is also indicated by an icon (e.g. a D with a green background for a domain). Member databases contributing signatures to the entry are shown in a box on the right-hand side of the page. Overlapping homologous superfamilies and/or Relationships to other entries are indicated where available. Clicking on the star symbol next to the entry name saves the entry as a Favourite. The full list of saved entries is available in the Favourites Entries component on the homepage. Additional browse tabs provide further information on the entry and are displayed when the information is available.
Types of data that may be available in the browse tabs of an InterPro entry page include:
Proteins: a table listing the proteins that are included in this entry. It provides the option to display only proteins that have been manually curated in UniProtKB (reviewed), only proteins that have been automatically annotated (unreviewed), or all proteins (both, the default).
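The same reviewed/unreviewed choice can be expressed when requesting the protein list programmatically. The sketch below only builds a request URL. It assumes the InterPro REST API is served under `https://www.ebi.ac.uk/interpro/api` and that the path segments `reviewed`, `unreviewed` and `UniProt` select the three options; both the base URL and the path layout are assumptions to check against the API documentation, and `proteins_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumed base URL of the InterPro REST API; verify before relying on it.
API_BASE = "https://www.ebi.ac.uk/interpro/api"

# Website filter label -> assumed API path segment.
PROTEIN_SOURCES = {
    "reviewed": "reviewed",
    "unreviewed": "unreviewed",
    "both": "UniProt",
}


def proteins_url(entry_accession, status="both"):
    """Build a URL listing the proteins matched by an InterPro entry."""
    if status not in PROTEIN_SOURCES:
        raise ValueError(f"status must be one of {sorted(PROTEIN_SOURCES)}")
    source = PROTEIN_SOURCES[status]
    return f"{API_BASE}/protein/{source}/entry/interpro/{quote(entry_accession)}/"
```

For example, `proteins_url("IPR000001", "reviewed")` yields a URL restricted to reviewed proteins, while omitting `status` mirrors the website default of showing both.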
Domain architectures: provides information about the different domain arrangements of the proteins matching this entry, based on Pfam signatures. For InterPro entries, information is provided on how the domain occurs in protein sequences and which combinations, if any, arise with other entries.
Taxonomy: list of species this entry matches, based on data from UniProt taxonomy. For each organism, the taxonomy identifier and protein count are provided. The ACTIONS column offers the possibility to:
View all the protein matches in the Proteins tab
Download a FASTA file of the protein matches
View the taxonomy information in the Taxonomy entry page
The information can be displayed in two different ways:
By “Key species”: 12 model organisms commonly used in scientific research, namely Oryza sativa subsp. japonica, Arabidopsis thaliana, Homo sapiens, Danio rerio, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli, Escherichia virus T4 and Halobacterium salinarum.
As a list of all the species in which the proteins matching this entry are found.
The type of data displayed can be changed using the website settings, accessible through the InterPro banner Settings sidebar.
Proteomes: list of proteomes whose members are represented by proteins matching this entry. A proteome is the set of proteins of an organism whose genome has been fully sequenced. A given taxonomy node may have one or more proteomes, for example, to reflect different assemblies of a genome. Proteome data is imported from UniProt proteomes. For each proteome, the same actions are available as in Taxonomy, with the taxonomy information replaced by proteome information in the Proteome entry page.
Structures: list of structures from the PDBe database that match protein sequences included in this entry.
At the top of the page, a 3D viewer (powered by Mol*) shows an interactive view of the predicted structure for one of the proteins matching the InterPro entry. The structure is coloured by per-residue pLDDT score and can be zoomed in and out, and rotated. Clicking on a residue zooms in and displays contacts with surrounding residues; clicking on the blank area around the structure zooms out.
The protein accession and organism are displayed on the left-hand side, together with links to the corresponding AlphaFold and UniProt websites. The model confidence colour scale, determined using the pLDDT score, is also displayed, varying from dark blue (very high confidence) to orange (very low confidence).
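As a rough illustration of how a pLDDT score maps onto the colour scale, the sketch below bins a score into four confidence bands. Only the two extremes (dark blue for very high, orange for very low) come from the text above. The cut-offs at 90, 70 and 50 and the intermediate colours follow the convention commonly used for AlphaFold models and are assumptions here, as is the helper name `plddt_band`.

```python
def plddt_band(score):
    """Return (confidence label, colour name) for a pLDDT score in [0, 100].

    Cut-offs and intermediate colours are assumed, not taken from InterPro.
    """
    if not 0 <= score <= 100:
        raise ValueError("pLDDT scores range from 0 to 100")
    if score > 90:
        return ("very high", "dark blue")
    if score > 70:
        return ("confident", "light blue")
    if score > 50:
        return ("low", "yellow")
    return ("very low", "orange")
```

Applying `plddt_band` to each residue of a model gives a quick summary of which regions are predicted with high confidence.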
The data can be downloaded in PDB or mmCIF format, by clicking on the corresponding buttons below the 3D viewer.
On an InterPro entry page, below the 3D viewer, a table containing the list of UniProt accessions matching the InterPro entry for which structure predictions have been generated is shown. For each protein it is possible to:
List of experimentally characterised protein:protein interactions that involve proteins matching this entry.
The data can be filtered and sorted by UniProt accession (protein), resource (evidence) and confidence score. To sort, click on the arrow symbol of the corresponding column. To filter, click on the funnel symbol and select the filter to apply.
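If the interaction table has been exported and you want to reproduce the same filtering and sorting offline, it could look like the sketch below. The `Interaction` record and its field names are hypothetical and simply mirror the three columns named above; they are not an InterPro data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    protein: str       # UniProt accession
    resource: str      # evidence resource
    confidence: float  # confidence score


def filter_interactions(rows, resource=None, min_confidence=0.0):
    """Keep rows from one resource (if given) at or above a confidence score."""
    return [
        row for row in rows
        if (resource is None or row.resource == resource)
        and row.confidence >= min_confidence
    ]


def sort_interactions(rows, column="confidence", descending=False):
    """Sort rows by one column, as clicking a column arrow would."""
    if column not in ("protein", "resource", "confidence"):
        raise ValueError(f"unknown column: {column}")
    return sorted(rows, key=lambda row: getattr(row, column), reverse=descending)
```

Filtering first and sorting second keeps the two operations independent, in the same way the funnel and arrow symbols act independently in the table.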
Member database page¶
InterPro provides entry pages for each signature that a member database holds. This includes signatures that have not yet been, or can’t be, integrated into InterPro (unintegrated signatures).
Member database signature entries provide information about which database the signature is from, the signature identifier, the type of entry as defined by the member database (e.g. family, domain or site), and the short name given to the entry by the member database.
Some InterPro member databases create groups of families that are evolutionarily related. Pfam calls them clans, CDD uses the term superfamily and, for PIRSF and PANTHER, the concept is associated with the parent families of their hierarchy. We use the umbrella term Set to refer to all of them. When available, the set to which the signature belongs is indicated.
The right hand side of the page provides links to the InterPro entry in which this signature has been integrated, and an external link to the signature on the member database’s website.
For signatures provided by the Pfam member database, a short extract of the Wikipedia page is also displayed, when available, to complement the description.
In addition to the Proteins, Domain architectures, Taxonomy, Proteomes and Structures tabs, member database pages may display information in the following tabs: Signature, trRosetta, Alignment and Curation.
The signature representing the model that defines the entry is visualised on this page as a logo, using Skylign. The logo data is displayed for the Pfam, PANTHER, PIRSF, SFLD and TIGRFAM member databases.
The visualisation displays the amino acid conservation for each residue in the model. To navigate large logos, the user can drag the rendered area to a desired position. Alternatively, the user can input a residue number to be viewed. When selecting a particular residue in the logo, the probabilities of each amino acid are displayed in the bottom part.
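To give a feel for how per-residue amino acid probabilities relate to conservation, the sketch below computes the information content of one model position as log2(20) minus the Shannon entropy of its probabilities. This is a simplified measure that assumes a uniform background over the 20 amino acids; it is not a description of how Skylign scales its logos, and `column_information` is a hypothetical helper.

```python
import math


def column_information(probabilities):
    """Information content in bits of one position.

    `probabilities` maps amino acid letters to probabilities summing to 1.
    Uses log2(20) minus Shannon entropy (uniform background assumed).
    """
    total = sum(probabilities.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    entropy = -sum(p * math.log2(p) for p in probabilities.values() if p > 0)
    return math.log2(20) - entropy
```

A fully conserved position reaches the maximum of about 4.32 bits, while a position where all 20 amino acids are equally likely scores 0.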
The field of protein structure prediction has greatly advanced over recent years such that deep-learning-based methods are now able to predict high-quality de novo protein structures. Structure models and contact maps have been created for some of the Pfam families that do not have a structure in the PDB. They are available under the trRosetta tab of Pfam signature pages. The models are generated using the automated trRosetta modeling pipeline [2, 3] developed by the Baker group and tested at CASP14. The primary driving forces for model building are residue-residue geometry constraints derived from coevolutionary data (see figure below) in the Pfam UniProt alignments, and top-scoring structural templates from deep learning.
An accurate contact prediction relies on there being a large number of sequences with sufficient diversity in the alignment, so that residue-residue covariance can be distinguished from lineage effects. This means that structure prediction is not possible for all Pfam families, as not all of them have the required number and diversity of sequences in the Pfam alignment.
The 3D structure of the model is displayed in the 3D viewer, and can be zoomed in and out, and rotated. Clicking on a residue in the viewer zooms in and displays contacts with surrounding residues; clicking on the blank area around the structure zooms out. The structure is coloured by per-residue pLDDT score with a rainbow gradient going from blue (high confidence) to red (low confidence).
Below the 3D viewer, the Heatmap visualisation displays the residue contacts using the distance metric. Hovering on the heatmap highlights the contacts in the 3D structural model.
The contact map information is displayed for the Pfam family SEED alignment. Hovering or clicking on a contact position highlights its connection to other residues in the alignment as well as on the 3D structure. The model data can be downloaded by clicking on the Download button located below the 3D viewer.
1. Hover or click on a circle to see the contact residues for the column under the circle.
2. Contacts for the selected column will be shown with connecting lines.
3. The probability threshold of the residues being closer than 8Å can be changed using the slider. Decreasing the probability threshold will increase the number of contacts.
4. The highlighted column selected in step 1 will be shown in red on the structure model. The residues that are in contact will be shown in blue.
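The effect of the probability slider can be sketched as a simple filter over predicted contacts. The example assumes the contact map is available as a dictionary mapping residue (column) pairs to the probability of being closer than 8Å; that representation and both helper names are hypothetical.

```python
def contacts_at_threshold(contact_probs, threshold):
    """Return the residue pairs whose contact probability is >= threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return sorted(pair for pair, prob in contact_probs.items() if prob >= threshold)


def partners_of(contacts, column):
    """Columns in contact with `column` among the selected pairs."""
    partners = set()
    for i, j in contacts:
        if i == column:
            partners.add(j)
        elif j == column:
            partners.add(i)
    return sorted(partners)
```

Lowering `threshold` can only keep the same pairs or add more, which is why moving the slider down increases the number of contacts drawn. `partners_of` corresponds to selecting a column and highlighting the residues it contacts.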
This section allows users to view and download any available alignment file that is associated with the current member database signature. Currently, the alignment files are only available for the Pfam member database, but hopefully we will be able to include alignments for other member databases in the future.
First, one of the available alignments has to be selected. For example in the image below the user has selected the “seed” alignment. If the selected alignment has more than 1000 sequences, a warning message appears to inform users that big alignments can cause memory issues in the browser. A compressed file (gzip) of the current alignment is available by clicking on the Download button.
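Once the compressed alignment has been downloaded, its size can be checked before loading it into another tool. The sketch below counts the sequences in a gzip-compressed alignment and applies the same 1000-sequence limit. It assumes the file is in Stockholm format, where annotation lines start with `#` and the alignment ends with `//`; if the download uses another format, the parsing would need to change. Both function names are hypothetical.

```python
import gzip
import io

LARGE_ALIGNMENT = 1000  # limit quoted in the text above


def count_alignment_sequences(gz_bytes):
    """Count distinct sequence names in a gzip-compressed Stockholm alignment."""
    names = set()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, annotation lines and the end-of-alignment marker.
            if not line or line.startswith("#") or line == "//":
                continue
            # The first whitespace-separated field is the sequence name.
            names.add(line.split()[0])
    return len(names)


def is_large_alignment(gz_bytes):
    """True when the alignment has more sequences than the warning limit."""
    return count_alignment_sequences(gz_bytes) > LARGE_ALIGNMENT
```

Names are collected in a set so that interleaved alignments, where each sequence appears in several blocks, are not counted more than once.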
Interacting with the grey navigation bar over the sequences allows users to navigate the alignment; dragging the left and right limits of the navigation bar allows users to zoom to a particular position or adjust the zoom level. Alternatively, the zoom level can be set by scrolling up/down while holding the [Ctrl] key. Scrolling up/down without the key moves other sequences in the alignment into the visible area of the viewer.
This section provides information about the curation of the signature. Currently, it is only available for the Pfam member database. It is divided into 2 subsections:
Curation: details about Pfam curators and Sequence ontology
HMM information: displays the HMM building command used and offers the possibility to download the HMM profile defining the signature
Protein entry page¶
The Protein entry page contains information on a specific protein provided by UniProt. Protein pages can be accessed either by entering a UniProt accession in a Text search or by clicking on a protein accession from the Proteins tab in an entry page.
The protein page provides the protein accession, the short name given to the protein by UniProt, the length of the protein sequence, the species in which the protein is found, the proteome it belongs to and a brief description of the protein’s function where known. All the InterPro family entries this protein matches are listed under “Protein family membership”. The right-hand side of the page provides an external link to the protein entry in UniProt, an export of the matches in TSV format, and the option to perform a HMMER search or an InterProScan search.
The protein entry page also displays the protein sequence viewer to show the associated domains, sites etc.
When available, different isoforms of the protein can be selected to compare their InterPro matches with the consensus protein sequence. When an isoform is selected, a new protein sequence viewer corresponding to the selection is displayed and the URL is updated to reflect the change. The isoform matches can also be viewed side by side with the consensus protein sequence by clicking on the split icon after selecting an isoform.
List of InterPro entries that include this entity. The results can be filtered by member databases using the dropdown box located on the left side of the header of the result table. This functionality is available for all the tables presenting InterPro entries in the website.
This tab shows the protein FASTA sequence. The full sequence or part of the sequence (by selecting the region of interest) can be used to perform two types of search, available on the right side of the screen: InterProScan search or HMMER search, which redirects to the corresponding pages.
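Selecting a region of interest amounts to taking a slice of the sequence. The sketch below reads the first record of a FASTA string and extracts a region using 1-based, inclusive coordinates, which is the usual convention for protein positions. The helper names are hypothetical and the code does not submit anything to InterProScan or HMMER.

```python
def read_fasta_sequence(fasta_text):
    """Return the sequence of the first record in a FASTA string."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # stop at the next record
        sequence_lines.append(line.strip())
    return "".join(sequence_lines)


def region(sequence, start, end):
    """Return residues start..end (1-based, inclusive) of a sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must lie within the sequence")
    return sequence[start - 1:end]
```

The resulting string, or the full sequence, is what would be pasted into a sequence search form.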
Structure entry page¶
InterPro provides entries for all the structures available in the Protein Data Bank in Europe (PDBe). A structure page can be reached by clicking on a structure in a results list, by entering the structure identifier in the Quick search box (magnifying glass symbol), or by performing a Text search.
At the top of the structure page, general information about the structure is displayed: the structure’s accession number (PDB ID), resolution, release date, the method used to determine the structure (e.g. “Xray”) and the chains composing the structure. An external link to the structure entry in the PDBe database is provided on the right hand side of the page.
Following the general information section, a 3D viewer (powered by Mol*) shows an interactive view of the 3D structure. Clicking on a residue in the viewer zooms in and displays contacts with surrounding residues; clicking on the blank area around the structure zooms out. Below it, the protein sequence viewer has an extra category representing the secondary structure information. Hovering over one of the tracks highlights the corresponding region of the protein structure in the 3D structure viewer.
More information is available in the corresponding Train online section.
Taxonomy entry page¶
Taxonomy pages display the name, taxonomy ID, lineage and child nodes for a particular taxon. Any reference to this taxon from another page throughout the website will link to this page.
The overview also includes a graphical representation of the lineage of the selected taxon. The nodes in the visualisation are also links, so you can jump to the page of a particular taxon of interest.
Proteome entry page¶
The proteome entry page displays general information provided by UniProt: its ID, strain, and a link to the related species.
The image shows the proteome page for C. elegans, whose proteome ID is UP000001940. As the counters in the tabs show, it has 9K related InterPro entries, 27K proteins and 363 structures. Note that these figures are for InterPro version 81.0 and are used here only as an example.
Set entry page¶
Some InterPro member databases create groups of families that are evolutionarily related, called sets. This page offers an overview of a specific set provided by a member database, including a short description and an interactive view of the signatures included in the set. For sets provided by the Pfam member database, an additional section provides literature references, when available.
List of signatures included in the clan and their alignment with other signatures in the clan.