EVEREST: automatic identification and classification of protein domains in all protein sequences.

BMC bioinformatics | 2006

Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again.
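The control flow of the iterated procedure described above can be sketched roughly as follows. This is only a toy illustration, not the EVEREST implementation: every function name and parameter here is hypothetical, the longest shared substring stands in for pairwise sequence comparison, grouping identical segments stands in for clustering, a minimum member count stands in for the machine-learning selection, and an exact string stands in for the statistical model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def find_segments(seqs, min_len=4):
    """Toy all-vs-all comparison: keep the longest shared substring of each pair."""
    segments = set()
    for (id_a, a), (id_b, b) in combinations(seqs.items(), 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size >= min_len:
            segments.add((id_a, m.a, m.a + m.size))
            segments.add((id_b, m.b, m.b + m.size))
    return segments

def cluster(seqs, segments):
    """Toy clustering: segments with identical residues form one putative family."""
    families = defaultdict(set)
    for seq_id, start, end in segments:
        families[seqs[seq_id][start:end]].add((seq_id, start, end))
    return families

def select(families, min_members=2):
    """Toy stand-in for the selection step: keep families with enough members."""
    return {k: v for k, v in families.items() if len(v) >= min_members}

def scan(seqs, models):
    """Toy 'model' scan: report every exact occurrence of each model string."""
    hits = set()
    for model in models:
        for seq_id, residues in seqs.items():
            pos = residues.find(model)
            while pos != -1:
                hits.add((seq_id, pos, pos + len(model)))
                pos = residues.find(model, pos + 1)
    return hits

def run(seqs, rounds=2):
    """Build segments, cluster, select, model, then rescan and repeat."""
    segments = find_segments(seqs)
    families = {}
    for _ in range(rounds):
        families = select(cluster(seqs, segments))
        segments = scan(seqs, list(families))  # library for the next round
    return families

# Made-up sequences; p1 carries the shared segment twice.
seqs = {"p1": "AACDEFGHKKCDEFGH", "p2": "TTCDEFGHTT", "p3": "WWLMNPQRSWW"}
one_round = run(seqs, rounds=1)
two_rounds = run(seqs, rounds=2)
```

In this toy setting the first round sees only one copy of the segment in `p1`, because each pair contributes a single match; the rescan recovers the second copy, so the family gains a member in the second round.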

PubMed ID: 16749920

Research resources used in this publication

None found

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


InterPro (tool)

RRID:SCR_006695

A service that provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalizing on their individual strengths to produce a powerful integrated database and diagnostic tool. This integrated database of predictive protein signatures is used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures. The data can be accessed programmatically via Web Services. The member databases use a number of approaches: ProDom provides sequence clusters built from UniProtKB using PSI-BLAST; PROSITE patterns provide simple regular expressions; PROSITE and HAMAP profiles provide sequence matrices; PRINTS provides fingerprints, which are groups of aligned, unweighted Position Specific Sequence Matrices (PSSMs); and PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY provide hidden Markov models (HMMs). Your contributions are welcome. You are encouraged to use the "Add your annotation" button on InterPro entry pages to suggest updated or improved annotation for individual InterPro entries.
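To make the "simple regular expressions" approach concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function name `prosite_to_regex` is hypothetical and the converter covers only the common syntax (`x`, `[...]`, `{...}`, repeat counts and end anchors); it is not the code used by InterPro or PROSITE. The example sequence is made up.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: handles x, [..], {..}, (n), (n,m) and the '<' / '>' anchors
    at the pattern ends. Rarer syntax is deliberately not covered.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split "x(2,4)" into the residue part and its optional repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # excluded residues
            core = "[^" + core[1:-1] + "]"
        # "[ABC]" and single letters are already valid regex fragments.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

# N-glycosylation motif written in PROSITE syntax, applied to a made-up sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
starts = [m.start() for m in re.finditer(regex, "MKNGSAANPTQ")]
```

Note that `re.finditer` reports non-overlapping matches only; a scanner that must report overlapping hits would need a lookahead-based search instead.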

ADDA - Automatic Domain Decomposition Algorithm (tool)

RRID:SCR_007546

This is a web interface for ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. ADDA is downloadable. There are three ways in which you can retrieve a protein sequence and its domains from ADDA: sequences can be located using sequence identifiers and/or accession numbers, using an identical fragment lookup, or by running BLAST against all sequences in ADDA. ADDA is a protein sequence clustering algorithm. It takes a set of sequences and returns domain families. ADDA has two steps, corresponding to the two aspects of the protein sequence clustering problem. First, ADDA splits protein sequences into domains. The idea behind ADDA is in principle the application of Occam's razor; the goal is to describe the diversity of protein sequences with a minimal set of protein domains. The algorithm behind ADDA approximates this minimal set. In practice ADDA works by looking at where BLAST alignments are located on the sequence and splits the sequences so that as few alignments as possible are cut by domain boundaries and as many alignments as possible stretch over complete domains. Second, ADDA takes all the domains and arranges them in a minimum spanning tree, where the similarity between two domains is determined by their relative overlap given a BLAST alignment. Each link in the tree is then checked by a pairwise profile-profile comparison, and links below a threshold are removed. The remaining connected components are then taken to represent protein domain families.
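The second step can be sketched as follows. This is a minimal illustration under stated assumptions, not ADDA's code: the domain names, overlap values, scores and threshold are made up, the `link_score` callable is a hypothetical stand-in for the profile-profile comparison, and building the tree with Kruskal's algorithm on the distance 1 - overlap is an assumption about how the spanning tree is obtained.

```python
from collections import defaultdict

class DisjointSet:
    """Union-find structure used for both tree building and component lookup."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_a] = root_b
        return True

def spanning_tree(domains, links):
    """Kruskal's algorithm; links are (domain_a, domain_b, relative_overlap)."""
    forest = DisjointSet(domains)
    tree = []
    # Smallest distance (1 - overlap) first, i.e. most similar pairs first.
    for a, b, overlap in sorted(links, key=lambda link: 1.0 - link[2]):
        if forest.union(a, b):
            tree.append((a, b, overlap))
    return tree

def domain_families(domains, tree, link_score, threshold):
    """Drop tree links scoring below threshold; return connected components."""
    forest = DisjointSet(domains)
    for a, b, _ in tree:
        if link_score(a, b) >= threshold:
            forest.union(a, b)
    groups = defaultdict(set)
    for domain in domains:
        groups[forest.find(domain)].add(domain)
    return sorted(sorted(group) for group in groups.values())

# Made-up domains, overlaps and link scores for illustration only.
domains = ["d1", "d2", "d3", "d4"]
links = [("d1", "d2", 0.9), ("d2", "d3", 0.8), ("d1", "d3", 0.5), ("d3", "d4", 0.3)]
scores = {("d1", "d2"): 0.95, ("d2", "d3"): 0.7, ("d3", "d4"): 0.1}

tree = spanning_tree(domains, links)
families = domain_families(domains, tree, lambda a, b: scores[(a, b)], threshold=0.5)
```

Here the weak link between `d3` and `d4` is removed, leaving one family of three domains and one singleton.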
