Ultrafast clustering algorithms for metagenomic sequence analysis.

Briefings in Bioinformatics | 2012

The rapid advances of high-throughput sequencing technologies dramatically prompted metagenomic studies of microbial communities that exist at various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis also addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.
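The idea of reducing a redundant data set to a small non-redundant set of cluster representatives can be illustrated with a toy greedy incremental clustering. This is only a minimal sketch under stated assumptions, not the implementation of any tool discussed in the publication: the function names `identity` and `greedy_cluster`, the 0.9 threshold and the example reads are hypothetical, and Python's `difflib.SequenceMatcher` ratio stands in for a real alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: list[str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Map each representative sequence to the members of its cluster.

    Sequences are processed from longest to shortest. Each one joins the
    first existing representative it matches at or above the threshold;
    otherwise it becomes the representative of a new cluster.
    """
    clusters: dict[str, list[str]] = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical reads: two families, each with near-duplicate members.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "ACGTACGTACGTACGTACG",
    "TTTTGGGGCCCCAAAATTTT",
    "TTTTGGGGCCCCAAAATTT",
]
clusters = greedy_cluster(reads)
non_redundant = list(clusters)  # one representative per cluster
```

Each sequence is compared only against the current representatives rather than against every other sequence, so the number of comparisons grows with the number of clusters instead of the square of the input size. The pairwise comparison used here is slow and is meant for illustration only.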

PubMed ID: 22772836

Associated grants

  • Agency: NHGRI NIH HHS, United States
    Id: R01 HG005978
  • Agency: NCRR NIH HHS, United States
    Id: R01 RR025030

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (tool)

RRID:SCR_002676

THIS RESOURCE IS NO LONGER IN SERVICE, documented May 26, 2016; however, the URL provides links to associated projects and data. A suite of data query, download, upload, analysis and sharing tools serving the needs of the microbial ecology research community, and other scientists using metagenomics data.

NCBI Sequence Read Archive (SRA) (tool)

RRID:SCR_004891

Repository of raw sequencing data from next-generation sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Data submissions are welcome. Archive of high-throughput sequencing data, part of the international partnership of archives (INSDC) at NCBI, the European Bioinformatics Institute and the DNA Database of Japan. Data submitted to any of these three organizations are shared among them.

NCBI (tool)

RRID:SCR_006472

A portal to biomedical and genomic information. NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information for the better understanding of molecular processes affecting human health and disease.

SEED (tool)

RRID:SCR_002129

The SEED is a framework to support comparative analysis and annotation of genomes. The cooperative effort focuses on the development of the comparative genomics environment and, more importantly, on the development of curated genomic data. Curation of genomic data (annotation) is done via the curation of subsystems by an expert annotator across many genomes, not on a gene-by-gene basis. From the curated subsystems we extract a set of freely available protein families (FIGfams). These FIGfams form the core component of our RAST automated annotation technology. Answering numerous requests for automatic SEED-quality annotations for more or less complete bacterial and archaeal genomes, we have established the free RAST-Server (RAST = Rapid Annotation using Subsystems Technology). Using similar technology, we make the Metagenomics-RAST-Server freely available. We also provide a SEED-Viewer that allows read-only access to the latest curated data sets. We currently have 58 Archaea, 902 Bacteria, 562 Eukaryota, 1254 Plasmids and 1713 Viruses in our database. All tools and datasets that make up the SEED are in the public domain and can be downloaded at ftp://ftp.theseed.org

UniGene (tool)

RRID:SCR_004405

THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 11, 2023. Web tool for an organized view of the transcriptome. Collection of the computationally identified transcripts from the same locus. Information on protein similarities, gene expression, cDNA clones, and genomic location. System for automatically partitioning GenBank sequences into a non redundant set of gene oriented clusters.

CD-HIT (tool)

RRID:SCR_007105

THIS RESOURCE IS NO LONGER IN SERVICE. Documented on February 28, 2023. Software program for clustering biological sequences, with applications such as making non-redundant databases, finding duplicates, identifying protein families, filtering sequence errors and improving sequence assembly. It is very fast and can handle extremely large databases. CD-HIT helps to significantly reduce the computational and manual effort in many sequence analysis tasks and aids in understanding the data structure and correcting the bias within a dataset. The CD-HIT package has CD-HIT, CD-HIT-2D, CD-HIT-EST, CD-HIT-EST-2D, CD-HIT-454, CD-HIT-PARA, PSI-CD-HIT, CD-HIT-OTU and over a dozen scripts.

  • CD-HIT (CD-HIT-EST) clusters similar proteins (DNAs) into clusters that meet a user-defined similarity threshold.
  • CD-HIT-2D (CD-HIT-EST-2D) compares 2 datasets and identifies the sequences in db2 that are similar to db1 above a threshold.
  • CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads.
  • CD-HIT-OTU clusters rRNA tags into OTUs.

The usage of other programs and scripts can be found in the CD-HIT user's guide. CD-HIT was originally developed by Dr. Weizhong Li at Dr. Adam Godzik's Lab at the Burnham Institute (now Sanford-Burnham Medical Research Institute).
