Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots.

Frontiers in genetics | 2013

Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
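The abstract names three per-contig indicators: proportion of GC bases, read coverage, and the best-matching taxon from an annotated database. The sketch below shows how such a table might be built and used to split contigs into a target bin and a remainder. It is only an illustration under stated assumptions, not the Blobology pipeline itself: the `Contig` record, the `mapped_bases` field, the "no-hit" label, and the thresholds in `partition` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    mapped_bases: int  # hypothetical input: total read bases aligned to the contig
    taxon: str         # hypothetical input: best-hit taxon, or "no-hit"


def gc_proportion(sequence: str) -> float:
    """Fraction of G and C among unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def mean_coverage(contig: Contig) -> float:
    """Mean read depth, approximated as aligned bases divided by contig length."""
    if not contig.sequence:
        return 0.0
    return contig.mapped_bases / len(contig.sequence)


def tagc_rows(contigs):
    """One (name, GC, coverage, taxon) row per contig: the data behind a TAGC plot."""
    return [(c.name, gc_proportion(c.sequence), mean_coverage(c), c.taxon)
            for c in contigs]


def partition(rows, gc_range, min_cov, keep_taxa):
    """Split contig names into (target, other).

    Assumed rule: keep a contig if its taxon is in keep_taxa, or if it has
    no database hit but falls inside the target's GC and coverage window.
    """
    low, high = gc_range
    target, other = [], []
    for name, gc, cov, taxon in rows:
        in_window = low <= gc <= high and cov >= min_cov
        if taxon in keep_taxa or (taxon == "no-hit" and in_window):
            target.append(name)
        else:
            other.append(name)
    return target, other


# Toy data, invented for illustration only.
contigs = [
    Contig("c1", "ATATATGCAT", 500, "Nematoda"),
    Contig("c2", "GCGCGCGCAT", 5000, "Proteobacteria"),
    Contig("c3", "ATGCATATAT", 450, "no-hit"),
]
rows = tagc_rows(contigs)
target, other = partition(rows, gc_range=(0.1, 0.4), min_cov=20,
                          keep_taxa={"Nematoda"})
```

In practice the GC and coverage window would be chosen by inspecting the plot, which is the step the abstract says Blobsplorer is meant to help with. The reads contributing to the target contigs could then be extracted and reassembled.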

PubMed ID: 24348509

Research resources used in this publication

None found

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


khmer (tool)

RRID:SCR_001156

Software library and suite of command line tools for working with DNA sequence that takes a k-mer-centric approach to sequence analysis. It is primarily aimed at short-read sequencing data such as that produced by the Illumina platform.


GitHub (tool)

RRID:SCR_002630

A web-based hosting service for software development projects that use the Git revision control system, offering powerful collaboration, code review, and code management. It offers both paid plans for private repositories, and free accounts for open source projects. Large or small, every repository comes with the same powerful tools. These tools are open to the community for public projects and secure for private projects. Features include:
* Integrated issue tracking
* Collaborative code review
* Easily manage teams within organizations
* Text entry with understated power
* A growing list of programming languages and data formats
* On the desktop and in your pocket - Android app and mobile web views let you keep track of your projects on the go.


WormBase (tool)

RRID:SCR_003098

Central data repository for nematode biology, including complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. Data concerning genetics, genomics and biology of C. elegans and related nematodes. Derived from the initial ACeDB database of C. elegans genetic and sequence information, WormBase includes genomic, anatomical and functional information on C. elegans, other Caenorhabditis species and other nematodes. Maintains a public FTP site where researchers can find many commonly requested files and datasets, WormBase software and prepackaged databases.


NCBI BLAST (tool)

RRID:SCR_004870

Web search tool to find regions of similarity between biological sequences. Program compares nucleotide or protein sequences to sequence databases and calculates statistical significance. Used for identifying homologous sequences.


QIAGEN (tool)

RRID:SCR_008539

A commercial organization which provides assay technologies to isolate DNA, RNA, and proteins from any biological sample. Assay technologies are then used to make specific target biomolecules, such as the DNA of a specific virus, visible for subsequent analysis.


ABySS (tool)

RRID:SCR_010709

De novo, parallel, paired-end sequence assembler designed for short reads. ABySS 1.0 originally showed that assembling the human genome from short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed across several computers using a standardized message-passing system (MPI). ABySS 2.0 is described as "Resource Efficient Assembly of Large Genomes using Bloom Filter". It departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent the de Bruijn graph and reduce memory requirements.


TBLASTX (tool)

RRID:SCR_011823

A web-based tool used to search translated nucleotide databases using a translated nucleotide query.


CEGMA (tool)

RRID:SCR_015055

THIS RESOURCE IS NO LONGER IN SERVICE, documented on January 19, 2022. Tool to annotate core genes in eukaryotic genomes; it has been replaced by BUSCO. The resulting core gene dataset can be used to train a gene finder or to assess the completeness of a genome or its annotations.
