Mock community taxonomic classification performance of publicly available shotgun metagenomics pipelines.

E Michael Valencia | Katherine A Maki | Jennifer N Dootz | Jennifer J Barb

Scientific data | 2024

Shotgun metagenomic sequencing comprehensively samples the DNA of a microbial sample. Choosing the best bioinformatics processing package can be daunting due to the wide variety of tools available. Here, we assessed publicly available shotgun metagenomics processing packages/pipelines including bioBakery, Just a Microbiology System (JAMS), Whole metaGenome Sequence Assembly V2 (WGSA2), and Woltka using 19 publicly available mock community samples and a set of five constructed pathogenic gut microbiome samples. Also included is a workflow for labelling bacterial scientific names with NCBI taxonomy identifiers for better resolution in assessing results. The Aitchison distance, a sensitivity metric, and total False Positive Relative Abundance were used for accuracy assessments for all pipelines and mock samples. Overall, bioBakery4 performed the best on most of the accuracy metrics, while JAMS and WGSA2 had the highest sensitivities. Furthermore, bioBakery is commonly used and only requires a basic knowledge of command line usage. This work provides an unbiased assessment of shotgun metagenomics packages and presents results assessing the performance of the packages using mock community sequence data.
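
The three accuracy measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions used in the paper, so everything below is an assumption: the Aitchison distance is computed as the Euclidean distance between centered log-ratio (CLR) transformed profiles with a small pseudocount replacing zeros, sensitivity is taken as the fraction of expected taxa that were detected, and the false positive relative abundance is the summed abundance of detected taxa that are not in the mock community. The function names, the pseudocount value, and the example profiles are hypothetical. The profiles are keyed by NCBI taxonomy identifier.

```python
import math


def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform; the pseudocount (an assumption) avoids log(0)."""
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(expected, observed, pseudocount=1e-6):
    """Euclidean distance between the CLR-transformed profiles over the union of taxa."""
    taxa = sorted(set(expected) | set(observed))
    x = clr([expected.get(t, 0.0) for t in taxa], pseudocount)
    y = clr([observed.get(t, 0.0) for t in taxa], pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sensitivity(expected, observed):
    """Fraction of expected taxa reported with non-zero abundance."""
    expected_taxa = {t for t, a in expected.items() if a > 0}
    found = {t for t in expected_taxa if observed.get(t, 0.0) > 0}
    return len(found) / len(expected_taxa)


def false_positive_relative_abundance(expected, observed):
    """Total relative abundance assigned to taxa absent from the mock community."""
    return sum(a for t, a in observed.items() if expected.get(t, 0.0) == 0)


# Illustrative profiles only (not data from the paper), keyed by NCBI taxonomy ID:
# 562 Escherichia coli, 1280 Staphylococcus aureus,
# 1351 Enterococcus faecalis, 287 Pseudomonas aeruginosa.
expected = {"562": 0.50, "1280": 0.30, "1351": 0.20}
observed = {"562": 0.55, "1280": 0.35, "287": 0.10}

print("Aitchison distance:", aitchison_distance(expected, observed))
print("Sensitivity:", sensitivity(expected, observed))
print("False positive relative abundance:",
      false_positive_relative_abundance(expected, observed))
```

Keying both profiles by taxonomy identifier rather than by scientific name keeps a renamed or synonymous species from being counted as both a missed taxon and a false positive. The Aitchison distance is sensitive to the pseudocount chosen, so values are only comparable between pipelines when the same zero-handling is applied to all of them.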

PubMed ID: 38233447

Research resources used in this publication

None found

Additional research tools detected in this publication

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.

NCBI Taxonomy (tool)

RRID:SCR_003256

Database for a curated classification and nomenclature that contains the names of all organisms that are represented in the public sequence databases with at least one nucleotide or protein sequence. Data provided encompasses archaea, bacteria, eukaryota, viroids and viruses. The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such.

Pfam (tool)

RRID:SCR_004726

A database of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Users can analyze protein sequences for Pfam matches, view Pfam family annotation and alignments, see groups of related families, look at the domain organization of a protein sequence, find the domains on a PDB structure, and query Pfam by keywords. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. They are supplemented by entries generated automatically using the ADDA database, which are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans (collections of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM).

MetaPhlAn (tool)

RRID:SCR_004915

THIS RESOURCE IS NO LONGER IN SERVICE. Documented on February 28, 2023. Computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. It relies on unique clade-specific marker genes identified from reference genomes.

Biowulf at the NIH (tool)

RRID:SCR_007169

The NIH Biowulf cluster is a GNU/Linux parallel processing system designed and built at the National Institutes of Health and managed by the Helix Systems Staff. The system is designed for large numbers of simultaneous jobs common in bioinformatics as well as large-scale distributed memory tasks such as molecular dynamics. Sponsor: This work was supported by the National Institutes of Health Intramural Research Program through the Center for Information Technology and the National Institute of Neurological Disorders and Stroke, and by the Internal National Institute of Standards and Technology Research Fund. Keywords: Software, Program, Processing, System, Simultaneous, Bioinformatics, Memory, Molecular, Dynamics.

Globus Genomics (tool)

RRID:SCR_011887

Software-as-a-service for big data management offering fast, reliable, secure file transfer and sharing services to non-profit researchers. It combines state-of-the-art algorithms, data management tools, a graphical workflow environment, and an elastic computing infrastructure making it easy to manipulate, store, and share your data, no matter how big it gets.

AMOS (tool)

RRID:SCR_013067

A collection of tools and class interfaces for the assembly of DNA reads.

MultiQC (tool)

RRID:SCR_014982

Data aggregation tool that compiles results from bioinformatics analyses across multiple samples into a single report. It is written in Python.

Nephele (tool)

RRID:SCR_016595

Cloud based platform for simplified, standardized and reproducible microbiome data analysis. Allows users to process microbiome datasets through pipelines of existing software tools.

Seqtk (tool)

RRID:SCR_018927

Fast and lightweight software tool for processing sequences in FASTA or FASTQ format.
