Single haplotype assembly of the human genome from a hydatidiform mole.

Genome research | 2014

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

PubMed ID: 25373144

Associated grants

  • Agency: NHGRI NIH HHS, United States
    Id: U41 HG007635
  • Agency: Howard Hughes Medical Institute, United States
  • Agency: NHGRI NIH HHS, United States
    Id: 5P01HG004120
  • Agency: NHGRI NIH HHS, United States
    Id: 2R01HG002385
  • Agency: NHGRI NIH HHS, United States
    Id: R01 HG002385
  • Agency: NHGRI NIH HHS, United States
    Id: P01 HG004120

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


VCFtools (tool)

RRID:SCR_001235

Software package for working with VCF files. Used to provide easily accessible methods for working with complex genetic variation data in the form of VCF files. Implements various utilities for processing Variant Call Format files, including validation, merging, and comparing. Provides a general Perl API.


BacPac Resources Center (tool)

RRID:SCR_001520

The BacPac Resources Center (BPRC) is the distribution arm of an academic laboratory. It operates on a cost-recovery basis to make the resources generated in that laboratory available to the academic scientific community. While clones and screening services are widely available, library arrays are primarily available to researchers with a scientific need to analyze most clones in a library. The site contains information on currently available BAC and PAC genomic DNA libraries, BAC clones, PAC clones, fosmid clones, cDNA collections, high-density colony hybridization filters, and BAC and PAC cloning vectors. It also provides the laboratory's protocols for hybridization-based screening of colony filters, purification of BAC and PAC DNA, and end sequencing. BPRC does not list individual clones, for two reasons: (1) most clones have not been characterized and lack specific data; (2) all clones are part of libraries, and all clones from a particular library share common characteristics. Hence, to find out whether BPRC has a particular clone, one needs either to use Automatic Clone Validation or to check whether the clone name falls within the range of clone names for the corresponding clone library. Typically (although not always), clone names are derived from the library name. BPRC uses the NCBI-recommended clone and library nomenclature. Most arrayed libraries are available in frozen microtiter dish format to academic and non-academic users, provided that there is a scientific need for complete-library access (for instance, to annotate, modify, or analyze all BAC clones as part of a genome project).


GenBank (tool)

RRID:SCR_002760

NIH genetic sequence database that provides an annotated collection of all publicly available DNA sequences for almost 280,000 formally described species (Jan 2014). These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assigns accession numbers upon data receipt. It is part of the International Nucleotide Sequence Database Collaboration, and daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.


HuRef (tool)

RRID:SCR_002952

Database for the diploid genome sequence of J. Craig Venter as published in PLoS Biology. Its graphical interface depicts the haploid sequence with SNP and insertion/deletion DNA variants as identified by genome assembly and comparison methods. It also shows gene annotations and the haplotype blocks from which the diploid genome sequence can be inferred.


mrsFAST (tool)

RRID:SCR_003128

A cache-oblivious algorithm designed to map short reads to reference genome assemblies in a fast and memory-efficient manner. It optimizes cache usage to achieve higher performance. Currently supported features:

  • Mismatches, no indels
  • Paired-end mapping mode
  • Discordant paired-end mapping mode (to be used in conjunction with Variation Hunter)
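
To make "mismatches, no indels" concrete, here is a minimal Python sketch of gap-free read mapping under a mismatch budget. It is a brute-force illustration only, not mrsFAST's cache-oblivious algorithm; the function names `hamming_within` and `map_read` and the `max_mismatches` parameter are hypothetical and not part of any mrsFAST interface.

```python
def hamming_within(a: str, b: str, max_mismatches: int) -> bool:
    """Return True if two equal-length strings differ at no more than
    max_mismatches positions (substitutions only, no gaps)."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def map_read(read: str, reference: str, max_mismatches: int) -> list[int]:
    """Return every 0-based reference offset where the read aligns
    without insertions or deletions, within the mismatch budget.

    Hypothetical helper: scans every window of the reference, which is
    far slower than an indexed mapper but shows the acceptance rule."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming_within(read, window, max_mismatches):
            hits.append(start)
    return hits
```

Because the read is compared only against same-length windows, an alignment that would need an inserted or deleted base is never reported, which is the restriction the feature list describes.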


NCBI Sequence Read Archive (SRA) (tool)

RRID:SCR_004891

Repository of raw sequencing data from next-generation sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Data submissions are welcome. Archive of high-throughput sequencing data, part of the international partnership of archives (INSDC) at NCBI, the European Bioinformatics Institute, and the DNA Data Bank of Japan. Data submitted to any of these three organizations are shared among them.


ClinVar (tool)

RRID:SCR_006169

Archive of aggregated information about sequence variation and its relationship to human health. Provides reports of relationships among human variations and phenotypes along with supporting evidence. Submissions from clinical testing labs, research labs, locus-specific databases, expert panels and professional societies are welcome. Collects reports of variants found in patient samples, assertions made regarding their clinical significance, information about submitter, and other supporting data. Alleles described in submissions are mapped to reference sequences, and reported according to HGVS standard.


VARSCAN (tool)

RRID:SCR_006849

A platform-independent, technology-independent software tool for identifying SNPs and indels in massively parallel sequencing of individual and pooled samples. Given data for a single sample, VarScan identifies and filters germline variants based on read counts, base quality, and allele frequency. Given data for a tumor-normal pair, VarScan also determines the somatic status of each variant (Germline, Somatic, or LOH) by comparing read counts between samples. (entry from Genetic Analysis Software)


Single Read Paired Read Indel Substitution Minimizer (tool)

RRID:SCR_018023

Software tool as single read paired read indel substitution minimizer.
