Haplotype phasing of genetic variants is important for clinical interpretation of the genome, population genetic analysis and functional genomic analysis of allelic activity. Here we present phASER, an accurate approach for phasing variants that are overlapped by sequencing reads, including those from RNA sequencing (RNA-seq), which often span multiple exons due to splicing. Using diverse RNA-seq data we demonstrate that this provides more accurate phasing of rare variants compared with population-based phasing and allows phasing of variants in the same gene up to hundreds of kilobases away that cannot be obtained from DNA sequencing (DNA-seq) reads. We show that in the context of medical genetic studies this improves the resolution of compound heterozygotes. Additionally, phASER provides measures of haplotypic expression that increase power and accuracy in studies of allelic expression. In summary, phasing using RNA-seq and phASER is accurate and improves studies where rare variant haplotypes or allelic expression is needed.
Pubmed ID: 27605262 RIS Download
Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.
International collaboration producing an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts, in an effort to provide a foundation for investigating the relationship between genotype and phenotype. The genomes of about 2500 unidentified people from about 25 populations around the world were sequenced using next-generation sequencing technologies. Redundant sequencing on various platforms and by different groups of scientists of the same samples can be compared. The results of the study are freely and publicly accessible to researchers worldwide. The consortium identified the following populations whose DNA will be sequenced: Yoruba in Ibadan, Nigeria; Japanese in Tokyo; Chinese in Beijing; Utah residents with ancestry from northern and western Europe; Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Toscani in Italy; Gujarati Indians in Houston; Chinese in metropolitan Denver; people of Mexican ancestry in Los Angeles; and people of African ancestry in the southwestern United States. The goal Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing. All samples from the 1000 genomes are available as lymphoblastoid cell lines (LCLs) and LCL derived DNA from the Coriell Cell Repository as part of the NHGRI Catalog. The sequence and alignment data generated by the 1000genomes project is made available as quickly as possible via their mirrored ftp sites. ftp://ftp.1000genomes.ebi.ac.uk ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes
View all literature mentionsNumPy is the fundamental package needed for scientific computing with Python. It contains among other things: * a powerful N-dimensional array object * sophisticated (broadcasting) functions * tools for integrating C/C and Fortran code * useful linear algebra, Fourier transform, and random number capabilities. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. Sponsored by ENTHOUGHT
View all literature mentionsA software package to analyze next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. This software library makes writing efficient analysis tools using next-generation sequencing data very easy, and second it's a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include things like a depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner. (entry from Genetic Analysis Software)
View all literature mentionsA web-based hosting service for software development projects that use the Git revision control system offering powerful collaboration, code review, and code management. It offers both paid plans for private repositories, and free accounts for open source projects. Large or small, every repository comes with the same powerful tools. These tools are open to the community for public projects and secure for private projects. Features include: * Integrated issue tracking * Collaborative code review * Easily manage teams within organizations * Text entry with understated power * A growing list of programming languages and data formats * On the desktop and in your pocket - Android app and mobile web views let you keep track of your projects on the go.
View all literature mentionsA next-generation web-based application that aims to provide an integrated solution for both visualization and analysis of deep-sequencing data, along with simple access to public datasets.
View all literature mentionsA powerful toolset for genome arithmetic allowing one to address common genomics tasks such as finding feature overlaps and computing coverage. Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
View all literature mentionsA Python-based environment of open-source software for mathematics, science, and engineering. The core packages of SciPy include: NumPy, a base N-dimensional array package; SciPy Library, a fundamental library for scientific computing; and IPython, an enhanced interactive console.
View all literature mentionsA max-cut based algorithm for haplotype assembly using sequence reads from the two chromosomes of an individual.
View all literature mentionsCrystallographic software which solves structures using algorithms and automated rapid search calculations to perform molecular replacement and experimental phasing methods.
View all literature mentions