Major international projects are underway that are aimed at creating a comprehensive catalogue of all the genes responsible for the initiation and progression of cancer. These studies involve the sequencing of matched tumour-normal samples followed by mathematical analysis to identify those genes in which mutations occur more frequently than expected by random chance. Here we describe a fundamental problem with cancer genome studies: as the sample size increases, the list of putatively significant genes produced by current analytical methods burgeons into the hundreds. The list includes many implausible genes (such as those encoding olfactory receptors and the muscle protein titin), suggesting extensive false-positive findings that overshadow true driver events. We show that this problem stems largely from mutational heterogeneity and provide a novel analytical methodology, MutSigCV, for resolving the problem. We apply MutSigCV to exome sequences from 3,083 tumour-normal pairs and discover extraordinary variation in mutation frequency and spectrum within cancer types, which sheds light on mutational processes and disease aetiology, and in mutation frequency across the genome, which is strongly correlated with DNA replication timing and also with transcriptional activity. By incorporating mutational heterogeneity into the analyses, MutSigCV is able to eliminate most of the apparent artefactual findings and enable the identification of genes truly associated with cancer.
PubMed ID: 23770567
Project exploring the spectrum of genomic changes involved in more than 20 types of human cancer that provides a platform for researchers to search, download, and analyze data sets generated. As a pilot project it confirmed that an atlas of changes could be created for specific cancer types. It also showed that a national network of research and technology teams working on distinct but related projects could pool the results of their efforts, create an economy of scale and develop an infrastructure for making the data publicly accessible. Its success committed resources to collect and characterize more than 20 additional tumor types.

Components of the TCGA Research Network:
* Biospecimen Core Resource (BCR): Tissue samples are carefully cataloged, processed, checked for quality and stored, complete with important medical information about the patient.
* Genome Characterization Centers (GCCs): Several technologies will be used to analyze genomic changes involved in cancer. The genomic changes that are identified will be further studied by the Genome Sequencing Centers.
* Genome Sequencing Centers (GSCs): High-throughput Genome Sequencing Centers will identify the changes in DNA sequences that are associated with specific types of cancer.
* Proteome Characterization Centers (PCCs): The centers, a component of NCI's Clinical Proteomic Tumor Analysis Consortium, will ascertain and analyze the total proteomic content of a subset of TCGA samples.
* Data Coordinating Center (DCC): The information that is generated by TCGA will be centrally managed at the DCC and entered into the TCGA Data Portal and Cancer Genomics Hub as it becomes available. Centralization of data facilitates data transfer between the network and the research community, and makes data analysis more efficient. The DCC manages the TCGA Data Portal.
* Cancer Genomics Hub (CGHub): Lower level sequence data will be deposited into a secure repository. This database stores cancer genome sequences and alignments.
* Genome Data Analysis Centers (GDACs): Immense amounts of data from array and second-generation sequencing technologies must be integrated across thousands of samples. These centers will provide novel informatics tools to the entire research community to facilitate broader use of TCGA data.

TCGA is actively developing a network of collaborators who are able to provide samples that are collected retrospectively (tissues that had already been collected and stored) or prospectively (tissues that will be collected in the future).
Java toolset for working with next generation sequencing data in the BAM format.
Biomedical and genomic research center located in Cambridge, Massachusetts, United States. It is a nonprofit research organization operating under the name Broad Institute Inc., in partnership with the Massachusetts Institute of Technology, Harvard University, and the five Harvard teaching hospitals. It is dedicated to advancing the understanding of the biology and treatment of human disease in order to improve human health.
Software that analyzes lists of mutations discovered in DNA sequencing, to identify genes that were mutated more often than expected by chance given background mutation processes.
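The idea of testing whether a gene is mutated more often than expected by chance, and the way the assumed background rate changes the answer, can be sketched with a simple Poisson model. This is only an illustration and not the MutSigCV algorithm: the function names, the gene length, the cohort size, the mutation count, and both background rates below are hypothetical values chosen for the example. It compares one large gene under a single genome-wide background rate and under a higher, gene-specific rate of the kind that might apply to a late-replicating, weakly expressed gene.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the tail directly."""
    if k <= 0:
        return 1.0
    # Terms far beyond the mean are negligible, so a finite upper limit suffices.
    centre = max(k, lam)
    upper = int(centre + 10 * math.sqrt(centre) + 50)
    total = 0.0
    for i in range(k, upper + 1):
        # Compute each probability in log space to avoid overflow in i!.
        log_term = -lam + i * math.log(lam) - math.lgamma(i + 1)
        total += math.exp(log_term)
    return min(1.0, total)


def gene_p_value(n_mutations, coding_length_bp, n_samples, rate_per_base):
    """P-value for seeing at least n_mutations in a gene across a cohort.

    Assumes mutations arrive independently at rate_per_base per base per
    sample (a simplification made for this sketch).
    """
    expected = coding_length_bp * n_samples * rate_per_base
    return poisson_upper_tail(n_mutations, expected)


# Hypothetical inputs: a very long gene observed in a cohort of 100 tumours.
GENE_LENGTH_BP = 100_000
N_SAMPLES = 100
OBSERVED_MUTATIONS = 40

# One genome-wide rate versus a higher rate specific to this gene's context.
UNIFORM_RATE = 1e-6
GENE_SPECIFIC_RATE = 4e-6

p_uniform = gene_p_value(OBSERVED_MUTATIONS, GENE_LENGTH_BP, N_SAMPLES, UNIFORM_RATE)
p_specific = gene_p_value(OBSERVED_MUTATIONS, GENE_LENGTH_BP, N_SAMPLES, GENE_SPECIFIC_RATE)

print(f"p-value with uniform background rate:       {p_uniform:.3g}")
print(f"p-value with gene-specific background rate: {p_specific:.3g}")
```

Under the uniform rate the gene looks highly significant, while under the gene-specific rate the same mutation count is unremarkable. In this toy setting, that is the kind of false positive that accounting for mutational heterogeneity is meant to remove.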