Searching across hundreds of databases

Our searching services are busy right now. Your search will reload in five seconds.

X
Forgot Password

If you have forgotten your password you can enter your email here and get a temporary password sent to your email.

X
Forgot Password

If you have forgotten your password you can enter your email here and get a temporary password sent to your email.

Biclustering methods: biological relevance and application in gene expression analysis.

PloS one | 2014

DNA microarray technologies are used extensively to profile the expression levels of thousands of genes under various conditions, yielding extremely large data-matrices. Thus, analyzing this information and extracting biologically relevant knowledge becomes a considerable challenge. A classical approach for tackling this challenge is to use clustering (also known as one-way clustering) methods where genes (or respectively samples) are grouped together based on the similarity of their expression profiles across the set of all samples (or respectively genes). An alternative approach is to develop biclustering methods to identify local patterns in the data. These methods extract subgroups of genes that are co-expressed across only a subset of samples and may feature important biological or medical implications. In this study we evaluate 13 biclustering and 2 clustering (k-means and hierarchical) methods. We use several approaches to compare their performance on two real gene expression data sets. For this purpose we apply four evaluation measures in our analysis: (1) we examine how well the considered (bi)clustering methods differentiate various sample types; (2) we evaluate how well the groups of genes discovered by the (bi)clustering methods are annotated with similar Gene Ontology categories; (3) we evaluate the capability of the methods to differentiate genes that are known to be specific to the particular sample types we study and (4) we compare the running time of the algorithms. In the end, we conclude that as long as the samples are well defined and annotated, the contamination of the samples is limited, and the samples are well replicated, biclustering methods such as Plaid and SAMBA are useful for discovering relevant subsets of genes and samples.

Pubmed ID: 24651574 RIS Download

Research resources used in this publication

None found

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


The Cancer Genome Atlas (tool)

RRID:SCR_003193

Project exploring the spectrum of genomic changes involved in more than 20 types of human cancer that provides a platform for researchers to search, download, and analyze data sets generated. As a pilot project it confirmed that an atlas of changes could be created for specific cancer types. It also showed that a national network of research and technology teams working on distinct but related projects could pool the results of their efforts, create an economy of scale and develop an infrastructure for making the data publicly accessible. Its success committed resources to collect and characterize more than 20 additional tumor types. Components of the TCGA Research Network: * Biospecimen Core Resource (BCR); Tissue samples are carefully cataloged, processed, checked for quality and stored, complete with important medical information about the patient. * Genome Characterization Centers (GCCs); Several technologies will be used to analyze genomic changes involved in cancer. The genomic changes that are identified will be further studied by the Genome Sequencing Centers. * Genome Sequencing Centers (GSCs); High-throughput Genome Sequencing Centers will identify the changes in DNA sequences that are associated with specific types of cancer. * Proteome Characterization Centers (PCCs); The centers, a component of NCI's Clinical Proteomic Tumor Analysis Consortium, will ascertain and analyze the total proteomic content of a subset of TCGA samples. * Data Coordinating Center (DCC); The information that is generated by TCGA will be centrally managed at the DCC and entered into the TCGA Data Portal and Cancer Genomics Hub as it becomes available. Centralization of data facilitates data transfer between the network and the research community, and makes data analysis more efficient. The DCC manages the TCGA Data Portal. * Cancer Genomics Hub (CGHub); Lower level sequence data will be deposited into a secure repository. This database stores cancer genome sequences and alignments. * Genome Data Analysis Centers (GDACs) - Immense amounts of data from array and second-generation sequencing technologies must be integrated across thousands of samples. These centers will provide novel informatics tools to the entire research community to facilitate broader use of TCGA data. TCGA is actively developing a network of collaborators who are able to provide samples that are collected retrospectively (tissues that had already been collected and stored) or prospectively (tissues that will be collected in the future).

View all literature mentions

Bioconductor (tool)

RRID:SCR_006442

Software repository for R packages related to analysis and comprehension of high throughput genomic data. Uses separate set of commands for installation of packages. Software project based on R programming language that provides tools for analysis and comprehension of high throughput genomic data.

View all literature mentions

InterPro (tool)

RRID:SCR_006695

Service providing functional analysis of proteins by classifying them into families and predicting domains and important sites. They combine protein signatures from a number of member databases into a single searchable resource, capitalizing on their individual strengths to produce a powerful integrated database and diagnostic tool. This integrated database of predictive protein signatures is used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures. You can access the data programmatically, via Web Services. The member databases use a number of approaches: # ProDom: provider of sequence-clusters built from UniProtKB using PSI-BLAST. # PROSITE patterns: provider of simple regular expressions. # PROSITE and HAMAP profiles: provide sequence matrices. # PRINTS provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs). # PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: are providers of hidden Markov models (HMMs). Your contributions are welcome. You are encouraged to use the ''''Add your annotation'''' button on InterPro entry pages to suggest updated or improved annotation for individual InterPro entries.

View all literature mentions

ArrayExpress (tool)

RRID:SCR_002964

International functional genomics data collection generated from microarray or next-generation sequencing (NGS) platforms. Repository of functional genomics data supporting publications. Provides genes expression data for reuse to the research community where they can be queried and downloaded. Integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Contains a subset of curated and re-annotated Archive data which can be queried for individual gene expression under different biological conditions across experiments. Data collected to MIAME and MINSEQE standards. Data are submitted by users or are imported directly from the NCBI Gene Expression Omnibus.

View all literature mentions

Entrez Gene (tool)

RRID:SCR_002473

Database for genomes that have been completely sequenced, have active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. Includes nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases. All entries follow NCBI's format for data collections. Content of Entrez Gene represents result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers. Content is updated as new information becomes available.

View all literature mentions