The human genome encodes 1500-2000 different transcription factors (TFs). ChIP-seq is revealing the global binding profiles of a fraction of TFs in a fraction of their biological contexts. These data show that the majority of TFs bind directly next to a large number of context-relevant target genes, that most binding is distal, and that binding is context specific. Because of the effort and cost involved, ChIP-seq is seldom used in search of novel TF function. Such exploration is instead done using expression perturbation and genetic screens. Here we propose a comprehensive computational framework for transcription factor function prediction. We curate 332 high-quality nonredundant TF binding motifs that represent all major DNA binding domains, and improve cross-species conserved binding site prediction to obtain 3.3 million conserved, mostly distal, binding site predictions. We combine these with 2.4 million facts about all human and mouse gene functions, in a novel statistical framework, in search of enrichments of particular motifs next to groups of target genes of particular functions. Rigorous parameter tuning and a harsh null are used to minimize false positives. Our novel PRISM (predicting regulatory information from single motifs) approach obtains 2543 TF function predictions in a large variety of contexts, at a false discovery rate of 16%. The predictions are highly enriched for validated TF roles, and 45 of 67 (67%) tested binding site regions in five different contexts act as enhancers in functionally matched cells.
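The core idea of testing whether a motif's predicted target genes are enriched for a particular function can be illustrated with a simple over-representation test. The sketch below is not the PRISM statistic, which the abstract does not specify; it assumes a plain hypergeometric tail probability as a stand-in, and the function names and gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def motif_term_enrichment(motif_targets, term_genes, all_genes):
    """Return (overlap, p-value) for one motif and one functional term.

    Assumption: a hypergeometric test stands in for the paper's own
    statistical framework, which is not described in the abstract.
    """
    universe = set(all_genes)
    targets = set(motif_targets) & universe
    annotated = set(term_genes) & universe
    overlap = len(targets & annotated)
    p = hypergeom_sf(overlap, len(universe), len(annotated), len(targets))
    return overlap, p


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(10)]   # hypothetical gene universe
    term = ["g0", "g1", "g2", "g3"]        # genes annotated with one function
    targets = ["g0", "g1", "g2"]           # genes next to predicted binding sites
    print(motif_term_enrichment(targets, term, genes))
```

In a genome-wide setting this test would be repeated for every motif and term pair, so the resulting p-values would need multiple-testing correction before any false discovery rate could be quoted.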
PubMed ID: 23382538
A collection of Pathway/Genome Databases (PGDBs), each of which describes the genome and metabolic pathways of a single organism. The BioCyc collection provides an electronic reference source on the genomes and metabolic pathways of sequenced organisms. BioCyc PGDBs are generated by software that predicts the metabolic pathway complements of completely sequenced organisms from their genome sequences. They also include the results of several other computational inference procedures applied to these genomes, including predicted operons and predictions of which genes code for missing enzymes in metabolic pathways. The BioCyc Web site provides a suite of software tools for database searching and visualization, for omics data analysis, and for comparative genomics and comparative pathway questions. The databases within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received. Tier 1 PGDBs have been created through intensive manual effort and are updated continuously. Tier 2 PGDBs were computationally generated by the PathoLogic program and have undergone moderate amounts of review and updating. Tier 3 PGDBs were computationally generated by the PathoLogic program and have undergone no review or updating. There are 967 DBs in Tier 3. The downloadable version of BioCyc, which includes the Pathway Tools software, provides more speed and power than the BioCyc Web site.
Database that hosts experimental data from universal protein binding microarray (PBM) experiments (Berger et al., 2006) and their accompanying statistical analyses from prokaryotic and eukaryotic organisms, malarial parasites, yeast, worms, mouse, and human. It provides a centralized resource for accessing comprehensive data on the preferences of proteins for all possible sequence variants ("words") of length k ("k-mers"), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. The database's web tools include a text-based search, a function for assessing motif similarity between user-entered data and database PWMs, and a function for locating putative binding sites along user-entered nucleotide sequences.
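To make the k-mer and PWM terminology concrete, the following is a minimal sketch of how all words of length k can be enumerated and how a PWM can be used to locate putative binding sites along a nucleotide sequence. It does not reproduce the database's own tools: the toy matrix, the pseudocount, the uniform background, and the score threshold are all illustrative assumptions, and only the forward strand is scanned.

```python
from itertools import product
from math import log2

BASES = "ACGT"


def all_kmers(k):
    """Enumerate every DNA word of length k (4**k words)."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def log_odds(pwm, background=0.25, pseudocount=0.01):
    """Convert per-position base probabilities into log2 odds scores.

    Assumptions: uniform background and a small pseudocount so that
    zero-probability bases do not produce log2(0).
    """
    scores = []
    for col in pwm:
        total = sum(col[b] + pseudocount for b in BASES)
        scores.append(
            {b: log2(((col[b] + pseudocount) / total) / background) for b in BASES}
        )
    return scores


def scan(sequence, scores, threshold):
    """Return (start, word, score) for each forward-strand window at or above threshold."""
    width = len(scores)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(scores[i][ch] for i, ch in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical 4-column PWM that strongly prefers the word TGAC.
TOY_PWM = [
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]

if __name__ == "__main__":
    print(len(all_kmers(4)))
    for hit in scan("AATGACGG", log_odds(TOY_PWM), threshold=4.0):
        print(hit)
```

A practical scanner would also score the reverse complement of each window and would choose the threshold from a background score distribution rather than a fixed constant.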
Jurkat is a cancer cell line whose species of origin is Homo sapiens (human).