FDI Lab - SciCrunch.org | Searching in Literature

Ensembl 2011.

Paul Flicek‎ et al.
Nucleic acids research‎
2011‎

The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.

GENCODE 2021.

Adam Frankish‎ et al.
Nucleic acids research‎
2021‎

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

GENCODE reference annotation for the human and mouse genomes.

Adam Frankish‎ et al.
Nucleic acids research‎
2019‎

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

Ensembl 2012.

Paul Flicek‎ et al.
Nucleic acids research‎
2012‎

The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

Thomas A Down‎ et al.
Nucleic acids research‎
2005‎

NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range scenario, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.

Large-scale discovery of promoter motifs in Drosophila melanogaster.

Thomas A Down‎ et al.
PLoS computational biology‎
2007‎

A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.

Ensembl 2014.

Paul Flicek‎ et al.
Nucleic acids research‎
2014‎

Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.

A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

Thomas Down‎ et al.
BMC bioinformatics‎
2006‎

The splicing of RNA transcripts is thought to be partly promoted and regulated by sequences embedded within exons. Known sequences include binding sites for SR proteins, which are thought to mediate interactions between splicing factors bound to the 5' and 3' splice sites. It would be useful to identify further candidate sequences, however identifying them computationally is hard since exon sequences are also constrained by their functional role in coding for proteins.

What can we learn from noncoding regions of similarity between genomes?

Thomas A Down‎ et al.
BMC bioinformatics‎
2004‎

In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.

Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease.

Honghan Wu‎ et al.
Database : the journal of biological databases and curation‎
2017‎

Neurodegenerative disorders such as Parkinson's and Alzheimer's disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F 1 -measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which reveals that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases, the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process.

Ensembl's 10th year.

Paul Flicek‎ et al.
Nucleic acids research‎
2010‎

Ensembl (http://www.ensembl.org) integrates genomic information for a comprehensive set of chordate genomes with a particular focus on resources for human, mouse, rat, zebrafish and other high-value sequenced genomes. We provide complete gene annotations for all supported species in addition to specific resources that target genome variation, function and evolution. Ensembl data is accessible in a variety of formats including via our genome browser, API and BioMart. This year marks the tenth anniversary of Ensembl and in that time the project has grown with advances in genome technology. As of release 56 (September 2009), Ensembl supports 51 species including marmoset, pig, zebra finch, lizard, gorilla and wallaby, which were added in the past year. Major additions and improvements to Ensembl since our previous report include the incorporation of the human GRCh37 assembly, enhanced visualisation and data-mining options for the Ensembl regulatory features and continued development of our software infrastructure.

Data growth and its impact on the SCOP database: new developments.

Antonina Andreeva‎ et al.
Nucleic acids research‎
2008‎

The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

GENCODE: reference annotation for the human and mouse genomes in 2023.

Adam Frankish‎ et al.
Nucleic acids research‎
2023‎

GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

Ensembl 2013.

Paul Flicek‎ et al.
Nucleic acids research‎
2013‎

The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidenced-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.

Searching across hundreds of databases

Our searching services are busy right now. Your search will reload in five seconds.

Ensembl 2011.

GENCODE 2021.

GENCODE reference annotation for the human and mouse genomes.

Ensembl 2012.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

Large-scale discovery of promoter motifs in Drosophila melanogaster.

Ensembl 2014.

A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

What can we learn from noncoding regions of similarity between genomes?

Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease.

Ensembl's 10th year.

Data growth and its impact on the SCOP database: new developments.

GENCODE: reference annotation for the human and mouse genomes in 2023.

Ensembl 2013.

SciCrunch.org Resources

Navigation

Logging in and Registering

Searching

Save Your Search

Query Expansion

Collections

Facets

Options

Further Questions

About

Recent News Entries

Contact Us

SciCrunch

Searching across hundreds of databases

Our searching services are busy right now. Your search will reload in five seconds.

Log in

Log in

Literature

Current Facets and Filters

Options

Facets

Recent searches

.in-collection { color: green; } Ensembl 2011.

.in-collection { color: green; } GENCODE 2021.

.in-collection { color: green; } GENCODE reference annotation for the human and mouse genomes.

.in-collection { color: green; } Ensembl 2012.

.in-collection { color: green; } NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

.in-collection { color: green; } Large-scale discovery of promoter motifs in Drosophila melanogaster.

.in-collection { color: green; } Ensembl 2014.

.in-collection { color: green; } A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

.in-collection { color: green; } What can we learn from noncoding regions of similarity between genomes?

.in-collection { color: green; } Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease.

.in-collection { color: green; } Ensembl's 10th year.

.in-collection { color: green; } Data growth and its impact on the SCOP database: new developments.

.in-collection { color: green; } GENCODE: reference annotation for the human and mouse genomes in 2023.

.in-collection { color: green; } Ensembl 2013.

SciCrunch.org Resources

Navigation

Logging in and Registering

Searching

Save Your Search

Query Expansion

Collections

Facets

Options

Further Questions

Publications Per Year

About

Recent News Entries

Contact Us

SciCrunch

Ensembl 2011.

GENCODE 2021.

GENCODE reference annotation for the human and mouse genomes.

Ensembl 2012.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

Large-scale discovery of promoter motifs in Drosophila melanogaster.

Ensembl 2014.

A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

What can we learn from noncoding regions of similarity between genomes?

Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease.

Ensembl's 10th year.

Data growth and its impact on the SCOP database: new developments.

GENCODE: reference annotation for the human and mouse genomes in 2023.

Ensembl 2013.