Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.
PubMed ID: 19622167
Textpresso is an information extraction and processing package for biological literature that can be used online or installed locally from a downloadable software package (http://www.textpresso.org/downloads.html). Its two major elements are (1) access to full text, so that entire articles can be searched, and (2) categories of biological concepts and classes that relate two objects (e.g., association, regulation) or describe one (e.g., methods). A search engine lets the user search for one or a combination of these categories and/or keywords across an entire literature.
The Textpresso project serves the biological and biomedical research community by providing:
* Full-text literature searches of model organism research and subject-specific articles at individual sites. The major elements of these search engines are (1) access to full text, so that the entire content of articles can be searched, and (2) search capabilities using categories of biological concepts and classes that relate two objects (e.g., association, regulation) or identify one (e.g., cell, gene, allele). The search engines are flexible: users can query the entire literature using keywords, one or more categories, or a combination of keywords and categories.
* Text classification and mining of biomedical literature for database curation. These tools help database curators identify and extract biological entities and facts from the full text of research articles. Examples of entity identification and extraction include new allele and gene names and human disease gene orthologs; examples of fact identification and extraction include sentence retrieval for curating gene-gene regulation, Gene Ontology (GO) cellular component annotations, and GO molecular function annotations. The tools also classify papers according to curation needs. They employ a variety of methods, such as hidden Markov models, support vector machines, conditional random fields, and pattern matching. Collaborators include WormBase, FlyBase, SGD, TAIR, dictyBase, and the Neuroscience Information Framework, and the project looks forward to collaborating with more model organism databases and projects.
* Linking biological entities in PDF and online journal articles to online databases. The project has established a journal article markup pipeline that links select content of Genetics journal articles to model organism databases such as WormBase and SGD. The entity markup pipeline links more than nine classes of objects, including genes, proteins, alleles, phenotypes, and anatomical terms, to the appropriate page at each database. The first article published with online and PDF-embedded hyperlinks to WormBase appeared in the September 2009 issue of Genetics. As of January 2011, around 70 articles had been processed, with processing to continue indefinitely. Extension of this pipeline to other journals and model organism databases is planned.
Textpresso is useful both as a search engine for researchers and as a curation tool. It was developed as part of WormBase and is used extensively by C. elegans curators. Textpresso is currently implemented for 24 different literatures, among them Neuroscience, and can readily be extended to other corpora of text.
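The combination of keyword and category search described above can be illustrated with a minimal sketch. This is not Textpresso's code or API: the `CATEGORIES` lexicon, the function names, and the sentence-level matching rule are all assumptions made for illustration. Here a sentence matches when it contains every requested keyword and at least one term from every requested category.

```python
import re

# Hypothetical category lexicon (an assumption, not Textpresso's ontology):
# each category name maps to the set of terms that belong to it.
CATEGORIES = {
    "regulation": {"regulates", "represses", "activates"},
    "gene": {"lin-12", "glp-1", "daf-16"},
}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return the set of lowercased tokens, keeping hyphens inside names."""
    return set(re.findall(r"[a-z0-9][a-z0-9\-]*", sentence.lower()))


def search(text, keywords=(), categories=()):
    """Return sentences containing all keywords and a term from each category."""
    hits = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if not all(k.lower() in tokens for k in keywords):
            continue
        if all(tokens & CATEGORIES[c] for c in categories):
            hits.append(sentence)
    return hits


if __name__ == "__main__":
    article = (
        "The gene lin-12 regulates vulval fate. "
        "We used standard methods. "
        "daf-16 is expressed in the intestine."
    )
    # Combine one keyword with two categories.
    for hit in search(article, keywords=["vulval"], categories=["gene", "regulation"]):
        print(hit)
```

The sentence splitter is deliberately simple and would mis-split abbreviations such as species names, so a real system would need a more careful tokenizer and a much larger, curated lexicon.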