tmChem: a high performance approach for chemical named entity recognition and normalization.

Journal of cheminformatics | 2015

Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying the chemical mentions, their properties, and their relationships as discussed in the literature. We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and an f-measure of 0.8745 on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high-recall variant achieved the highest recall on both the CEM and CDI tasks. We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance. The source code and a trained model for each of tmChem's two models are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator.
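
The abstract reports micro-averaged f-measures at two levels of evaluation. As a rough illustration of how such scores can be computed, the Python sketch below pools true positives over all abstracts before computing precision, recall, and f-measure. It is not the official CHEMDNER scorer: the function names, the toy data, and the reading of "mention-level" as exact character spans and "abstract-level" as unique mention strings per abstract are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record layout: (abstract id, start offset, end offset, mention text).
Annotation = Tuple[str, int, int, str]


def micro_prf(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and f-measure over pooled items."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def mention_level(annotations: Iterable[Annotation]) -> Set[Tuple[str, int, int]]:
    """Assumed mention-level view: every exact span in every abstract counts."""
    return {(doc, start, end) for doc, start, end, _ in annotations}


def abstract_level(annotations: Iterable[Annotation]) -> Set[Tuple[str, str]]:
    """Assumed abstract-level view: each unique mention string counts once per abstract."""
    return {(doc, text) for doc, _, _, text in annotations}


# Toy data, invented for illustration only.
gold = [
    ("d1", 0, 7, "aspirin"),
    ("d1", 20, 27, "aspirin"),
    ("d2", 5, 12, "ethanol"),
]
predicted = [
    ("d1", 0, 7, "aspirin"),
    ("d2", 5, 12, "ethanol"),
    ("d2", 30, 37, "glucose"),
]

mention_scores = micro_prf(mention_level(gold), mention_level(predicted))
abstract_scores = micro_prf(abstract_level(gold), abstract_level(predicted))
print("mention-level  P/R/F:", mention_scores)
print("abstract-level P/R/F:", abstract_scores)
```

In this toy case the missed second "aspirin" span lowers mention-level recall but not abstract-level recall, which is one reason the two subtasks can yield different scores for the same system output.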

PubMed ID: 25810774

Research resources used in this publication

None found

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


BioCreative (tool)

RRID:SCR_006311

Community-wide effort (Challenge) for evaluating text mining and information extraction systems applied to the biological domain. It is focused on the comparison of methods and the community assessment of scientific progress, rather than on the purely competitive aspects. There is considerable difficulty in constructing suitable gold standard data for training and testing new information extraction systems which handle life science literature. Thus the data sets derived from the BioCreAtIvE challenge, because they have been examined by biological database curators and domain experts, serve as useful resources for the development of new applications as well as helping to improve existing ones. Two main issues are addressed at BioCreAtIvE, both concerned with the extraction of biologically relevant and useful information from the literature. The first one is concerned with the detection of biologically significant entities (names) such as gene and protein names and their association to existing database entries. The second one is concerned with the detection of entity-fact associations (e.g. protein-functional term associations).


Comparative Toxicogenomics Database (CTD) (tool)

RRID:SCR_006530

A public database that enhances understanding of the effects of environmental chemicals on human health. Integrated GO data and a GO browser add functionality to CTD by allowing users to understand biological functions, processes and cellular locations that are the targets of chemical exposures. CTD includes curated data describing cross-species chemical–gene/protein interactions, chemical–disease and gene–disease associations to illuminate molecular mechanisms underlying variable susceptibility and environmentally influenced diseases. These data will also provide insights into complex chemical–gene and protein interaction networks.


Intramural Research Program (tool)

RRID:SCR_012734

A research program of the NIA which focuses on neuroscience, aging biology, and translational gerontology. The central focus of the program's research is understanding age-related changes in physiology and the ability to adapt to environmental stress, and using that understanding to develop insight about the pathophysiology of age-related diseases. The IRP webpage provides access to other NIH resources such as the Biological Biochemical Image Database, the Bioinformatics Portal, and the Baltimore Longitudinal Study of Aging.


CHEBI (tool)

RRID:SCR_002088

Collection of chemical compounds and other small molecular entities that incorporates an ontological classification of chemical compounds of biological relevance, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms.


MeSH (tool)

RRID:SCR_004750

A controlled vocabulary thesaurus that consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH, in machine-readable form, is provided at no charge via electronic means. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as Anatomy or Mental Disorders. More specific headings are found at more narrow levels of the twelve-level hierarchy, such as Ankle and Conduct Disorder. There are 27,149 descriptors in 2014 MeSH. There are also over 218,000 entry terms that assist in finding the most appropriate MeSH Heading, for example, Vitamin C is an entry term to Ascorbic Acid. In addition to these headings, there are more than 219,000 headings called Supplementary Concept Records (formerly Supplementary Chemical Records) within a separate thesaurus. The MeSH thesaurus is used by NLM for indexing articles from 5,400 of the world's leading biomedical journals for the MEDLINE/PubMed database. It is also used for the NLM-produced database that includes cataloging of books, documents, and audiovisuals acquired by the Library. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
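
The entry-term mechanism described above (Vitamin C leading to the heading Ascorbic Acid) can be pictured as a lookup from normalized term strings to preferred headings. The Python sketch below is a toy illustration only: the table `ENTRY_TERMS` and the function `to_heading` are hypothetical names, the table holds just the one pair given in the description, and real MeSH data would have to be loaded from the files NLM distributes.

```python
from typing import Dict, Optional

# Hypothetical in-memory table mapping lower-cased entry terms to headings.
# Only the example pair from the description is included.
ENTRY_TERMS: Dict[str, str] = {
    "vitamin c": "Ascorbic Acid",
    "ascorbic acid": "Ascorbic Acid",
}


def to_heading(term: str) -> Optional[str]:
    """Return the preferred heading for a term, or None if it is unknown."""
    # Lower-case and collapse runs of whitespace before the lookup.
    key = " ".join(term.lower().split())
    return ENTRY_TERMS.get(key)


print(to_heading("Vitamin C"))
print(to_heading("not a known term"))
```

A plain dictionary is enough for exact matches after case and whitespace normalization; handling spelling variants or partial matches would need a different approach.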
