A Maximum-Entropy approach for accurate document annotation in the biomedical domain.

Journal of biomedical semantics | 2012

The growing volume of scientific literature on the Web and the lack of efficient tools for classifying and searching documents are the two most important factors affecting the speed of search and the quality of its results. Previous studies have shown that ontologies make it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and is a step towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH). The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to term ambiguity, and performs very well even when only a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) over the full range of explored terms (4,078 MeSH terms), and that its performance is resilient to term ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) on the explored MeSH terms that are ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach achieved a higher F-measure on both ambiguous and monosemous MeSH terms.
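The relation between the precision, recall, and F-measure figures quoted above can be sketched as follows. This is a minimal illustration, assuming the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used. The function names and the two-term toy values are hypothetical. Plugging the averaged precision and recall into the formula gives a slightly higher value than the reported average F-measure, which is what one would expect if the F-measure was computed per MeSH term and then averaged.

```python
from statistics import mean


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_measure(per_term: list[tuple[float, float]]) -> float:
    """Average of per-term F-measures (assumed averaging scheme, not confirmed by the source)."""
    return mean(f_measure(p, r) for p, r in per_term)


# Averaged precision and recall reported in the abstract, in percent.
f_all_terms = f_measure(99.41, 86.77)
f_ambiguous = f_measure(99.32, 86.87)
print(round(f_all_terms, 2), round(f_ambiguous, 2))  # 92.66 92.68

# Toy per-term (precision, recall) pairs, NOT taken from the paper.
toy_terms = [(1.0, 0.5), (1.0, 1.0)]
avg_of_f = average_f_measure(toy_terms)
f_of_avg = f_measure(mean(p for p, _ in toy_terms), mean(r for _, r in toy_terms))
print(avg_of_f <= f_of_avg)  # True
```

Because the harmonic mean is concave, the average of per-term F-measures never exceeds the F-measure of the averaged precision and recall, which is consistent with the reported 92.4% and 92.42%.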

PubMed ID: 22541593

Research resources used in this publication

None found

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


PubMed (tool)

RRID:SCR_004846

Public bibliographic database that provides access to citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. PubMed citations and abstracts include fields of biomedicine and health, covering portions of life sciences, behavioral sciences, chemical sciences, and bioengineering. Provides access to additional relevant web sites and links to other NCBI molecular biology resources. Publishers of journals can submit their citations to NCBI and then provide access to full-text of articles at journal web sites using LinkOut.


Unified Medical Language System (tool)

RRID:SCR_006363

Database of key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services, including electronic health records. This set of files and software brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. Users can use the UMLS to enhance or develop applications, such as electronic health records, classification tools, dictionaries and language translators.

The UMLS has three tools, which we call the Knowledge Sources:

* Metathesaurus: Terms and codes from many vocabularies, including CPT, ICD-10-CM, LOINC, MeSH, RxNorm, and SNOMED CT
* Semantic Network: Broad categories (semantic types) and their relationships (semantic relations)
* SPECIALIST Lexicon and Lexical Tools: Natural language processing tools

We use the Semantic Network and Lexical Tools to produce the Metathesaurus. Metathesaurus production involves:

* Processing the terms and codes using the Lexical Tools
* Grouping synonymous terms into concepts
* Categorizing concepts by semantic types from the Semantic Network
* Incorporating relationships and attributes provided by vocabularies
* Releasing the data in a common format

Although we integrate these tools for Metathesaurus production, you can access them separately or in any combination according to your needs. The UMLS Terminology Services (UTS) provides three ways to access the UMLS: Web Browsers, Local Installation, and Web Services APIs.


MeSH (tool)

RRID:SCR_004750

A controlled vocabulary thesaurus that consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH, in machine-readable form, is provided at no charge via electronic means. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as Anatomy or Mental Disorders. More specific headings are found at more narrow levels of the twelve-level hierarchy, such as Ankle and Conduct Disorder. There are 27,149 descriptors in 2014 MeSH. There are also over 218,000 entry terms that assist in finding the most appropriate MeSH Heading, for example, Vitamin C is an entry term to Ascorbic Acid. In addition to these headings, there are more than 219,000 headings called Supplementary Concept Records (formerly Supplementary Chemical Records) within a separate thesaurus. The MeSH thesaurus is used by NLM for indexing articles from 5,400 of the world's leading biomedical journals for the MEDLINE/PubMed database. It is also used for the NLM-produced database that includes cataloging of books, documents, and audiovisuals acquired by the Library. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
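The role of entry terms described above can be illustrated with a small lookup sketch. This is a toy model under stated assumptions, not how NLM stores or serves MeSH: the `HEADINGS` and `ENTRY_TERMS` tables and the `resolve_heading` function are hypothetical, and they contain only the headings and the single entry-term pair named in the description.

```python
from typing import Optional

# Hypothetical miniature vocabulary built only from names mentioned in the description.
HEADINGS = {"Anatomy", "Mental Disorders", "Ankle", "Conduct Disorder", "Ascorbic Acid"}

# Entry terms point a searcher to the preferred MeSH heading.
ENTRY_TERMS = {"vitamin c": "Ascorbic Acid"}

# Case-insensitive index of the headings themselves.
_HEADING_INDEX = {heading.casefold(): heading for heading in HEADINGS}


def resolve_heading(term: str) -> Optional[str]:
    """Return the heading for a heading name or entry term, or None if unknown."""
    key = " ".join(term.split()).casefold()  # normalise case and whitespace
    if key in _HEADING_INDEX:
        return _HEADING_INDEX[key]
    return ENTRY_TERMS.get(key)


print(resolve_heading("Vitamin C"))      # Ascorbic Acid
print(resolve_heading("ascorbic acid"))  # Ascorbic Acid
print(resolve_heading("aspirin"))        # None
```

Keeping entry terms in a separate table from the headings mirrors the distinction the description draws between descriptors and the entry terms that lead to them.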
