GREC Corpus

GREC is a semantically annotated corpus of 240 MEDLINE abstracts (167 on E. coli and 73 on human) intended for training information extraction (IE) systems and resources that extract events from biomedical literature. Biologists manually annotated the corpus with events relating to gene regulation. Each event is centered on either a verb (e.g. transcribe) or a nominalized verb (e.g. transcription), and annotation consists of identifying, as exhaustively as possible, the structurally related arguments of that verb or nominalized verb within the same sentence. Each event argument is then assigned the following information:

  • A semantic role from a fixed set of 13 roles tailored to the biomedical domain.
  • A biomedical concept type (where appropriate).

The corpus is available for download in 2 formats:

  • A standoff format, based on the BioNLP'09 Shared Task format
  • An XML format, based on the GENIA event annotation format
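To make the standoff format more concrete, the sketch below reads a few lines written in the general BioNLP'09 Shared Task style (tab-separated `T` text-bound lines and `E` event lines). It is only an illustration under stated assumptions: the sentence, the type names (`Gene`, `Gene_Regulation`), the role names (`Agent`, `Theme`), and the helper `parse_standoff` are hypothetical, and the actual GREC files may differ in detail, so check the annotation guidelines before relying on it.

```python
from dataclasses import dataclass, field


@dataclass
class TextBound:
    """A span of text: an event trigger or an argument."""
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Event:
    """An event anchored on a trigger, with (role, span id) arguments."""
    id: str
    type: str
    trigger: str
    args: list = field(default_factory=list)


def parse_standoff(lines):
    """Parse BioNLP'09-style standoff lines into spans and events."""
    spans, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, rest = line.split("\t", 1)
        if ident.startswith("T"):
            # e.g. "T1<TAB>Gene 0 4<TAB>NarL"
            info, text = rest.split("\t", 1)
            type_, start, end = info.split(" ")
            spans[ident] = TextBound(ident, type_, int(start), int(end), text)
        elif ident.startswith("E"):
            # e.g. "E1<TAB>Gene_Regulation:T2 Agent:T1 Theme:T3"
            pairs = [p.split(":", 1) for p in rest.split()]
            etype, trigger = pairs[0]
            args = [(role, target) for role, target in pairs[1:]]
            events[ident] = Event(ident, etype, trigger, args)
    return spans, events


# Hypothetical sentence and annotations, invented for illustration only.
TEXT = "NarL activates transcription of narG."
SAMPLE = [
    "T1\tGene 0 4\tNarL",
    "T2\tGene_Regulation 5 14\tactivates",
    "T3\tGene 32 36\tnarG",
    "E1\tGene_Regulation:T2 Agent:T1 Theme:T3",
]

spans, events = parse_standoff(SAMPLE)
for event in events.values():
    print(event.type, "trigger:", spans[event.trigger].text)
    for role, target in event.args:
        print("  ", role, "->", spans[target].text)
```

Keeping spans and events in separate dictionaries keyed by ID mirrors how standoff annotation refers back to the text by character offsets rather than embedding markup in it.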

URL: http://www.nactem.ac.uk/GREC/

Resource ID: nif-0000-06688     Resource Type: Resource     Version: Latest Version


annotation, information extraction, text mining, semantic role, semantic search, gene, computational linguistics, gene regulation

Listed By

FORCE11, Beyond the pdf


human, escherichia coli

Gene Regulation Event Corpus

Additional Resource Types

Training Set


Creative Commons Attribution-NonCommercial-ShareAlike License, v3 Unported. For copyright of the abstracts, refer to PubMed.



Submitted On

12:00am February 18, 2011

Construction of an annotated corpus to support biomedical information extraction.

  • Thompson P
  • BMC Bioinformatics
  • 2009 10

BACKGROUND: Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.

RESULTS: We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66%-90%.

CONCLUSION: The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.