Determining similarity of scientific entities in annotation datasets.

Database : the journal of biological databases and curation | 2015

Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
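The abstract describes AnnSim as a 1-1 maximum weight bipartite match between the annotation sets of two entities. The sketch below illustrates that idea on toy data. Several parts are assumptions and do not come from the abstract: the normalization (twice the matched weight divided by the total number of annotations), the hypothetical term identifiers `t1` to `t4`, the `toy_sim` scores, and the brute-force search, which stands in for the efficient solvers the authors mention and is only practical for very small sets.

```python
from itertools import permutations


def max_weight_matching(left, right, sim):
    """Brute-force 1-1 maximum weight bipartite matching.

    Illustrative only: it enumerates every assignment, so it is
    exponential in the set sizes. Assumes sim() returns values >= 0.
    """
    # Iterate over the smaller side so each of its items gets one partner.
    swapped = len(left) > len(right)
    if swapped:
        left, right = right, left
    best_score, best_pairs = 0.0, []
    for perm in permutations(right, len(left)):
        pairs = list(zip(left, perm))
        if swapped:
            # Restore (original left, original right) order for sim().
            pairs = [(b, a) for a, b in pairs]
        score = sum(sim(a, b) for a, b in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


def annsim(annotations_1, annotations_2, sim):
    """Annotation-set similarity from the best 1-1 matching.

    The normalization 2 * weight / (|A1| + |A2|) is an assumption
    made for this sketch; the abstract does not state it.
    """
    a1, a2 = list(annotations_1), list(annotations_2)
    if not a1 and not a2:
        return 0.0
    score, _ = max_weight_matching(a1, a2, sim)
    return 2.0 * score / (len(a1) + len(a2))


# Hypothetical pairwise term similarities (not from any real ontology).
TOY_SCORES = {
    frozenset(("t1", "t3")): 0.5,
    frozenset(("t1", "t4")): 0.2,
}


def toy_sim(a, b):
    """Toy term similarity: 1.0 for identical terms, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


if __name__ == "__main__":
    drug_a = ["t1", "t2"]        # annotations of a first entity
    drug_b = ["t2", "t3", "t4"]  # annotations of a second entity
    print(annsim(drug_a, drug_b, toy_sim))
```

With these toy scores the best matching pairs `t2` with `t2` and `t1` with `t3`, so unmatched annotations in the larger set lower the final score. For realistic set sizes a polynomial-time assignment solver would replace the enumeration.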

PubMed ID: 25725057

Publication data is provided by the National Library of Medicine® and PubMed®. Data is retrieved from PubMed® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


Weka (tool)

RRID:SCR_001214

A collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.


UniProt (tool)

RRID:SCR_002380

Collection of protein sequence and functional information. Resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), UniProt Archive (UniParc), and UniProt Proteomes. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics, and the Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.


DrugBank (tool)

RRID:SCR_002700

Bioinformatics and cheminformatics database that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.


SGD (tool)

RRID:SCR_004694

A curated database that provides comprehensive integrated biological information for Saccharomyces cerevisiae along with search and analysis tools to explore these data. SGD allows researchers to discover functional relationships between sequence and gene products in fungi and higher organisms. The SGD also maintains the S. cerevisiae Gene Name Registry, a complete list of all gene names used in S. cerevisiae which includes a set of general guidelines to gene naming. Protein Page provides basic protein information calculated from the predicted sequence and contains links to a variety of secondary structure and tertiary structure resources. Yeast Biochemical Pathways allows users to view and search for biochemical reactions and pathways that occur in S. cerevisiae as well as map expression data onto the biochemical pathways. Literature citations are provided where available.


PubMed (tool)

RRID:SCR_004846

Public bibliographic database that provides access to citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. PubMed citations and abstracts include fields of biomedicine and health, covering portions of life sciences, behavioral sciences, chemical sciences, and bioengineering. Provides access to additional relevant web sites and links to other NCBI molecular biology resources. Publishers of journals can submit their citations to NCBI and then provide access to full-text of articles at journal web sites using LinkOut.


FASTA (tool)

RRID:SCR_011819

Software package for DNA and protein sequence alignment that finds regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence.


Europe PubMed Central (tool)

RRID:SCR_005901

Free access to biomedical literature resources, including all of PubMed and PubMed Central, agricultural abstracts (from AGRICOLA), over 4 million international life science patent abstracts, and National Health Service (NHS) clinical guidelines, supplemented with Chinese Biological Abstracts and the Citeseer database. As well as powerful search of abstracts and full-text articles, it also includes:

* Article citations and sort order based on citation count.
* Data citations mined from full-text articles.
* Links to and from related databases and institutional repositories.
* A tool to create bibliographies linked to your ORCID.
* Named entity recognition of keywords and text-mining-based applications showcased in Europe PMC Labs.
* Tools for recipients of grants from one of the Europe PMC funders to deposit full-text manuscripts and link them to those specific grants.
* Web services for programmatic access to all the above bibliographic information and 50,000 grants.
* Search by publication date, relevance, or the number of times an article has been cited.
* Links to public databases such as UniProt, Protein Data Bank (PDBe), and the European Nucleotide Archive (ENA).
* Text-mining technologies that let you highlight and browse keywords such as gene names, organisms and diseases.
* Search of 40,000 biomedical research grants awarded to the 18,000 PIs supported by the Europe PMC funders.
* New tools based on Europe PMC content, available to road-test in Europe PMC Labs.
* Europe PMC plus, where PIs supported by the Europe PMC funders can link grants to publication information, view article citation and download statistics, and submit manuscripts.


KEGG (tool)

RRID:SCR_012773

Integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information. In particular, gene catalogs in completely sequenced genomes are linked to higher-level systemic functions of cell, organism, and ecosystem. Analysis tools are also available. KEGG may be used as reference knowledge base for biological interpretation of large-scale datasets generated by sequencing and other high-throughput experimental technologies.


LINCS Connectivity Map (tool)

RRID:SCR_002639

A catalog of gene-expression data collected from human cells treated with chemical compounds and genetic reagents. Computational methods to reduce the number of necessary genomic measurements along with streamlined methodologies enable the current effort to significantly increase the size of the CMap database and along with it, our potential to connect human diseases with the genes that underlie them and the drugs that treat them. The NIH has funded a large expansion of the Connectivity Map dataset through the Library of Integrated Network-based Cellular Signatures (LINCS). The Broad Institute's LINCS center aims to create a first installment of data generation and analysis for the LINCS program. Through these data LINCS intends to accelerate the discovery process by systematically revealing connections between genes/compounds discovered in screens and molecular pathways that underlie disease states.


SIDER (tool)

RRID:SCR_004321

Database containing information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information includes side effect frequency, drug and side effect classifications, and links to further information, for example drug-target relations. The SIDER Side Effect Resource represents an effort to aggregate dispersed public information on side effects. To our knowledge, no such resource exists in machine-readable form despite the importance of research on drugs and their effects. The creation of this resource was motivated by the many requests for data that we received related to our paper (Campillos, Kuhn et al., Science, 2008, 321(5886):263-6.) on the utilization of side effects for drug target prediction. Inclusion of side effects as readouts for drug treatment should have many applications and we hope to be able to enhance the respective research with this resource. You may browse the drugs by name, browse the side effects by name, download the current version of SIDER, or use the search interface.


Pfam (tool)

RRID:SCR_004726

A database of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Users can analyze protein sequences for Pfam matches, view Pfam family annotation and alignments, see groups of related families, look at the domain organization of a protein sequence, find the domains on a PDB structure, and query Pfam by keywords. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families that may automatically generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans (collections of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM).


MeSH (tool)

RRID:SCR_004750

A controlled vocabulary thesaurus that consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH, in machine-readable form, is provided at no charge via electronic means. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as Anatomy or Mental Disorders. More specific headings are found at more narrow levels of the twelve-level hierarchy, such as Ankle and Conduct Disorder. There are 27,149 descriptors in 2014 MeSH. There are also over 218,000 entry terms that assist in finding the most appropriate MeSH Heading, for example, Vitamin C is an entry term to Ascorbic Acid. In addition to these headings, there are more than 219,000 headings called Supplementary Concept Records (formerly Supplementary Chemical Records) within a separate thesaurus. The MeSH thesaurus is used by NLM for indexing articles from 5,400 of the world's leading biomedical journals for the MEDLINE/PubMed database. It is also used for the NLM-produced database that includes cataloging of books, documents, and audiovisuals acquired by the Library. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.


Cluster (tool)

RRID:SCR_013505

R software package providing methods for cluster analysis. Performs a variety of types of cluster analysis and other types of processing on large microarray datasets.
