Building a community-driven bioinformatics platform to facilitate Cannabis sativa multi-omics research.

Locedie Mansueto | Tobias Kretzschmar | Ramil Mauleon | Graham J King
GigaByte (Hong Kong, China) | 2024

Global changes in cannabis legislation after decades of stringent regulation and heightened demand for its industrial and medicinal applications have spurred recent genetic and genomics research. An international research community emerged and identified the need for a web portal to host cannabis-specific datasets that seamlessly integrates multiple data sources and serves omics-type analyses, fostering information sharing. The Tripal platform was used to host public genome assemblies, gene annotations, quantitative trait loci and genetic maps, gene and protein expression data, metabolic profiles and their sample attributes. Single nucleotide polymorphisms were called using public resequencing datasets on three genomes. Additional applications, such as SNP-Seek and MapManJS, were embedded into Tripal. A multi-omics data integration web-service Application Programming Interface (API), developed on top of existing Tripal modules, returns generic tables of samples, properties and values. Use cases demonstrate the API's utility for various omics analyses, enabling researchers to perform multi-omics analyses efficiently.
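The abstract describes the API as returning generic tables of samples, properties and values. As a minimal sketch of how such a long-format table might be reshaped for downstream analysis, the Python below pivots (sample, property, value) rows into a sample-by-property matrix. The row contents, the function names `pivot_long_table` and `to_matrix`, and the use of `None` for missing values are all illustrative assumptions, not part of the published API.

```python
from collections import defaultdict

# Hypothetical rows in the (sample, property, value) shape the abstract describes.
# Sample names, property names, and values are invented for illustration only.
rows = [
    ("sample_A", "gene_1_expression", 12.5),
    ("sample_A", "metabolite_x", 0.8),
    ("sample_B", "gene_1_expression", 7.1),
    ("sample_B", "metabolite_x", 1.4),
    ("sample_B", "metabolite_y", 0.3),
]


def pivot_long_table(rows):
    """Turn (sample, property, value) rows into {sample: {property: value}}."""
    wide = defaultdict(dict)
    for sample, prop, value in rows:
        wide[sample][prop] = value
    return dict(wide)


def to_matrix(wide, missing=None):
    """Return (samples, properties, matrix) with sorted, stable row and column order."""
    samples = sorted(wide)
    properties = sorted({p for props in wide.values() for p in props})
    # A property absent for a sample is filled with the `missing` placeholder.
    matrix = [[wide[s].get(p, missing) for p in properties] for s in samples]
    return samples, properties, matrix


wide = pivot_long_table(rows)
samples, properties, matrix = to_matrix(wide)
```

Keeping the table in long format until the last step means properties from different omics layers (expression, metabolite, trait) can share one schema; the pivot is only needed when an analysis expects one row per sample.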

PubMed ID: 39469541

Antibodies used in this publication

None found

Associated grants

None

Publication data is provided by the National Library of Medicine ® and PubMed ®. Data is retrieved from PubMed ® on a weekly schedule. For terms and conditions see the National Library of Medicine Terms and Conditions.

This is a list of tools and resources that we have found mentioned in this publication.


Gene Ontology (tool)

RRID:SCR_002811

Computable knowledge regarding functions of genes and gene products. GO resources include biomedical ontologies that cover molecular domains of all life forms as well as extensive compilations of gene product annotations to these ontologies that provide largely species-neutral, comprehensive statements about what gene products do. Used to standardize representation of gene and gene product attributes across species and databases.

MCScanX (tool)

RRID:SCR_022067

Software toolkit for detection and evolutionary analysis of gene synteny and collinearity.

Crop Ontology (data or information resource)

RRID:SCR_010299

Ontology that includes crop-specific trait ontologies for several economically important plants such as rice, wheat, maize, potato, Musa, chickpea and sorghum, along with other important domains for crop research such as germplasm, passport, trait measurement scales and experimental design factors.

Pandas (software resource)

RRID:SCR_018214

Python package for data analysis that provides labeled data structures similar to R data frames. Its data structures are designed to make working with relational or labeled data easy and intuitive. Intended as a building block for practical, real-world, open source data analysis and manipulation in Python.

Weighted Gene Co-expression Network Analysis (software resource)

RRID:SCR_003302

Software R package for weighted correlation network analysis. WGCNA is also available as point-and-click application. Unfortunately this application is not maintained anymore. It is known to have compatibility problems with R-2.8.x and newer, and the methods it implements are not all state of the art.

Jupyter Notebook (web application)

RRID:SCR_018315

Open source web application to create and share documents that contain live code, equations, visualizations and narrative text. Used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning.

ComBat (software resource)

RRID:SCR_010974

Software for adjusting batch effects in microarray expression data using empirical Bayes methods.

PLINK (software resource)

RRID:SCR_001757

Open source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. Used for analysis of genotype/phenotype data. Through integration with gPLINK and Haploview, there is some support for subsequent visualization, annotation and storage of results. PLINK 1.9 is the improved second generation of the software.

BLASTN (data analysis service)

RRID:SCR_001598

Web application to search nucleotide databases using a nucleotide query. Algorithms: blastn, megablast, discontiguous megablast.

MapMan (software resource)

RRID:SCR_003543

Software tool that displays large genomics datasets (e.g. gene expression data from Arabidopsis Affymetrix arrays) onto diagrams of metabolic pathways or other biological processes.

SAMtools/BCFtools (software resource)

RRID:SCR_005227

Provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Salmon (software resource)

RRID:SCR_017036

Software tool for quantifying expression of transcripts using RNA-seq data. Provides fast and bias-aware quantification of transcript expression. Transcriptome-wide quantifier to correct for fragment GC-content bias.

GMAP (software resource)

RRID:SCR_008992

THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. A software program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing.

InterProScan (software resource)

RRID:SCR_005829

Software package for functional analysis of sequences by classifying them into families and predicting presence of domains and sites. Scans sequences against InterPro's signatures. Characterizes nucleotide or protein function by matching it with models from several different databases. Used in large scale analysis of whole proteomes, genomes and metagenomes. Available as Web based version and standalone Perl version and SOAP Web Service.

rnaSPAdes (software resource)

RRID:SCR_016992

Software tool for assembling transcripts from RNA-Seq data. Explores surprising computational parallels between assembly of transcriptomes and single cell genomes. Suitable for all kinds of organisms. Part of the SPAdes package since version 3.9.

TransDecoder (data processing software)

RRID:SCR_017647

Software tool to identify candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed from RNA-Seq alignments to a genome using Tophat and Cufflinks. Starts from a FASTA or GFF file. Can scan and retain open reading frames (ORFs) with homology to known proteins by using BlastP or Pfam searches and incorporate the results into the final selection. Predictions can then be visualized in a genome browser such as IGV.

MMseqs2 (software resource)

RRID:SCR_022962

Software suite for ultra fast and sensitive sequence search and clustering. Used to search and cluster huge protein and nucleotide sequence sets. Designed to run on multiple cores and servers.

GATK (software resource)

RRID:SCR_001876

A software package to analyze next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as a strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. It is both a software library that makes it easy to write efficient analysis tools for next-generation sequencing data and a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner. (entry from Genetic Analysis Software)

gffread (data processing software)

RRID:SCR_018965

Open source software tool to manipulate files in GFF format. Used to convert, sort, filter, transform, or cluster genomic features.

InterPro (software resource)

RRID:SCR_006695

Service providing functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalizing on their individual strengths to produce a powerful integrated database and diagnostic tool. This integrated database of predictive protein signatures is used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures. The data can be accessed programmatically via Web Services. The member databases use a number of approaches: ProDom provides sequence clusters built from UniProtKB using PSI-BLAST; PROSITE patterns provide simple regular expressions; PROSITE and HAMAP profiles provide sequence matrices; PRINTS provides fingerprints, which are groups of aligned, unweighted Position Specific Sequence Matrices (PSSMs); and PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY provide hidden Markov models (HMMs). Contributions are welcome: users are encouraged to use the "Add your annotation" button on InterPro entry pages to suggest updated or improved annotation for individual InterPro entries.

CannSeek Database of Cannabis sativa SNPs (database)

RRID:SCR_025579

Database of Cannabis sativa single nucleotide polymorphisms discovered from NGS sequences and BioSamples available in NCBI. Variants were called from sequence reads against the cs10, Purple Kush, and Finola reference assemblies using GATK and Parabricks pipelines.

Trimmomatic (data processing software)

RRID:SCR_011848

Software Java pipeline for trimming tasks for Illumina paired end and single ended data. Flexible Trimmer for Illumina Sequence Data. Pair aware preprocessing tool optimized for Illumina next generation sequencing data. Includes several processing steps for read trimming and filtering. Operating systems Unix/Linux, Mac OS, Windows.

Gene Expression Omnibus (GEO) (data repository)

RRID:SCR_005012

Functional genomics data repository supporting MIAME-compliant data submissions. Includes microarray-based experiments measuring the abundance of mRNA, genomic DNA, and protein molecules, as well as non-array-based technologies such as serial analysis of gene expression (SAGE) and mass spectrometry proteomic technology. Array- and sequence-based data are accepted. Collection of curated gene expression DataSets, as well as original Series and Platform records. The database can be searched using keywords, organism, DataSet type and authors. DataSet records contain additional resources including cluster tools and differential expression queries.

OBO (data or information resource)

RRID:SCR_007083

A collaboration involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain. In addition to a listing of OBO ontologies, this site provides a statement of the OBO Foundry principles, discussion fora, technical infrastructure, and other services to facilitate ontology development. Feedback is welcome and participation encouraged.

UniProt (data or information resource)

RRID:SCR_002380

Collection of data of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc), UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.

RefSeq (data or information resource)

RRID:SCR_003496

Collection of curated, non-redundant genomic DNA, transcript RNA, and protein sequences produced by NCBI. Provides a reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. Accessed through the Nucleotide and Protein databases.

NCBI BLAST (software resource)

RRID:SCR_004870

Web search tool to find regions of similarity between biological sequences. Program compares nucleotide or protein sequences to sequence databases and calculates statistical significance. Used for identifying homologous sequences.

InterMine (software resource)

RRID:SCR_001772

An open source data warehouse system built for the integration and analysis of complex biological data that enables the creation of biological databases accessed by sophisticated web query tools. Parsers are provided for integrating data from many common biological data sources and formats, and there is a framework for adding data. InterMine includes a user-friendly web interface that works "out of the box" and can be easily customized for specific needs, as well as a powerful, scriptable web-service API to allow programmatic access to data.

Chado (software resource)

RRID:SCR_024073

Relational database schema that underlies many GMOD installations. It is capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny. It has been designed to handle complex representations of biological knowledge and should be considered one of the most sophisticated relational schemas currently available in molecular biology. The price of this capability is that the new user must spend some time becoming familiar with its fundamentals.

JBrowse (software resource)

RRID:SCR_001004

A high-performance visualization tool for interactive exploration of large, integrated genomic datasets written primarily in JavaScript. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

MetaCyc (data or information resource)

RRID:SCR_007778

MetaCyc is a database of nonredundant, experimentally elucidated metabolic pathways. MetaCyc contains more than 1,200 pathways from more than 1,600 different organisms, and is curated from the scientific experimental literature. MetaCyc contains pathways involved in both primary and secondary metabolism, as well as associated compounds, enzymes, and genes.

NCBI Sequence Read Archive (SRA) (data repository)

RRID:SCR_004891

Repository of raw sequencing data from next-generation sequencing platforms including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Data submissions are welcome. Archive of high-throughput sequencing data, part of the international partnership of archives (INSDC) at NCBI, the European Bioinformatics Institute and the DNA Database of Japan. Data submitted to any of these three organizations are shared among them.
