Searching across hundreds of databases


This service searches only literature that cites research resources. Note that the set of searchable documents is limited to papers containing RRIDs; it does not cover all open-access literature.


Page 1: showing papers 1-20 of 3,453.

Data compression for sequencing data.

  • Sebastian Deorowicz et al.
  • Algorithms for Molecular Biology
  • 2013

Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question "why compression" in a quantitative manner. Then we also answer the questions "what" and "how", by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question "why compression" and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
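
One of the fundamental compression ideas the review sketches fits in a few lines: DNA's four-letter alphabet needs only two bits per base, so packing four bases into each byte gives a 4x reduction over 8-bit ASCII before any entropy coding is applied. A minimal illustrative sketch in Python (a generic baseline, not any specific tool from the review):

    # Pack each base into 2 bits, four bases per byte (illustrative baseline).
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASE = "ACGT"

    def pack(seq: str) -> bytes:
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for j, b in enumerate(seq[i:i + 4]):
                byte |= CODE[b] << (2 * j)
            out.append(byte)
        return bytes(out)

    def unpack(data: bytes, length: int) -> str:
        return "".join(BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
                       for i in range(length))

    seq = "ACGTACGTGGCA"
    assert unpack(pack(seq), len(seq)) == seq  # lossless round trip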


Optical data compression in time stretch imaging.

  • Claire Lifan Chen et al.
  • PLoS ONE
  • 2015

Time stretch imaging offers real-time image acquisition at millions of frames per second and subnanosecond shutter speed, and has enabled detection of rare cancer cells in blood with record throughput and specificity. An unintended consequence of high throughput image acquisition is the massive amount of digital data generated by the instrument. Here we report the first experimental demonstration of real-time optical image compression applied to time stretch imaging. By exploiting the sparsity of the image, we reduce the number of samples and the amount of data generated by the time stretch camera in our proof-of-concept experiments by about three times. Optical data compression addresses the big data predicament in such systems.
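
The compression in this work happens in the optical domain, but the principle it exploits (a sparse signal can be represented by far fewer samples) is easy to illustrate in software. A hedged sketch with purely illustrative sizes, making no attempt to model the time stretch hardware itself:

    import numpy as np

    # Illustrative software analogue of sparsity-driven data reduction: when
    # most samples carry no signal, storing only the significant ones as
    # (index, value) pairs cuts the data volume yet stays exactly recoverable.
    rng = np.random.default_rng(0)
    signal = np.zeros(3000)
    support = rng.choice(signal.size, size=300, replace=False)  # ~10% occupied
    signal[support] = rng.normal(size=support.size)

    mask = np.abs(signal) > 0
    compressed = (np.flatnonzero(mask), signal[mask])  # indices and values

    reconstructed = np.zeros_like(signal)
    reconstructed[compressed[0]] = compressed[1]
    assert np.allclose(reconstructed, signal)
    print(f"kept {mask.sum()} of {signal.size} samples")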


Compression of structured high-throughput sequencing data.

  • Fabien Campagne et al.
  • PLoS ONE
  • 2013

Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state of the art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.
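
One reason structure-aware storage can beat generic containers is easy to show on alignment start positions: they are sorted, so storing successive differences as variable-length integers turns large coordinates into tiny, highly compressible values. A generic sketch of that principle (not Goby's actual on-disk format):

    import zlib

    # Delta-encode sorted alignment start positions, then varint-pack them.
    positions = [14203, 14210, 14231, 14231, 14298, 15002, 15044]
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

    def varint(n: int) -> bytes:
        """LEB128-style encoding: 7 data bits per byte plus a continuation bit."""
        out = bytearray()
        while True:
            out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
            n >>= 7
            if not n:
                return bytes(out)

    raw = ",".join(map(str, positions)).encode()
    packed = b"".join(varint(d) for d in deltas)
    # The delta+varint stream is typically far smaller, before and after zlib.
    print(len(raw), len(packed), len(zlib.compress(raw)), len(zlib.compress(packed)))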


Compression of FASTQ and SAM format sequencing data.

  • James K Bonfield et al.
  • PLoS ONE
  • 2013

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.
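
A recurring idea behind these FASTQ compressors is stream separation: read names, bases and quality values have very different statistics, so each stream compresses better under a model of its own. A rough sketch using zlib as a stand-in for the specialized coders (the records below are synthetic):

    import zlib

    # Compare compressing an interleaved FASTQ against its separated streams.
    records = [(f"@SRR001.{i}", "ACGTACGTACGT", "IIIIIIIHHHGG") for i in range(500)]

    interleaved = "\n".join(f"{n}\n{s}\n+\n{q}" for n, s, q in records).encode()
    streams = ["\n".join(col).encode() for col in zip(*records)]  # names/bases/quals

    whole = len(zlib.compress(interleaved, 9))
    split = sum(len(zlib.compress(s, 9)) for s in streams)
    print(f"interleaved: {whole} B, separated streams: {split} B")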


MFCompress: a compression tool for FASTA and multi-FASTA data.

  • Armando J Pinho et al.
  • Bioinformatics
  • 2014

The data deluge phenomenon is becoming a serious problem in most genomic centers. To alleviate it, general-purpose tools, such as gzip, are used to compress the data. However, although pervasive and easy to use, these tools fall short when the intention is to reduce the data as much as possible, for example, for medium- and long-term storage. A number of algorithms have been proposed for the compression of genomics data, but unfortunately only a few of them have been made available as usable and reliable compression tools.


Dynamic CT perfusion image data compression for efficient parallel processing.

  • Renan Sales Barros et al.
  • Medical & Biological Engineering & Computing
  • 2016

The increasing size of medical imaging data, in particular time series such as CT perfusion (CTP), requires new and fast approaches to deliver timely results for acute care. Cloud architectures based on graphics processing units (GPUs) can provide the processing capacity required for delivering fast results. However, the size of CTP datasets makes transfers to cloud infrastructures time-consuming and therefore not suitable in acute situations. To reduce this transfer time, this work proposes a fast and lossless compression algorithm for CTP data. The algorithm exploits redundancies in the temporal dimension and keeps random read-only access to the image elements directly from the compressed data on the GPU. To the best of our knowledge, this is the first work to present a GPU-ready method for medical image compression with random access to the image elements from the compressed data.
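
The two properties the abstract highlights, exploiting temporal redundancy and reading image elements without decompressing the whole series, can be sketched with fixed-width frame deltas. This only illustrates the principle; it is not the paper's actual codec:

    import numpy as np

    # CT perfusion voxels change slowly between frames, so frame-to-frame
    # differences fit in a narrower type: one int16 base frame plus int8 deltas.
    rng = np.random.default_rng(1)
    base = rng.integers(0, 4096, size=(64, 64), dtype=np.int16)       # frame 0
    deltas = rng.integers(-20, 21, size=(30, 64, 64), dtype=np.int8)  # frames 1..30

    def frame(t: int) -> np.ndarray:
        """Read frame t directly from the compact representation."""
        if t == 0:
            return base
        return base + deltas[:t].sum(axis=0, dtype=np.int16)

    full = np.concatenate([base[None],
                           base + np.cumsum(deltas, axis=0, dtype=np.int16)])
    assert np.array_equal(frame(7), full[7])
    print(f"int16 series: {full.nbytes} B, base + int8 deltas: "
          f"{base.nbytes + deltas.nbytes} B")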


Data structures and compression algorithms for high-throughput sequencing technologies.

  • Kenny Daily et al.
  • BMC Bioinformatics
  • 2010

High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.


GReEn: a tool for efficient compression of genome resequencing data.

  • Armando J Pinho et al.
  • Nucleic Acids Research
  • 2012

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS, namely, the possibility of compressing sequences that cannot be handled by GRS, faster running times and compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.
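
The core idea of reference-based compression is compact enough to sketch: store only the positions where the resequenced genome differs from the reference. GReEn itself is more sophisticated (a copy model driving an arithmetic coder), and real tools must also handle insertions and deletions; this toy version assumes equal-length sequences:

    # Toy reference-based encoder: keep only (position, base) mismatches.
    def encode(reference: str, target: str) -> list[tuple[int, str]]:
        return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

    def decode(reference: str, diffs: list[tuple[int, str]]) -> str:
        seq = list(reference)
        for i, base in diffs:
            seq[i] = base
        return "".join(seq)

    ref = "ACGTACGTACGTACGT"
    tgt = "ACGTACCTACGTACGA"
    diffs = encode(ref, tgt)
    assert decode(ref, diffs) == tgt
    print(diffs)  # [(6, 'C'), (15, 'A')] -- two mismatches instead of 16 bases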


A novel compression tool for efficient storage of genome resequencing data.

  • Congmao Wang et al.
  • Nucleic Acids Research
  • 2011

With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as requiring a reference single-nucleotide polymorphism (SNP) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. When tested on sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.


Efficient genotype compression and analysis of large genetic-variation data sets.

  • Ryan M Layer et al.
  • Nature Methods
  • 2016

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.
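
The bitmap-index idea behind genotype query engines of this kind can be sketched in a few lines: one bitset per variant lets a cohort query run as a bitwise AND plus a popcount. The data below is made up, and GQT's actual index uses compressed word-aligned bitsets rather than Python integers:

    # One bitset per variant: bit i is set if sample i carries a non-reference
    # allele. A phenotype mask turns cohort queries into AND + popcount.
    variants = {
        "rs001": 0b10110100,
        "rs002": 0b10010110,
        "rs003": 0b00110101,
    }
    cases = 0b10110000  # hypothetical mask: bit i set = sample i is a case

    for vid, carriers in variants.items():
        # int.bit_count() requires Python 3.10+
        print(vid, (carriers & cases).bit_count(), "case carriers")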


Performance evaluation of lossy quality compression algorithms for RNA-seq data.

  • Rongshan Yu et al.
  • BMC Bioinformatics
  • 2020

Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear.
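
The most widely used form of lossy quality compression evaluated in benchmarks like this one is binning: mapping each quality score to a representative value raises the stream's compressibility while aiming to leave downstream calls stable. A minimal sketch in the spirit of the Illumina 8-level scheme (the bin edges below are illustrative):

    # Map each Phred quality score to its bin's representative value.
    BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
            (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

    def quantize(q: int) -> int:
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                return rep
        raise ValueError(f"quality out of range: {q}")

    quals = [38, 12, 40, 7, 33, 25]
    print([quantize(q) for q in quals])  # [37, 15, 40, 6, 33, 27]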


NGC: lossless and lossy compression of aligned high-throughput sequencing data.

  • Niko Popitsch et al.
  • Nucleic Acids Research
  • 2013

A major challenge of current high-throughput sequencing experiments is not only the generation of the sequencing data itself but also their processing, storage and transmission. The enormous size of these data motivates the development of data compression algorithms usable for the implementation of the various storage policies that are applied to the produced intermediate and final result files. In this article, we present NGC, a tool for the compression of mapped short read data stored in the widespread SAM format. NGC enables lossless and lossy compression and introduces the following two novel ideas: first, we present a way to reduce the number of required code words by exploiting common features of reads mapped to the same genomic positions; second, we present a highly configurable way for the quantization of per-base quality values, which takes their influence on downstream analyses into account. NGC, evaluated with several real-world data sets, saves 33-66% of disc space using lossless and up to 98% disc space using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we could show that the lossy compression modes preserve >99% of all called variants while outperforming comparable methods in some configurations.


Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures.

  • Kelly Patrick Stanton et al.
  • Nucleic Acids Research
  • 2013

Researchers generating new genome-wide data in an exploratory sequencing study can gain biological insights by comparing their data with well-annotated data sets possessing similar genomic patterns. Data compression techniques are needed for efficient comparisons of a new genomic experiment with large repositories of publicly available profiles. Furthermore, data representations that allow comparisons of genomic signals from different platforms and across species enhance our ability to leverage these large repositories. Here, we present a signal processing approach that characterizes protein-chromatin interaction patterns at length scales of several kilobases. This allows us to efficiently compare numerous chromatin-immunoprecipitation sequencing (ChIP-seq) data sets consisting of many types of DNA-binding proteins collected from a variety of cells, conditions and organisms. Importantly, these interaction patterns broadly reflect the biological properties of the binding events. To generate these profiles, termed Arpeggio profiles, we applied harmonic deconvolution techniques to the autocorrelation profiles of the ChIP-seq signals. We used 806 publicly available ChIP-seq experiments and showed that Arpeggio profiles with similar spectral densities shared biological properties. Arpeggio profiles of ChIP-seq data sets revealed characteristics that are not easily detected by standard peak finders. They also allowed us to relate sequencing data sets from different genomes, experimental platforms and protocols. Arpeggio is freely available at http://sourceforge.net/p/arpeggio/wiki/Home/.
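
The signal-processing core described above, autocorrelation of a binned coverage track followed by spectral analysis, can be sketched on synthetic data. This simplified version recovers a planted periodicity and is not the Arpeggio pipeline itself:

    import numpy as np

    # Synthetic coverage track with a 200-bin periodic pattern plus noise.
    rng = np.random.default_rng(2)
    bins = np.arange(4096)
    coverage = 5 + 3 * np.sin(2 * np.pi * bins / 200) + rng.normal(0, 1, bins.size)

    x = coverage - coverage.mean()
    autocorr = np.correlate(x, x, mode="full")[x.size - 1:]  # lags 0..N-1
    autocorr /= autocorr[0]

    spectrum = np.abs(np.fft.rfft(autocorr)) ** 2  # spectral density
    peak = np.argmax(spectrum[1:]) + 1             # skip the DC component
    print(f"dominant period ~ {autocorr.size / peak:.0f} bins")  # close to 200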


NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search.

  • Oluwafemi A Sarumi et al.
  • Computational and Structural Biotechnology Journal
  • 2024

The availability of high throughput sequencing tools coupled with the declining costs in the production of DNA sequences has led to the generation of enormous amounts of omics data curated in several databases such as NCBI and EMBL. Identification of similar DNA sequences from these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, phylogenetic studies of evolutionary relationships among several biological entities, or detection of pathogens. Improving DNA similarity search is of utmost importance because of the increased complexity of the ever-growing repositories of sequences. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. It turned out that neural networks generate embeddings that capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
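
Chaos Game Representation, one of the embeddings compared in the study, is compact enough to sketch in full: starting from the center of the unit square, each base moves the point halfway toward that base's corner, turning a sequence into a fixed-range numeric trajectory whose occupancy grid reflects k-mer composition. A minimal implementation (the corner assignment is one common convention):

    # Chaos Game Representation of a DNA sequence as points in the unit square.
    CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

    def cgr(seq: str) -> list[tuple[float, float]]:
        x, y = 0.5, 0.5
        points = []
        for base in seq:
            cx, cy = CORNERS[base]
            x, y = (x + cx) / 2, (y + cy) / 2
            points.append((x, y))
        return points

    print(cgr("ACGT")[-1])  # the final point encodes the sequence's suffix structure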


Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

  • Kelvin V Kredens et al.
  • PLoS ONE
  • 2020

The recent decrease in the cost and time needed to sequence and assemble complete genomes has created an increased demand for data storage. As a consequence, several strategies for compressing assembled biological data were created. Vertical compression tools implement strategies that take advantage of the high level of similarity between multiple assembled genomic sequences for better compression results. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of preprocessing, transformation, and data-encoding phases. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: to present a lossless vertical compression tool; to use the information contained in other sequences for the compression; to be able to manipulate genomic sequences in FASTA format; and to require no prior knowledge. Although we extracted compression performance results, they were not compared, as the tools did not use a standardized evaluation protocol. We therefore conclude that the field lacks a defined evaluation protocol that every tool should apply.


A data reduction and compression description for high throughput time-resolved electron microscopy.

  • Abhik Datta et al.
  • Nature Communications
  • 2021

Fast, direct electron detectors have significantly improved the spatio-temporal resolution of electron microscopy movies. Preserving both spatial and temporal resolution in extended observations, however, requires storing prohibitively large amounts of data. Here, we describe an efficient and flexible data reduction and compression scheme (ReCoDe) that retains both spatial and temporal resolution by preserving individual electron events. Running ReCoDe on a workstation we demonstrate on-the-fly reduction and compression of raw data streaming off a detector at 3 GB/s, for hours of uninterrupted data collection. The output was 100-fold smaller than the raw data and saved directly onto network-attached storage drives over a 10 GbE connection. We discuss calibration techniques that support electron detection and counting (e.g., estimate electron backscattering rates, false positive rates, and data compressibility), and novel data analysis methods enabled by ReCoDe (e.g., recalibration of data post acquisition, and accurate estimation of coincidence loss).
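
Event-preserving reduction of this kind rests on the sparsity of counting-mode frames: most pixels are empty, so keeping only the coordinates and intensities of hit pixels retains every electron event at a small fraction of the raw size. A toy sketch, not ReCoDe's actual reduction levels or file format:

    import numpy as np

    # Reduce a sparse detector frame to (row, col, intensity) event triples.
    rng = np.random.default_rng(3)
    frame = (rng.poisson(0.001, size=(512, 512)) * 40).astype(np.uint16)

    rows, cols = np.nonzero(frame)
    events = np.column_stack([rows, cols, frame[rows, cols]]).astype(np.uint16)

    restored = np.zeros_like(frame)
    restored[events[:, 0], events[:, 1]] = events[:, 2]
    assert np.array_equal(restored, frame)  # every event preserved
    print(f"raw frame: {frame.nbytes} B, event list: {events.nbytes} B")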


A combined HMM-PCNN model in the contourlet domain for image data compression.

  • Guoan Yang et al.
  • PLoS ONE
  • 2020

Multiscale geometric analysis (MGA) is not only characterized by multi-resolution, time-frequency localization, multidirectionality and anisotropy, but also overcomes the limitations of the wavelet transform in representing high-dimensional singular data such as edges and contours. Therefore, researchers have been exploring new MGA-based image compression standards rather than the JPEG2000 standard. However, due to differences in data structure, redundancy and decorrelation between wavelets and MGA, as well as the complexity of the coding scheme, no definitive research has so far been reported on MGA-based image coding schemes. To address this problem, this paper proposes an image data compression approach using a hidden Markov model (HMM)/pulse-coupled neural network (PCNN) model in the contourlet domain. First, a sparse decomposition of an image was performed using a contourlet transform to obtain the coefficients that show the multiscale and multidirectional characteristics. An HMM was then adopted to establish links between coefficients in neighboring subbands of different levels and directions. An expectation-maximization (EM) algorithm was also adopted in training the HMM in order to estimate the state probability matrix, which maintains the same structure as the contourlet decomposition coefficients. In addition, each state probability can be classified by the PCNN based on the state probability distribution. Experimental results show that the HMM/PCNN-contourlet model proposed in this paper leads to better compression performance and offers a more flexible encoding scheme.


TRCMGene: A two-step referential compression method for the efficient storage of genetic data.

  • You Tang et al.
  • PLoS ONE
  • 2018

The massive quantities of genetic data generated by high-throughput sequencing pose challenges to data storage, transmission and analyses. These problems are effectively solved through data compression, in which the size of data storage is reduced and the speed of data transmission is improved. Several options are available for compressing and storing genetic data. However, most of these options either do not provide sufficient compression rates or require a considerable length of time for decompression and loading.


Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

  • Gaëtan Benoit et al.
  • BMC Bioinformatics
  • 2015

Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip.
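
The probabilistic de Bruijn graph at the heart of this approach is typically a Bloom filter over the reads' k-mers: membership queries are fast and the structure is small, at the cost of a controlled false-positive rate. A toy Bloom filter (sizes and hash choices are illustrative):

    import hashlib

    class Bloom:
        """Toy Bloom filter; real tools tune size and hash count to the data."""
        def __init__(self, size_bits: int = 1 << 20, hashes: int = 3):
            self.size, self.k = size_bits, hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            for i in range(self.k):
                h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
                yield int.from_bytes(h, "big") % self.size

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    k = 31
    read = "ACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    bloom = Bloom()
    for i in range(len(read) - k + 1):
        bloom.add(read[i:i + k])       # insert every k-mer of the read
    print(read[:k] in bloom)           # True: k-mer present in the structure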


Data set for influence of blends of diesel and renewable fuels on compression ignition engine emissions.

  • A S van Niekerk et al.
  • Data in Brief
  • 2020

This data article is based on research investigating the influence of blends of diesel and renewable fuels on compression ignition engine emissions. In this experimental work, a 2.4 L, turbocharged, direct injection compression ignition engine and a water brake dynamometer were used. Different ternary blends were created by mixing diesel, biodiesel and ethanol in accordance with a mixture design of experiments. The homogeneity of each ternary blend was checked qualitatively by observing the samples for 24 hours for visible separation. The engine was run over the WLTP drive cycle for each ternary blend, and the exhaust emissions were recorded using NOVA 7466K and TESTO 350 gas analysers. A factory-standard MAF sensor was used to record the inlet air mass flow, and an aftermarket ECU was used to determine the fuel flow. The ternary blends were prepared using standard laboratory measuring equipment.


  1. SciCrunch.org Resources

    Welcome to the FDI Lab - SciCrunch.org Resources search. From here you can search through a compilation of resources used by FDI Lab - SciCrunch.org and see how data is organized within our community.

  2. Navigation

    You are currently on the Community Resources tab, looking through the categories and sources that FDI Lab - SciCrunch.org has compiled. You can navigate through those categories from here, or switch to a different tab to search through. Each tab gives a different perspective on the data.

  3. Logging in and Registering

    If you have an account on FDI Lab - SciCrunch.org, you can log in from here to get additional features such as Collections, Saved Searches, and Resource management.

  4. Searching

    This is the search term being executed; you can type in anything you want to search for. Some tips to help with searching (illustrative examples follow this list):

    1. Use quotes around phrases you want to match exactly
    2. Combine terms with AND and OR to control how we search between words
    3. Add "-" before a term to exclude results containing it (e.g., Cerebellum -CA1)
    4. Add "+" before a term to require that it appear in the data
    5. Use autocomplete to specify which branch of our semantics you wish to search; this can help refine your search
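
    A few illustrative queries combining these tips (the terms are examples only):

        "data compression"              match the exact phrase
        compression AND FASTQ           require both terms
        compression OR decompression    match either term
        compression -image              exclude results containing "image"
        +RRID compression               require "RRID" in the data
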
  5. Save Your Search

    You can save any searches you perform from here, for quick access later.

  6. Query Expansion

    We recognized your search term and included synonyms and inferred terms alongside it, to help find the data you are looking for.

  7. Collections

    If you are logged into FDI Lab - SciCrunch.org you can add data records to your collections to create custom spreadsheets across multiple sources of data.

  8. Facets

    Here are the facets that you can filter your papers by.

  9. Options

    From here we'll present any options for the literature, such as exporting your current results.

  10. Further Questions

    If you have any further questions please check out our FAQs Page to ask questions and see our tutorials. Click this button to view this tutorial again.

Publications Per Year

(Interactive chart: number of matching publications per year.)