Paper detail

Compression of high throughput sequencing data with probabilistic de Bruijn graph

Motivation: Data volumes generated by next-generation sequencing technolo- gies is now a major concern, both for storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip. Most reference-free tools developed for NGS data compression still use general text compression methods and fail to benefit from algorithms already designed specifically for the analysis of NGS data. The goal of our new method Leon is to achieve compression of DNA sequences of high throughput sequencing data, without the need of a reference genome, with techniques derived from existing assembly principles, that possibly better exploit NGS data redundancy. Results: We propose a novel method, implemented in the software Leon, for compression of DNA sequences issued from high throughput sequencing technologies. This is a lossless method that does not need a reference genome. Instead, a reference is built de novo from the set of reads as a probabilistic de Bruijn Graph, stored in a Bloom filter. Each read is encoded as a path in this graph, storing only an anchoring kmer and a list of bifurcations indicating which path to follow in the graph. This new method will allow to have compressed read files that also already contain its underlying de Bruijn Graph, thus directly re-usable by many tools relying on this structure. Leon achieved encoding of a C. elegans reads set with 0.7 bits/base, outperforming state of the art reference-free methods. Availability: Open source, under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/

preprint2014arXivOpen access

Signal facts

What is known right now

Open access4 authors2 topics

Next steps

Decide what to do with this paper

Use like or dislike for the fast social read. The more specific scholarly feedback stays available below when needed.

Log in to curate

Reading frame

Keep the important context close to the paper

Keep the important signals around this paper in one place: votes, save state, collection context, reviews and the metadata you need before deciding what to do next.

Institutions

Add specific reaction

Move through the context

Research map

Open full explorer

Move through nearby people, institutions, topics and adjacent work without leaving the paper page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Structured reviews

0 review(s)

ContributeLeave structured feedbackUse the review template when you have a concrete strength, concern or method question.Open review form

No structured reviews yet. High-signal critique starts here.

Work discussion

0 comment(s)

DiscussAdd a high-signal commentKeep quick notes, caveats and replication pointers separate from formal reviews.Open comment form

No discussion yet. The first strong comment sets the tone.