Source author record

Sebastian Deorowicz

Sebastian Deorowicz appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms; Computational Engineering, Finance, and Science; Information Theory; math.IT; Quantitative Methods

Catalog footprint

What is connected

5 works

5 topics

3 close collaborators

preprint, 2015, arXiv

FM-index for dummies

The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, the last few years have seen a surge of interest in practical FM-index variants. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2 to 3 times faster than the fastest known ones, at the price of typically using 1.5 to 5 times more space.
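To make the rank primitive mentioned in this abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `RankBitVector`, the 64-bit block size, and the layout (each block stored as a pair of running count and bit word, so one lookup fetches both) are illustrative assumptions that only mimic the general idea of keeping counts next to the bits they describe.

```python
class RankBitVector:
    """Bit vector with rank support (illustrative sketch, not the paper's layout).

    Each block keeps the number of 1 bits before it alongside its own bits,
    so answering a rank query touches a single (count, word) entry.
    """

    BLOCK = 64  # assumed block size in bits

    def __init__(self, bits):
        self.n = len(bits)
        self.blocks = []  # entries are (ones_before_block, block_word)
        ones = 0
        for start in range(0, self.n, self.BLOCK):
            word = 0
            for offset, bit in enumerate(bits[start:start + self.BLOCK]):
                if bit:
                    word |= 1 << offset
            self.blocks.append((ones, word))
            ones += bin(word).count("1")
        self.total = ones

    def rank1(self, i):
        """Return the number of 1 bits in positions [0, i)."""
        if not 0 <= i <= self.n:
            raise IndexError("rank position out of range")
        block, offset = divmod(i, self.BLOCK)
        if offset == 0:
            # Position falls on a block boundary (possibly the very end).
            return self.blocks[block][0] if block < len(self.blocks) else self.total
        ones_before, word = self.blocks[block]
        mask = (1 << offset) - 1  # keep only the bits below position `offset`
        return ones_before + bin(word & mask).count("1")
```

In an FM-index, backward search repeatedly calls a rank operation of this kind on the Burrows-Wheeler transform, which is why its memory access pattern matters for search speed.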

preprint, 2014, arXiv

Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012); the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage. Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to obtain a compression ratio of 0.317 bits per base, which fits the 134.0 Gb dataset into only 5.31 GB of space. Availability: http://sun.aei.polsl.pl/orcom under a free license.
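The minimizer idea in this abstract can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the minimizer here is just the lexicographically smallest k-mer of a read, and the function names `minimizer` and `bucket_by_minimizer` are hypothetical. ORCOM's actual minimizer definition and its disk-based bin handling are not reproduced.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of `read` (simplified definition)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads by minimizer so that overlapping reads tend to share a bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


reads = ["GATTACAGG", "TTACAGGCT", "CCCCGGGG"]
buckets = bucket_by_minimizer(reads, 4)
# The first two reads overlap on "TTACAGG" and share the minimizer "ACAG".
```

Because two reads that overlap substantially are likely to contain the same smallest k-mer, they land in the same bucket, and each bucket can then be compressed on its own. As a side check of the reported figures, assuming Gb denotes 10^9 bases: 134.0 × 10^9 bases × 0.317 bits / 8 ≈ 5.31 × 10^9 bytes.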

preprint, 2013, arXiv

Efficient algorithms for the longest common subsequence in $k$-length substrings

Finding the longest common subsequence in $k$-length substrings (LCS$k$) is a recently proposed problem motivated by computational biology. It generalizes the well-known LCS problem: matching symbols from two sequences $A$ and $B$ are replaced with matching non-overlapping substrings of length $k$ from $A$ and $B$. We propose several algorithms for LCS$k$ that are non-trivial incarnations of the major concepts known from LCS research (dynamic programming, sparse dynamic programming, tabulation). Our algorithms use a linear-time and linear-space preprocessing step that finds the occurrences of all the substrings of length $k$ from one sequence in the other sequence.
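The LCS$k$ definition above can be made concrete with the plain quadratic dynamic program. This is only a baseline sketch for illustration, not one of the faster algorithms the abstract proposes; the function name `lcsk` and the table names are assumptions.

```python
def lcsk(a, b, k):
    """Maximum number of non-overlapping length-k substrings matched in order.

    Plain O(len(a) * len(b)) dynamic programming; with k = 1 it reduces to
    the classic LCS length.
    """
    n, m = len(a), len(b)
    # suffix[i][j]: length of the longest common suffix of a[:i] and b[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    # dp[i][j]: LCSk value for the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                suffix[i][j] = suffix[i - 1][j - 1] + 1
            best = max(dp[i - 1][j], dp[i][j - 1])
            if suffix[i][j] >= k:
                # A k-length match ends at (i, j); extend the best solution before it.
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]
```

For example, `lcsk("ABXCD", "ABYCD", 2)` counts the two matched blocks "AB" and "CD", while `lcsk("ABXCD", "ABYCD", 1)` gives the ordinary LCS length of 4.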

preprint, 2011, arXiv

Engineering Relative Compression of Genomes

Technological progress in DNA sequencing is driving ever faster growth of genomic databases. Compression accompanied by random-access capabilities is the key to maintaining those huge amounts of data. In this paper we present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at a compression speed more than an order of magnitude greater. One of the new successful ideas is augmenting the reference sequence with phrases from the other sequences, making more LZ-matches available.
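A minimal Python sketch of LZ77-style relative compression follows: the target sequence is greedily split into copies from a reference and single literals. It is a naive illustration, not the scheme from the paper; the function names, the phrase tuples, and the `min_match` threshold are assumptions, and the reference-augmentation idea is left out.

```python
def relative_factorize(reference, target, min_match=4):
    """Greedily parse `target` into copies from `reference` and literal symbols.

    Naive quadratic matching, for illustration only.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len >= min_match:
            phrases.append(("copy", best_pos, best_len))
            i += best_len
        else:
            # Match too short to pay off; emit one symbol verbatim.
            phrases.append(("literal", target[i]))
            i += 1
    return phrases


def relative_decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    out = []
    for phrase in phrases:
        if phrase[0] == "copy":
            _, pos, length = phrase
            out.append(reference[pos:pos + length])
        else:
            out.append(phrase[1])
    return "".join(out)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGATTGACCA"
phrases = relative_factorize(reference, target)
# phrases == [("copy", 0, 7), ("literal", "A"), ("copy", 8, 7)]
```

A single substitution in the target splits it into two long copies and one literal, which is why genomes of the same species compress so well against a shared reference.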