Source author record

Szymon Grabowski

Szymon Grabowski appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Data Structures and Algorithms; Computational Engineering, Finance, and Science; Information Theory; math.IT; Quantitative Methods

Catalog footprint

What is connected

20 works

5 topics

4 close collaborators


Preprint, 2020, arXiv

SOPanG 2: online searching over a pan-genome without false positives

Motivation: The pan-genome can be stored as an elastic-degenerate (ED) string, a recently introduced compact representation of multiple overlapping sequences. However, a search over the ED string does not indicate which individuals (if any) match the entire query. Results: We augment the ED string with sources (individuals' indexes) and propose an extension of the SOPanG (Shift-Or for Pan-Genome) tool to report only true positive matches, omitting those not occurring in any of the haplotypes. The additional stage for checking the matches incurs a speed penalty of less than 3.5% in practice, which means that SOPanG 2 is able to report pattern matches in a pan-genome, mapping them onto individuals, at a single-thread throughput of above 430 MB/s on real data. Availability and implementation: SOPanG 2 can be downloaded here: github.com/MrAlexSee/sopang
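For orientation, here is a minimal sketch of the classic Shift-Or technique that the SOPanG name refers to, applied to a plain string only. It does not handle elastic-degenerate strings or sources; the function name `shift_or_search` is hypothetical and not taken from the SOPanG code base.

```python
def shift_or_search(text, pattern):
    """Return start positions of exact occurrences of pattern in text.

    Plain-string Shift-Or only; an illustrative assumption, not SOPanG itself.
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    state = all_ones  # bit i is 0 when pattern[0..i] matches the text ending here
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)
    return hits
```

Each text character costs one shift and one OR on a machine word (here a Python integer), which is what makes the technique attractive for online searching.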

Preprint, 2016, arXiv

A practical index for approximate dictionary matching with few mismatches

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in $q$-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
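The Dirichlet (pigeonhole) principle mentioned above can be illustrated for a single mismatch: if a word is split into two parts, one mismatch can damage only one of them, so the other part must match exactly. The sketch below is an assumption-laden toy in Python, not the authors' compact C++ index; `build_split_index` and `query_one_mismatch` are hypothetical names.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_split_index(words):
    """Map (length, part number, part text) to the words containing that part."""
    index = defaultdict(set)
    for w in words:
        half = len(w) // 2
        index[(len(w), 0, w[:half])].add(w)
        index[(len(w), 1, w[half:])].add(w)
    return index


def query_one_mismatch(index, q):
    """Return dictionary words within Hamming distance 1 of q, sorted."""
    half = len(q) // 2
    # At most one half is hit by the mismatch, so the other half matches exactly.
    candidates = index.get((len(q), 0, q[:half]), set()) | \
        index.get((len(q), 1, q[half:]), set())
    return sorted(w for w in candidates if hamming(w, q) <= 1)
```

For k mismatches the same argument would call for k + 1 parts; this sketch covers only k = 1.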

Preprint, 2016, arXiv

Rank and select: Another lesson learned

Rank and select queries on bitmaps are essential building blocks of many compressed data structures, including text indexes, membership and range supporting spatial data structures, compressed graphs, and more. Considered theoretically as early as the 1980s, these primitives have also been a subject of active research concerning their practical incarnations in the last decade. We present a few novel rank/select variants, focusing mostly on speed, obtaining competitive space-time results in the compressed setting. Our findings can be summarized as follows: $(i)$ no single rank/select solution works best on any kind of data (ours are optimized for concatenated bit arrays obtained from wavelet trees for real text datasets), $(ii)$ it pays to efficiently handle blocks consisting of all 0 or all 1 bits, $(iii)$ compressed select does not have to be significantly slower than compressed rank at a comparable memory use.
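As a reminder of what the two primitives compute, the following is a deliberately simple, uncompressed sketch: rank uses precomputed per-block counts, and select is a binary search over rank. The class name `RankBitmap` and the block size are assumptions for illustration; none of the paper's compressed variants is reproduced here.

```python
class RankBitmap:
    """Uncompressed bitmap with block-sampled rank; illustrative only."""

    def __init__(self, bits, block=64):
        self.bits = list(bits)
        self.block = block
        # block_ranks[b] = number of 1s before block b.
        self.block_ranks = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1 bits in bits[0:i]."""
        b, r = divmod(i, self.block)
        start = b * self.block
        return self.block_ranks[b] + sum(self.bits[start:start + r])

    def select1(self, k):
        """Position of the k-th 1 bit (k is 1-based)."""
        n = len(self.bits)
        if k < 1 or k > self.rank1(n):
            raise ValueError("k out of range")
        lo, hi = 0, n
        # Find the smallest position i such that rank1(i + 1) >= k.
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo
```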

Preprint, 2016, arXiv

Suffix arrays with a twist

The suffix array is a classic full-text index, combining effectiveness with simplicity. We discuss three approaches aiming to improve its efficiency even more: changes to the navigation, data layout and adding extra data. In short, we show that $(i)$ how we search for the right interval boundary significantly impacts the overall search speed, $(ii)$ a B-tree data layout easily wins over the standard one, $(iii)$ the well-known idea of a lookup table for the prefixes of the suffixes can be refined using compression, $(iv)$ caching prefixes of the suffixes in a helper array can pose a(nother) practical space-time tradeoff.

Preprint, 2016, arXiv

Two simple full-text indexes based on the suffix array

We propose two suffix array inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches due to significantly narrowed search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to Mäkinen's compact suffix array, but working on fixed sized blocks. Experiments on the Pizza & Chili 200 MB datasets show that SA-hash is about 2 to 3 times faster in pattern searches (counts) than the standard suffix array, for the price of requiring $0.2n$ to $1.1n$ bytes of extra space, where $n$ is the text length, and setting a minimum pattern length. FBCSA is relatively fast in single cell accesses (a few times faster than related indexes at about the same or better compression), but not competitive if many consecutive cells are to be extracted. Still, for the task of extracting, e.g., 10 successive cells its time-space relation remains attractive.
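The SA-hash idea described above can be sketched as follows: a table keyed by the first k symbols of each suffix stores the corresponding suffix array interval, and the binary search then runs only inside that interval. This is a toy under stated assumptions (a Python dict as the hash table, a naive suffix array construction, hypothetical names `build_sa_hash` and `count_occurrences`), not the authors' implementation.

```python
def build_sa_hash(text, k):
    """Build a suffix array and a table: k-symbol prefix -> SA interval [lo, hi)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    table = {}
    for rank, pos in enumerate(sa):
        prefix = text[pos:pos + k]
        if len(prefix) == k:  # suffixes shorter than k are not indexed
            lo, _ = table.get(prefix, (rank, rank))
            table[prefix] = (lo, rank + 1)
    return sa, table


def count_occurrences(text, sa, table, k, pattern):
    """Count pattern occurrences; the pattern must have at least k symbols."""
    if len(pattern) < k:
        raise ValueError("pattern shorter than the minimum length k")
    lo, hi = table.get(pattern[:k], (0, 0))
    m = len(pattern)
    # Left boundary: first suffix in [lo, hi) whose m-prefix is >= pattern.
    left, right = lo, hi
    while left < right:
        mid = (left + right) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            left = mid + 1
        else:
            right = mid
    first = left
    # Right boundary: first suffix whose m-prefix is > pattern.
    right = hi
    while left < right:
        mid = (left + right) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            left = mid + 1
        else:
            right = mid
    return left - first
```

The minimum pattern length noted in the abstract shows up here as the `len(pattern) < k` check: shorter patterns cannot be looked up in the table.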

Preprint, 2015, arXiv

A bloated FM-index reducing the number of cache misses during the search

The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in the sorted suffix order rather than text order in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long patterns), yet the space requirements are enormous, $O(n\log^2 n)$ bits in theory and about $80n$-$95n$ bytes in practice. For this reason, we dub our approach FM-bloated. The second presented variant requires $O(n\log n)$ bits of space.

Preprint, 2015, arXiv

A Bloom filter based semi-index on $q$-grams

We present a simple $q$-gram based semi-index, which typically allows searching for a pattern in only a small fraction of text blocks. Several space-time tradeoffs are presented. Experiments on Pizza & Chili datasets show that our solution is up to three orders of magnitude faster than the semi-index of Claude et al. at a comparable space usage.
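One way to picture such a semi-index: the text is cut into blocks, each block gets a Bloom filter over its $q$-grams, and a pattern is searched only in blocks whose filter contains all of the pattern's $q$-grams. The sketch below is an illustration under assumptions that are not from the paper: the filter parameters, the block overlap of `max_m - 1` symbols, and the names `Bloom`, `build_semi_index` and `search_semi_index` are all hypothetical.

```python
class Bloom:
    """Tiny Bloom filter backed by a Python integer used as a bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        return [hash((seed, item)) % self.size for seed in range(self.hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))


def build_semi_index(text, block, q, max_m):
    """One filter per block; each block is extended by max_m - 1 symbols so that
    any occurrence starting in the block has all its q-grams in the filter."""
    filters = []
    for start in range(0, len(text), block):
        chunk = text[start:start + block + max_m - 1]
        f = Bloom()
        for i in range(len(chunk) - q + 1):
            f.add(chunk[i:i + q])
        filters.append(f)
    return filters


def search_semi_index(text, filters, block, q, pattern):
    """Exact positions of pattern; requires q <= len(pattern) <= max_m."""
    grams = {pattern[i:i + q] for i in range(len(pattern) - q + 1)}
    hits = []
    for b, f in enumerate(filters):
        if all(g in f for g in grams):  # may be a false positive, never a miss
            start = b * block
            window = text[start:start + block + len(pattern) - 1]
            pos = window.find(pattern)
            while pos != -1:
                hits.append(start + pos)
                pos = window.find(pattern, pos + 1)
    return hits
```

Because a Bloom filter never produces false negatives, the verification scan keeps the result exact; false positives only cost extra scanning.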

Preprint, 2015, arXiv

A note on the longest common Abelian factor problem

Abelian string matching problems have attracted considerable interest in recent years. Very recently, Alatabbi et al. presented the first solution for the longest common Abelian factor problem for a pair of strings, reaching $O(σn^2)$ time with $O(σn \log n)$ bits of space, where $n$ is the length of the strings and $σ$ is the alphabet size. In this note we show how the time complexity can be preserved while the space is reduced by a factor of $σ$, and then how the time complexity can be improved, if the alphabet is not too small, when superlinear space is allowed.
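To make the problem statement concrete: two factors are Abelian-equivalent when they contain the same symbols with the same multiplicities. The brute-force sketch below slides a window of each candidate length over both strings and compares the resulting symbol-count vectors. It is a baseline for illustration only, not the algorithm of the note or of Alatabbi et al.; the function names are hypothetical.

```python
from collections import Counter


def parikh_windows(x, length):
    """Set of symbol-count vectors of all windows of the given length in x."""
    counts = Counter(x[:length])
    seen = {frozenset(counts.items())}
    for i in range(length, len(x)):
        counts[x[i]] += 1
        out = x[i - length]
        counts[out] -= 1
        if counts[out] == 0:
            del counts[out]  # keep vectors comparable: no zero entries
        seen.add(frozenset(counts.items()))
    return seen


def lcaf_length(s, t):
    """Length of the longest common Abelian factor of s and t (0 if none)."""
    for length in range(min(len(s), len(t)), 0, -1):
        if parikh_windows(s, length) & parikh_windows(t, length):
            return length
    return 0
```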

Preprint, 2015, arXiv

FM-index for dummies

The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, we can observe a surge of interest in practical FM-index variants in the last few years. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly, implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2 to 3 times faster than the fastest known ones, for the price of using typically 1.5 to 5 times more space.
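The role of rank in an FM-index can be seen in the textbook backward search, sketched below with fully tabulated (uncompressed) occurrence counts. This is the standard counting procedure, not the cache-friendly rank layout proposed in the paper; `build_fm` and `fm_count` are hypothetical names, and the sentinel `"\0"` is assumed not to occur in the text.

```python
def build_fm(text):
    """Return (C, occ, n) for text with a sentinel appended.

    C[c]      = number of symbols in the text smaller than c
    occ[c][i] = number of occurrences of c in bwt[:i] (a rank query)
    """
    text += "\0"  # sentinel, assumed smaller than every other symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return C, occ, len(bwt)


def fm_count(C, occ, n, pattern):
    """Count occurrences of a non-empty pattern via backward search."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]  # two rank queries per pattern symbol
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol triggers two rank lookups at unrelated positions of the BWT, which is why the speed of rank dominates the search time.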

Preprint, 2014, arXiv

A note on the longest common substring with $k$-mismatches problem

The recently introduced longest common substring with $k$-mismatches ($k$-LCF) problem is to find, given two sequences $S_1$ and $S_2$ of length $n$ each, a longest substring $A_1$ of $S_1$ and $A_2$ of $S_2$ such that the Hamming distance between $A_1$ and $A_2$ is at most $k$. So far, the only subquadratic time result for this problem was known for $k = 1$. We first present two output-dependent algorithms solving the $k$-LCF problem and show that for $k = O(\log^{1-\varepsilon} n)$, where $\varepsilon > 0$, at least one of them works in subquadratic time, using $O(n)$ words of space. The choice of one of these two algorithms to be applied for a given input can be done after linear time and space preprocessing. Finally we present a tabulation-based algorithm working, in its range of applicability, in $O(n^2\log\min(k+\ell_0, σ)/\log n)$ time, where $\ell_0$ is the length of the standard longest common substring.
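For reference, the problem definition admits a simple quadratic-time baseline: walk every diagonal of the alignment matrix with a sliding window that holds at most $k$ mismatches. The sketch below implements only this baseline, not the output-dependent or tabulation-based algorithms of the note; `klcf_length` is a hypothetical name.

```python
def klcf_length(s1, s2, k):
    """Length of the longest pair of substrings at Hamming distance <= k."""
    best = 0
    n1, n2 = len(s1), len(s2)
    for d in range(-(n2 - 1), n1):
        # Diagonal d pairs s1[i + t] with s2[j + t].
        i, j = (d, 0) if d >= 0 else (0, -d)
        length = min(n1 - i, n2 - j)
        left = mismatches = 0
        for right in range(length):
            mismatches += s1[i + right] != s2[j + right]
            while mismatches > k:  # shrink the window from the left
                mismatches -= s1[i + left] != s2[j + left]
                left += 1
            best = max(best, right - left + 1)
    return best
```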

Preprint, 2014, arXiv

Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012), where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage. Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing the 134.0 Gb dataset to fit into only 5.31 GB of space. Availability: http://sun.aei.polsl.pl/orcom under a free license.
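The minimizer idea can be illustrated in a few lines: the minimizer of a read is its lexicographically smallest $k$-mer, and overlapping reads often share it, so grouping reads by minimizer tends to bring similar reads together. The sketch below shows only this grouping step under simplifying assumptions (plain lexicographic order, no reverse complements, hypothetical function names); it is not the ORCOM pipeline.

```python
from collections import defaultdict


def minimizer(read, k):
    """Lexicographically smallest k-mer of the read (assumes len(read) >= k)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by minimizer; each bucket could then be compressed on its own."""
    buckets = defaultdict(list)
    for r in reads:
        buckets[minimizer(r, k)].append(r)
    return dict(buckets)
```

Because each read is routed by a value computed from that read alone, the bucketing step parallelizes easily, which matches the "easily parallelizable" remark in the abstract.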

Preprint, 2014, arXiv

Motif matching using gapped patterns

We present new algorithms for the problem of multiple string matching of gapped patterns, where a gapped pattern is a sequence of strings such that there is a gap of fixed length between each two consecutive strings. The problem has applications in the discovery of transcription factor binding sites in DNA sequences when using generalized versions of the Position Weight Matrix model to describe transcription factor specificities. In these models a motif can be matched as a set of gapped patterns with unit-length keywords. The existing algorithms for matching a set of gapped patterns are worst-case efficient but not practical, or vice versa, in this particular case. The novel algorithms that we present are based on dynamic programming and bit-parallelism, and lie in a middle-ground among the existing algorithms. In fact, their time complexity is close to the best existing bound and, yet, they are also practical. We also provide experimental results which show that the presented algorithms are fast in practice, and preferable if all the strings in the patterns have unit-length.

Preprint, 2014, arXiv

New tabulation and sparse dynamic programming based techniques for sequence similarity problems

Calculating the length of a longest common subsequence (LCS) of two strings $A$ and $B$ of length $n$ and $m$ is a classic research topic, with many worst-case oriented results known. We present two algorithms for LCS length calculation with respectively $O(mn \log\log n / \log^2 n)$ and $O(mn / \log^2 n + r)$ time complexity, the latter working for $r = o(mn / (\log n \log\log n))$, where $r$ is the number of matches in the dynamic programming matrix. We also describe conditions for a given problem sufficient to apply our techniques, with several concrete examples presented, namely the edit distance, LCTS and MerLCS problems.

Preprint, 2014, arXiv

Sampling the suffix array with minimizers

Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared to the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data.

Preprint, 2013, arXiv

Approximate pattern matching with k-mismatches in packed text

Given strings $P$ of length $m$ and $T$ of length $n$ over an alphabet of size $σ$, the string matching with $k$-mismatches problem is to find the positions of all the substrings in $T$ that are at Hamming distance at most $k$ from $P$. If $T$ can be read only one character at a time, the best known bounds are $O(n\sqrt{k\log k})$ and $O(n + n\sqrt{k/w}\log k)$ in the word-RAM model with word length $w$. In the RAM models (including $AC^0$ and word-RAM) it is possible to read up to $\lfloor w / \log σ \rfloor$ characters in constant time if the characters of $T$ are encoded using $\lceil \log σ \rceil$ bits. The only solution for $k$-mismatches in packed text works in $O((n \log σ/\log n)\lceil m \log (k + \log n / \log σ) / w \rceil + n^{\varepsilon})$ time, for any $\varepsilon > 0$. We present an algorithm that runs in time $O(\frac{n}{\lfloor w/(m\log σ) \rfloor} (1 + \log \min(k,σ) \log m / \log σ))$ in the $AC^0$ model if $m=O(w / \log σ)$ and $T$ is given packed. We also describe a simpler variant that runs in time $O(\frac{n}{\lfloor w/(m\log σ) \rfloor}\log \min(m, \log w / \log σ))$ in the word-RAM model. The algorithms improve the existing bound for $w = Ω(\log^{1+ε} n)$, for any $ε > 0$. Based on the introduced technique, we present algorithms for several other approximate matching problems.

Preprint, 2013, arXiv

Efficient algorithms for the longest common subsequence in $k$-length substrings

Finding the longest common subsequence in $k$-length substrings (LCS$k$) is a recently proposed problem motivated by computational biology. This is a generalization of the well-known LCS problem in which matching symbols from two sequences $A$ and $B$ are replaced with matching non-overlapping substrings of length $k$ from $A$ and $B$. We propose several algorithms for LCS$k$, being non-trivial incarnations of the major concepts known from LCS research (dynamic programming, sparse dynamic programming, tabulation). Our algorithms make use of a linear-time and linear-space preprocessing finding the occurrences of all the substrings of length $k$ from one sequence in the other sequence.
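The LCS$k$ definition above leads directly to a quadratic dynamic programming recurrence, sketched here as a baseline: a cell may extend a solution by one matching pair of length-$k$ substrings that ends at the current positions. This is the plain recurrence with naive substring comparison, not the sparse or tabulation-based algorithms of the paper; `lcsk_length` is a hypothetical name.

```python
def lcsk_length(a, b, k):
    """Maximum number of matching, non-overlapping length-k substring pairs
    taken in the same order from a and b (k >= 1)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            # Try to end a matched pair of k-length substrings at (i, j).
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                dp[i][j] = max(dp[i][j], dp[i - k][j - k] + 1)
    return dp[n][m]
```

With k = 1 the recurrence reduces to the classic LCS length computation.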

Preprint, 2013, arXiv

New algorithms for binary jumbled pattern matching

Given a pattern $P$ and a text $T$, both strings over a binary alphabet, the binary jumbled string matching problem consists in telling whether any permutation of $P$ occurs in $T$. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are $O(n^2/\log^2 n)$ (with $O(n)$ space and $O(1)$ query time) and $O(r^2\log r)$ (with $O(|L|)$ space and $O(\log|L|)$ query time), where $r$ is the length of the run-length encoding of $T$ and $|L| = O(n)$ is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. that obtains a trade-off between the space and the time complexity. It has $O(r^2\log k + n/k)$ complexity to build the index, $O(\log k)$ query time, and uses $O(n/k + |L|)$ space, where $k$ is a parameter. The second result is an $O(n^2 \log^2 w / w)$ algorithm (with $O(n)$ space and $O(1)$ query time), based on word-level parallelism where $w$ is the word size in bits.
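The $O(n)$-space, $O(1)$-query index mentioned above rests on a well-known property of binary strings: for a fixed window length, the numbers of 1s attained by the windows form a contiguous range, so storing the minimum and maximum per length suffices. The sketch below builds that table naively in quadratic time; it illustrates the index contents only, not the faster constructions discussed in the paper, and the function names are hypothetical.

```python
def build_jumbled_index(t):
    """For each window length m, store the min and max number of 1s in t."""
    n = len(t)
    prefix = [0]
    for c in t:
        prefix.append(prefix[-1] + (c == "1"))
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        lo[m], hi[m] = min(counts), max(counts)
    return lo, hi


def jumbled_occurs(index, p):
    """True if some permutation of the binary pattern p occurs in the text."""
    lo, hi = index
    m = len(p)
    if not 1 <= m < len(lo):
        return False
    # Sliding the window by one position changes the count of 1s by at most 1,
    # so every value between the minimum and the maximum is attained.
    return lo[m] <= p.count("1") <= hi[m]
```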

Preprint, 2011, arXiv

Engineering Relative Compression of Genomes

Technological progress in DNA sequencing is driving genomic database growth at an ever faster rate. Compression, accompanied by random access capabilities, is the key to maintaining those huge amounts of data. In this paper we present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. One of the new successful ideas is augmenting the reference sequence with phrases from the other sequences, making more LZ-matches available.
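Relative compression in the LZ77 style can be pictured as follows: the target genome is parsed greedily into phrases, each being the longest match found in a reference sequence, with a literal symbol emitted when nothing matches. The sketch below uses a naive quadratic match search and omits the paper's reference augmentation and encoding details; `rlz_compress` and `rlz_decompress` are hypothetical names.

```python
def rlz_compress(reference, target):
    """Greedy parse of target into (position, length) phrases over reference.

    Symbols absent from the reference are emitted as one-character literals.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):  # naive longest-match search
            length = 0
            while (p + length < len(reference) and i + length < len(target)
                   and reference[p + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = p, length
        if best_len == 0:
            phrases.append(target[i])
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decompress(reference, phrases):
    """Rebuild the target from literals and (position, length) phrases."""
    return "".join(
        p if isinstance(p, str) else reference[p[0]:p[0] + p[1]]
        for p in phrases
    )
```

Since every phrase points into the fixed reference rather than into earlier output, any phrase can be decoded on its own, which is what makes random access practical in this family of schemes.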

Preprint, 2011, arXiv

Tight and simple Web graph compression

Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is however hampered by the necessity of storing a major part of huge graphs in the external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to represent Web graphs succinctly but also providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in the successive lists, more general grammar-based transformations or 2-dimensional representations of the binary matrix of the graph. In this paper we present two Web graph compression algorithms. The first can be seen as engineering of the Boldi and Vigna (2004) method. We extend the notion of similarity between link lists, and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lines) and sacrifices access time for better compression ratio, achieving more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size, in the number of input lines, and its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space-time tradeoffs.

Preprint, 2010, arXiv

String Matching with Inversions and Translocations in Linear Average Time (Most of the Time)

We present an efficient algorithm for finding all approximate occurrences of a given pattern $p$ of length $m$ in a text $t$ of length $n$ allowing for translocations of equal length adjacent factors and inversions of factors. The algorithm is based on an efficient filtering method and has an $O(nm\max(α, β))$-time complexity in the worst case and $O(\max(α, β))$-space complexity, where $α$ and $β$ are respectively the maximum length of the factors involved in any translocation and inversion. Moreover we show that under the assumptions of equiprobability and independence of characters our algorithm has an $O(n)$ average time complexity, whenever $σ = Ω(\log m / \log\log^{1-ε} m)$, where $ε > 0$ and $σ$ is the size of the alphabet. Experiments show that the newly proposed algorithm achieves very good results in practical cases.
