Researcher profile

Dominik Köppl

Dominik Köppl contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2022arXiv

Computing Longest (Common) Lyndon Subsequences

Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 σ)$ time using $O(n^3)$ space.

preprint2022arXiv

Computing NP-hard Repetitiveness Measures via MAX-SAT

Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors.

preprint2022arXiv

Computing the Parameterized Burrows--Wheeler Transform Online

Parameterized strings are a generalization of strings in that their characters are drawn from two different alphabets, where one is considered to be the alphabet of static characters and the other to be the alphabet of parameter characters. Two parameterized strings are a parameterized match if there is a bijection over all characters such that the bijection transforms one string to the other while keeping the static characters (i.e., it behaves as the identity on the static alphabet). Ganguly et al. [SODA 2017] proposed the parameterized Burrows--Wheeler transform (pBWT) as a variant of the Burrows--Wheeler transform for space-efficient parameterized pattern matching. In this paper, we propose an algorithm for computing the pBWT online by reading the characters of a given input string one-by-one from right to left. Our algorithm works in $O(|Π| \log n / \log \log n)$ amortized time for each input character, where $n$ and $Π$ denote the size of the input string and the alphabet of the parameter characters, respectively.

preprint2022arXiv

Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach showing that ours runs always at least twice faster (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.

preprint2021arXiv

PHONI: Streamed Matching Statistics with Multi-Genome References

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.

preprint2020arXiv

Dynamic Path-Decomposed Tries

A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones, while achieving a relevant space-time tradeoff.

preprint2020arXiv

Grammar-compressed Self-index with Lyndon Words

We introduce a new class of straight-line programs (SLPs), named the Lyndon SLP, inspired by the Lyndon trees (Barcelo, 1990). Based on this SLP, we propose a self-index data structure of $O(g)$ words of space that can be built from a string $T$ in $O(n \lg n)$ expected time, retrieving the starting positions of all occurrences of a pattern $P$ of length $m$ in $O(m + \lg m \lg n + occ \lg g)$ time, where $n$ is the length of $T$, $g$ is the size of the Lyndon SLP for $T$, and $occ$ is the number of occurrences of $P$ in $T$.

preprint2020arXiv

In-Place Bijective Burrows-Wheeler Transforms

One of the most well-known variants of the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994] is the bijective BWT (BBWT) [Gil and Scott, arXiv 2012], which applies the extended BWT (EBWT) [Mantaci et al., TCS 2007] to the multiset of Lyndon factors of a given text. Since the EBWT is invertible, the BBWT is a bijective transform in the sense that the inverse image of the EBWT restores this multiset of Lyndon factors such that the original text can be obtained by sorting these factors in non-increasing order. In this paper, we present algorithms constructing or inverting the BBWT in-place using quadratic time. We also present conversions from the BBWT to the BWT, or vice versa, either (a) in-place using quadratic time, or (b) in the run-length compressed setting using $O(n \lg r / \lg \lg r)$ time with $O(r \lg n)$ bits of words, where $r$ is the sum of character runs in the BWT and the BBWT.

preprint2020arXiv

Space-Efficient Algorithms for Computing Minimal/Shortest Unique Substrings

Given a string $T$ of length $n$, a substring $u = T[i..j]$ of $T$ is called a shortest unique substring (SUS) for an interval $[s,t]$ if (a) $u$ occurs exactly once in $T$, (b) $u$ contains the interval $[s,t]$ (i.e. $i \leq s \leq t \leq j$), and (c) every substring $v$ of $T$ with $|v| < |u|$ containing $[s,t]$ occurs at least twice in $T$. Given a query interval $[s, t] \subset [1, n]$, the interval SUS problem is to output all the SUSs for the interval $[s,t]$. In this article, we propose a $4n + o(n)$ bits data structure answering an interval SUS query in output-sensitive $O(\mathit{occ})$ time, where $\mathit{occ}$ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for $s = t$. Here, we propose a $\lceil (\log_2{3} + 1)n \rceil + o(n)$ bits data structure answering a point SUS query in the same output-sensitive time. We also propose space-efficient algorithms for computing the minimal unique substrings of $T$.