Paweł Gawrychowski

Topics: Data Structures and Algorithms; Formal Languages and Automata Theory; Computational Complexity; Computational Geometry; Discrete Mathematics; quant-ph

Catalog footprint: 35 works, 6 topics, 4 close collaborators

preprint, 2022, arXiv

Cut query algorithms with star contraction

We study the complexity of determining the edge connectivity of a simple graph with cut queries. We show that (i) there is a bounded-error randomized algorithm that computes edge connectivity with $O(n)$ cut queries, and (ii) there is a bounded-error quantum algorithm that computes edge connectivity with $\tilde{O}(\sqrt{n})$ cut queries. We prove these results using a new technique called "star contraction" to randomly contract edges of a graph while preserving non-trivial minimum cuts. In star contraction, vertices randomly contract an edge incident on a small set of randomly chosen vertices. In contrast to the related 2-out contraction technique of Ghaffari, Nowicki, and Thorup [SODA'20], star contraction only contracts vertex-disjoint star subgraphs, which allows it to be efficiently implemented via cut queries. The $O(n)$ bound from item (i) was not known even for the simpler problem of connectivity, and improves the $O(n\log^3 n)$ bound by Rubinstein, Schramm, and Weinberg [ITCS'18]. The bound is tight under the reasonable conjecture that the randomized communication complexity of connectivity is $Ω(n\log n)$, an open question since the seminal work of Babai, Frankl, and Simon [FOCS'86]. The bound also excludes using edge connectivity on simple graphs to prove a superlinear randomized query lower bound for minimizing a symmetric submodular function. Item (ii) gives a nearly-quadratic separation with the randomized complexity and addresses an open question of Lee, Santha, and Zhang [SODA'21]. The algorithm can also be viewed as making $\tilde{O}(\sqrt{n})$ matrix-vector multiplication queries to the adjacency matrix. Finally, we demonstrate the use of star contraction outside of the cut query setting by designing a one-pass semi-streaming algorithm for computing edge connectivity in the vertex arrival setting. This contrasts with the edge arrival setting where two passes are required.

preprint, 2022, arXiv

Matching Patterns with Variables Under Edit Distance

A pattern $α$ is a string of variables and terminal letters. We say that $α$ matches a word $w$, consisting only of terminal letters, if $w$ can be obtained by replacing the variables of $α$ by terminal words. The matching problem, i.e., deciding whether a given pattern matches a given word, was heavily investigated: it is NP-complete in general, but can be solved efficiently for classes of patterns with restricted structure. If we are interested in what is the minimum Hamming distance between $w$ and any word $u$ obtained by replacing the variables of $α$ by terminal words (so matching under Hamming distance), one can devise efficient algorithms and matching conditional lower bounds for the class of regular patterns (in which no variable occurs twice), as well as for classes of patterns where we allow unbounded repetitions of variables, but restrict the structure of the pattern, i.e., the way the occurrences of different variables can be interleaved. Moreover, under Hamming distance, if a variable occurs more than once and its occurrences can be interleaved arbitrarily with those of other variables, even if each of these occurs just once, the matching problem is intractable. In this paper, we consider the problem of matching patterns with variables under edit distance. We still obtain efficient algorithms and matching conditional lower bounds for the class of regular patterns, but show that the problem becomes, in this case, intractable already for unary patterns, consisting of repeated occurrences of a single variable interleaved with terminals.

preprint, 2022, arXiv

The Dynamic k-Mismatch Problem

The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length $m$ and all length-$m$ substrings of a given text of length $n\ge m$. We focus on the $k$-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold $k$. We assume $n\le 2m$ (in general, one can partition the text into overlapping blocks). In this work, we show data structures for the dynamic version of this problem supporting two operations: An update performs a single-letter substitution in the pattern or the text, and a query, given an index $i$, returns the Hamming distance between the pattern and the text substring starting at position $i$, or reports that it exceeds $k$. First, we show a data structure with $\tilde{O}(1)$ update and $\tilde{O}(k)$ query time. Then we show that $\tilde{O}(k)$ update and $\tilde{O}(1)$ query time is also possible. These two provide an optimal trade-off for the dynamic $k$-mismatch problem with $k \le \sqrt{n}$: we prove that, conditioned on the strong 3SUM conjecture, one cannot simultaneously achieve $k^{1-Ω(1)}$ time for all operations. For $k\ge \sqrt{n}$, we give another lower bound, conditioned on the Online Matrix-Vector conjecture, that excludes algorithms taking $n^{1/2-Ω(1)}$ time per operation. This is tight for constant-sized alphabets: Clifford et al. (STACS 2018) achieved $\tilde{O}(\sqrt{n})$ time per operation in that case, but with $\tilde{O}(n^{3/4})$ time per operation for large alphabets. We improve and extend this result with an algorithm that, given $1\le x\le k$, achieves update time $\tilde{O}(\frac{n}{k} +\sqrt{\frac{nk}{x}})$ and query time $\tilde{O}(x)$. In particular, for $k\ge \sqrt{n}$, an appropriate choice of $x$ yields $\tilde{O}(\sqrt[3]{nk})$ time per operation, which is $\tilde{O}(n^{2/3})$ when no threshold $k$ is provided.
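To make the problem statement concrete, here is a minimal static baseline for the $k$-mismatch problem. It is not one of the data structures from the abstract: it simply recomputes every alignment in $O(nm)$ worst-case time, and the function name `k_mismatch` is a hypothetical choice for illustration.

```python
def k_mismatch(text, pattern, k):
    """Naive baseline (not the paper's data structure).

    For every alignment i, return the Hamming distance between the pattern
    and text[i:i+m] if it is at most k, and None if it exceeds k.
    """
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # the threshold is exceeded; stop early
        result.append(mismatches if mismatches <= k else None)
    return result
```

A dynamic version would have to keep such answers available after single-letter substitutions without rescanning, which is where the update and query trade-offs described above come in.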

preprint, 2021, arXiv

An Almost Optimal Edit Distance Oracle

We consider the problem of preprocessing two strings $S$ and $T$, of lengths $m$ and $n$, respectively, in order to be able to efficiently answer the following queries: Given positions $i,j$ in $S$ and positions $a,b$ in $T$, return the optimal alignment of $S[i \mathinner{.\,.} j]$ and $T[a \mathinner{.\,.} b]$. Let $N=mn$. We present an oracle with preprocessing time $N^{1+o(1)}$ and space $N^{1+o(1)}$ that answers queries in $\log^{2+o(1)}N$ time. In other words, we show that we can query the alignment of every two substrings in almost the same time it takes to compute just the alignment of $S$ and $T$. Our oracle uses ideas from our distance oracle for planar graphs [STOC 2019] and exploits the special structure of the alignment graph. Conditioned on popular hardness conjectures, this result is optimal up to subpolynomial factors. Our results apply to both edit distance and longest common subsequence (LCS). The best previously known oracle with construction time and size $\mathcal{O}(N)$ has slow $Ω(\sqrt{N})$ query time [Sakai, TCS 2019], and the one with size $N^{1+o(1)}$ and query time $\log^{2+o(1)}N$ (using a planar graph distance oracle) has slow $Ω(N^{3/2})$ construction time [Long & Pettie, SODA 2021]. We improve both approaches by roughly a $\sqrt N$ factor.

preprint, 2021, arXiv

Conditional Lower Bounds for Variants of Dynamic LIS

In this note, we consider the complexity of maintaining the longest increasing subsequence (LIS) of an array under (i) inserting an element, and (ii) deleting an element of an array. We show that no algorithm can support queries and updates in time $\mathcal{O}(n^{1/2-ε})$ and $\mathcal{O}(n^{1/3-ε})$ for the dynamic LIS problem, for any constant $ε>0$, when the elements are weighted or the algorithm supports 1D-queries (on subarrays), respectively, assuming the All-Pairs Shortest Paths (APSP) conjecture or the Online Boolean Matrix-Vector Multiplication (OMv) conjecture. The main idea in our construction comes from the work of Abboud and Dahlgaard [FOCS 2016], who proved conditional lower bounds for dynamic planar graph algorithms. However, this needs to be appropriately adjusted and translated to obtain an instance of the dynamic LIS problem.

preprint, 2021, arXiv

Fault-Tolerant Distance Labeling for Planar Graphs

In fault-tolerant distance labeling we wish to assign short labels to the vertices of a graph $G$ such that from the labels of any three vertices $u,v,f$ we can infer the $u$-to-$v$ distance in the graph $G\setminus \{f\}$. We show that any directed weighted planar graph (and in fact any graph in a graph family with $O(\sqrt{n})$-size separators, such as minor-free graphs) admits fault-tolerant distance labels of size $O(n^{2/3})$. We extend these labels in a way that allows us to also count the number of shortest paths, and provide additional upper and lower bounds for labels and oracles for counting shortest paths.

preprint, 2021, arXiv

Strictly In-Place Algorithms for Permuting and Inverting Permutations

We revisit the problem of permuting an array of length $n$ according to a given permutation in place, that is, using only a small number of bits of extra storage. Fich, Munro and Poblete [FOCS 1990, SICOMP 1995] obtained an elegant $\mathcal{O}(n\log n)$-time algorithm using only $\mathcal{O}(\log^{2}n)$ bits of extra space for this basic problem by designing a procedure that scans the permutation and outputs exactly one element from each of its cycles. However, in the strict sense in place should be understood as using only an asymptotically optimal $\mathcal{O}(\log n)$ bits of extra space, or storing a constant number of indices. The problem of permuting in this version is, in fact, a well-known interview question, with the expected solution being a quadratic-time algorithm. Surprisingly, no faster algorithm seems to be known in the literature. Our first contribution is a strictly in-place generalisation of the method of Fich et al. that works in $\mathcal{O}_{\varepsilon}(n^{1+\varepsilon})$ time, for any $\varepsilon > 0$. Then, we build on this generalisation to obtain a strictly in-place algorithm for inverting a given permutation on $n$ elements working in the same complexity. This is a significant improvement on a recent result of Guśpiel [arXiv 2019], who designed an $\mathcal{O}(n^{1.5})$-time algorithm.
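The quadratic-time solution mentioned in the abstract as the expected interview answer can be sketched as follows. This is the folklore cycle-leader approach, not the $\mathcal{O}_{\varepsilon}(n^{1+\varepsilon})$ algorithm of the paper. The convention that the element at position `i` moves to position `perm[i]`, and the name `permute_in_place`, are assumptions made for illustration.

```python
def permute_in_place(a, perm):
    """Rearrange a so that the element originally at i ends up at perm[i].

    Folklore cycle-leader method: perm is only read, and the extra storage is
    a constant number of indices plus one carried element. Worst-case time is
    quadratic, because each index walks its cycle to test leadership.
    """
    n = len(a)
    for i in range(n):
        # i is the leader of its cycle iff it is the smallest index on it.
        j = perm[i]
        while j > i:
            j = perm[j]
        if j < i:
            continue  # a smaller index leads this cycle; already handled
        # Rotate the values along the cycle that starts at i.
        carried = a[i]
        j = perm[i]
        while j != i:
            a[j], carried = carried, a[j]
            j = perm[j]
        a[i] = carried
```

The leadership test is what costs the extra factor: a single long cycle is walked almost completely from many of its indices, which is the behaviour that a faster leader-election procedure avoids.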

preprint, 2020, arXiv

A Faster Subquadratic Algorithm for the Longest Common Increasing Subsequence Problem

The Longest Common Increasing Subsequence (LCIS) is a variant of the classical Longest Common Subsequence (LCS), in which we additionally require the common subsequence to be strictly increasing. While the well-known "Four Russians" technique can be used to find LCS in subquadratic time, it does not seem applicable to LCIS. Recently, Duraj [STACS 2020] used a completely different method based on the combinatorial properties of LCIS to design an $\mathcal{O}(n^2(\log\log n)^2/\log^{1/6}n)$ time algorithm. We show that an approach based on exploiting tabulation can be used to construct an asymptotically faster $\mathcal{O}(n^2 \log\log n/\sqrt{\log n})$ time algorithm. As our solution avoids using the specific combinatorial properties of LCIS, it can be also adapted for the Longest Common Weakly Increasing Subsequence (LCWIS).
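For reference, the classical quadratic dynamic program for LCIS, which the subquadratic algorithms above improve on, can be sketched as below. This is the textbook baseline and not the tabulation-based algorithm of the paper; the name `lcis_length` is a hypothetical choice.

```python
def lcis_length(a, b):
    """Length of the longest common strictly increasing subsequence.

    Textbook O(len(a) * len(b)) dynamic program; dp[j] is the length of the
    best common increasing subsequence that ends exactly at b[j].
    """
    dp = [0] * len(b)
    for x in a:
        best = 0  # best dp[j'] seen so far in this scan with b[j'] < x
        for j, y in enumerate(b):
            if y == x:
                dp[j] = max(dp[j], best + 1)
            elif y < x:
                best = max(best, dp[j])
    return max(dp, default=0)
```

Replacing the strict test `y < x` with `y <= x` (and folding the equal case into both branches) would be the natural starting point for the weakly increasing variant, although that change is not shown here.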

preprint, 2020, arXiv

A Note on a Recent Algorithm for Minimum Cut

Given an undirected edge-weighted graph $G=(V,E)$ with $m$ edges and $n$ vertices, the minimum cut problem asks to find a subset of vertices $S$ such that the total weight of all edges between $S$ and $V \setminus S$ is minimized. Karger's longstanding $O(m \log^3 n)$ time randomized algorithm for this problem was very recently improved in two independent works to $O(m \log^2 n)$ [ICALP'20] and to $O(m \log^2 n + n\log^5 n)$ [STOC'20]. These two algorithms use different approaches and techniques. In particular, while the former is faster, the latter has the advantage that it can be used to obtain efficient algorithms in the cut-query and in the streaming models of computation. In this paper, we show how to simplify and improve the algorithm of [STOC'20] to $O(m \log^2 n + n\log^3 n)$. We obtain this by replacing a randomized algorithm that, given a spanning tree $T$ of $G$, finds in $O(m \log n+n\log^4 n)$ time a minimum cut of $G$ that 2-respects (cuts two edges of) $T$ with a simple $O(m \log n+n\log^2 n)$ time deterministic algorithm for the same problem.

preprint, 2020, arXiv

Efficient Labeling for Reachability in Digraphs

We consider labeling nodes of a directed graph for reachability queries. A reachability labeling scheme for such a graph assigns a binary string, called a label, to each node. Then, given the labels of nodes $u$ and $v$ and no other information about the underlying graph, it should be possible to determine whether there exists a directed path from $u$ to $v$. By a simple information theoretical argument and invoking the bound on the number of partial orders, in any scheme some labels need to consist of at least $n/4$ bits, where $n$ is the number of nodes. On the other hand, it is not hard to design a scheme with labels consisting of $n/2+O(\log n)$ bits. In the classical centralised setting, Munro and Nicholson designed a data structure for reachability queries consisting of $n^2/4+o(n^2)$ bits (which is optimal, up to the lower order term). We extend their approach to obtain a scheme with labels consisting of $n/3+o(n)$ bits.

preprint, 2020, arXiv

Existential length universality

We study the following natural variation on the classical universality problem: given a language $L(M)$ represented by $M$ (e.g., a DFA/RE/NFA/PDA), does there exist an integer $\ell \geq 0$ such that $Σ^\ell \subseteq L(M)$? In the case of an NFA, we show that this problem is NEXPTIME-complete, and the smallest such $\ell$ can be doubly exponential in the number of states. This particular case was formulated as an open problem in 2009, and our solution uses a novel and involved construction. In the case of a PDA, we show that it is recursively unsolvable, while the smallest such $\ell$ is not bounded by any computable function of the number of states. In the case of a DFA, we show that the problem is NP-complete, and $e^{\sqrt{n \log n} (1+o(1))}$ is an asymptotically tight upper bound for the smallest such $\ell$, where $n$ is the number of states. Finally, we prove that in all these cases, the problem becomes computationally easier when the length $\ell$ is also given in binary in the input: it is polynomially solvable for a DFA, PSPACE-complete for an NFA, and co-NEXPTIME-complete for a PDA.
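For the DFA case, the question can be illustrated with a brute-force check that tracks the set of states reachable by words of each exact length. This is only a sketch under assumptions: the DFA is complete and given as a dictionary `delta[(state, letter)]`, and the function name is hypothetical. The sequence of state sets can be long before it repeats, so the running time may be exponential, in line with the hardness results above.

```python
def shortest_universal_length(delta, start, accepting, alphabet):
    """Smallest l such that every word of length l is accepted, else None.

    Assumes a complete DFA given as delta[(state, letter)] -> state.
    The set of states reached by all words of length l determines the set
    for l + 1, so the sequence is eventually periodic and we can stop as
    soon as a set repeats.
    """
    accepting = frozenset(accepting)
    current = frozenset([start])
    seen = set()
    length = 0
    while current not in seen:
        if current <= accepting:
            return length  # all words of this length end in accepting states
        seen.add(current)
        current = frozenset(delta[(q, c)] for q in current for c in alphabet)
        length += 1
    return None
```

For example, a three-state cycle over a one-letter alphabet with only the last state accepting yields length 2, whereas a parity automaton over two letters that accepts an odd number of one letter yields `None`.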

preprint, 2020, arXiv

Generalised Pattern Matching Revisited

In the problem of $\texttt{Generalised Pattern Matching}\ (\texttt{GPM})$ [STOC'94, Muthukrishnan and Palem], we are given a text $T$ of length $n$ over an alphabet $Σ_T$, a pattern $P$ of length $m$ over an alphabet $Σ_P$, and a matching relationship $\subseteq Σ_T \times Σ_P$, and must return all substrings of $T$ that match $P$ (reporting) or the number of mismatches between each substring of $T$ of length $m$ and $P$ (counting). In this work, we improve over all previously known algorithms for this problem for various parameters describing the input instance:

* $\mathcal{D}$ being the maximum number of characters that match a fixed character,
* $\mathcal{S}$ being the number of pairs of matching characters,
* $\mathcal{I}$ being the total number of disjoint intervals of characters that match the $m$ characters of the pattern $P$.

At the heart of our new deterministic upper bounds for $\mathcal{D}$ and $\mathcal{S}$ lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate first lower bounds for $\texttt{GPM}$. We start by showing that any deterministic or Monte Carlo algorithm for $\texttt{GPM}$ must use $Ω(\mathcal{S})$ time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.

preprint, 2020, arXiv

Minimum Cut in $O(m\log^2 n)$ Time

We give a randomized algorithm that finds a minimum cut in an undirected weighted $m$-edge $n$-vertex graph $G$ with high probability in $O(m \log^2 n)$ time. This is the first improvement to Karger's celebrated $O(m \log^3 n)$ time algorithm from 1996. Our main technical contribution is a deterministic $O(m \log n)$ time algorithm that, given a spanning tree $T$ of $G$, finds a minimum cut of $G$ that 2-respects (cuts two edges of) $T$.

preprint, 2020, arXiv

On Two Measures of Distance between Fully-Labelled Trees

The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This creates a need for efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Bernardini et al. [CPM 2019] recently introduced a notion of the rearrangement distance for fully-labelled trees motivated by this necessity. This notion originates from two operations: one that permutes the labels of the nodes, the other that affects the topology of the tree. Each operation alone defines a distance that can be computed in polynomial time, while the actual rearrangement distance, which combines the two, was proven to be NP-hard. We answer two open questions left by the previous work. First, what is the complexity of computing the permutation distance? Second, is there a constant-factor approximation algorithm for estimating the rearrangement distance between two arbitrary trees? We answer the first one by showing, via a two-way reduction, that calculating the permutation distance between two trees on $n$ nodes is equivalent, up to polylogarithmic factors, to finding the largest cardinality matching in a sparse bipartite graph. In particular, by plugging in the algorithm of Liu and Sidford [arXiv 2020], we obtain an $O(n^{4/3+o(1)})$ time algorithm for computing the permutation distance between two trees on $n$ nodes. Then we answer the second question positively, and design a linear-time constant-factor approximation algorithm that does not need any assumption on the trees.

preprint, 2020, arXiv

Shorter Labels for Routing in Trees

A routing labeling scheme assigns a binary string, called a label, to each node in a network, and chooses a distinct port number from $\{1,\ldots,d\}$ for every edge outgoing from a node of degree $d$. Then, given the labels of $u$ and $w$ and no other information about the network, it should be possible to determine the port number corresponding to the first edge on the shortest path from $u$ to $w$. In their seminal paper, Thorup and Zwick [SPAA 2001] designed several routing methods for general weighted networks. An important technical ingredient in their paper that according to the authors "may be of independent practical and theoretical interest" is a routing labeling scheme for trees of arbitrary degrees. For a tree on $n$ nodes, their scheme constructs labels consisting of $(1+o(1))\log n$ bits such that the sought port number can be computed in constant time. Looking closer at their construction, the labels consist of $\log n + O(\log n\cdot \log\log\log n / \log\log n)$ bits. Given that the only known lower bound is $\log n+Ω(\log\log n)$, a natural question that has been asked for other labeling problems in trees is to determine the asymptotics of the smaller-order term. We make the first (and significant) progress in 19 years on determining the correct second-order term for the length of a label in a routing labeling scheme for trees on $n$ nodes. We design such a scheme with labels of length $\log n+O((\log\log n)^{2})$. Furthermore, we modify the scheme to allow for computing the port number in constant time at the expense of slightly increasing the length to $\log n+O((\log\log n)^{3})$.

preprint, 2020, arXiv

Voronoi diagrams on planar graphs, and computing the diameter in deterministic $\tilde{O}(n^{5/3})$ time

We present an explicit and efficient construction of additively weighted Voronoi diagrams on planar graphs. Let $G$ be a planar graph with $n$ vertices and $b$ sites that lie on a constant number of faces. We show how to preprocess $G$ in $\tilde O(nb^2)$ time (the $\tilde O$ notation hides polylogarithmic factors) so that one can compute any additively weighted Voronoi diagram for these sites in $\tilde O(b)$ time. We use this construction to compute the diameter of a directed planar graph with real arc lengths in $\tilde{O}(n^{5/3})$ time. This improves the recent breakthrough result of Cabello (SODA'17), both by improving the running time (from $\tilde{O}(n^{11/6})$), and by providing a deterministic algorithm. It is in fact the first truly subquadratic deterministic algorithm for this problem. Our use of Voronoi diagrams to compute the diameter follows that of Cabello, but he used abstract Voronoi diagrams, which makes his diameter algorithm more involved, more expensive, and randomized. As in Cabello's work, our algorithm can compute, for every vertex $v$, both the farthest vertex from $v$ (i.e., the eccentricity of $v$), and the sum of distances from $v$ to all other vertices. Hence, our algorithm can also compute the radius, median, and Wiener index (sum of all pairwise distances) of a planar graph within the same time bounds. Our construction of Voronoi diagrams for planar graphs is of independent interest.

preprint, 2016, arXiv

A note on distance labeling in planar graphs

A distance labeling scheme is an assignment of labels, that is, binary strings, to all nodes of a graph, so that the distance between any two nodes can be computed from their labels and the labels are as short as possible. A major open problem is to determine the complexity of distance labeling in unweighted and undirected planar graphs. It is known that, in such a graph on $n$ nodes, some labels must consist of $Ω(n^{1/3})$ bits, but the best known labeling scheme uses labels of length $O(\sqrt{n}\log n)$ [Gavoille, Peleg, Pérennes, and Raz, J. Algorithms, 2004]. We show that, in fact, labels of length $O(\sqrt{n})$ are enough.

preprint, 2016, arXiv

Faster Longest Common Extension Queries in Strings over General Alphabets

Longest common extension queries (often called longest common prefix queries) constitute a fundamental building block in multiple string algorithms, for example computing runs and approximate pattern matching. We show that a sequence of $q$ LCE queries for a string of size $n$ over a general ordered alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only $O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$ symbol comparisons. Our results improve upon a solution by Kosolobov (Information Processing Letters, 2016), who gave an algorithm with $O(n \log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We make significant progress towards resolving this conjecture. Our techniques extend to the case of general unordered alphabets, when the time increases to $O(q\log n + n\log^*n)$. The main tools are difference covers and the disjoint-sets data structure.

preprint, 2016, arXiv

Improved Bounds for Shortest Paths in Dense Distance Graphs

We study the problem of computing shortest paths in so-called dense distance graphs. Every planar graph $G$ on $n$ vertices can be partitioned into a set of $O(n/r)$ edge-disjoint regions (called an $r$-division) with $O(r)$ vertices each, such that each region has $O(\sqrt{r})$ vertices (called boundary vertices) in common with other regions. A dense distance graph of a region is a complete graph containing all-pairs distances between its boundary nodes. A dense distance graph of an $r$-division is the union of the $O(n/r)$ dense distance graphs of the individual pieces. Since the introduction of dense distance graphs by Fakcharoenphol and Rao, computing single-source shortest paths in dense distance graphs has found numerous applications in fundamental planar graph algorithms. Fakcharoenphol and Rao proposed an algorithm (later called FR-Dijkstra) for computing single-source shortest paths in a dense distance graph in $O\left(\frac{n}{\sqrt{r}}\log{n}\log{r}\right)$ time. We show an $O\left(\frac{n}{\sqrt{r}}\left(\frac{\log^2{r}}{\log^2\log{r}}+\log{n}\log^ε{r}\right)\right)$ time algorithm for this problem, which is the first improvement to date over FR-Dijkstra for the important case when $r$ is polynomial in $n$. In this case, our algorithm is faster by a factor of $O(\log^2{\log{n}})$ and implies improved upper bounds for such planar graph problems as multiple-source multiple-sink maximum flow, single-source all-sinks maximum flow, and (dynamic) exact distance oracles.

preprint, 2016, arXiv

Optimal Dynamic Strings

In this paper we study the fundamental problem of maintaining a dynamic collection of strings under the following operations: concat - concatenates two strings, split - splits a string into two at a given position, compare - finds the lexicographical order (less, equal, greater) between two strings, LCP - calculates the longest common prefix of two strings. We present an efficient data structure for this problem, where an update requires only $O(\log n)$ worst-case time with high probability, with $n$ being the total length of all strings in the collection, and a query takes constant worst-case time. On the lower bound side, we prove that even if the only possible query is checking equality of two strings, either updates or queries take amortized $Ω(\log n)$ time; hence our implementation is optimal. Such operations can be used as a basic building block to solve other string problems. We provide two examples. First, we can augment our data structure to provide pattern matching queries that may locate occurrences of a specified pattern $p$ in the strings in our collection in optimal $O(|p|)$ time, at the expense of increasing update time to $O(\log^2 n)$. Second, we show how to maintain a history of an edited text, processing updates in $O(\log t \log \log t)$ time, where $t$ is the number of edits, and how to support pattern matching queries against the whole history in $O(|p| \log t \log \log t)$ time. Finally, we note that our data structure can be applied to test dynamic tree isomorphism and to compare strings generated by dynamic straight-line grammars.

preprint, 2016, arXiv

Randomized algorithms for finding a majority element

Given $n$ colored balls, we want to detect if more than $\lfloor n/2\rfloor$ of them have the same color, and if so find one ball with such majority color. We are only allowed to choose two balls and compare their colors, and the goal is to minimize the total number of such operations. A well-known exercise is to show how to find such a ball with only $2n$ comparisons while using only a logarithmic number of bits for bookkeeping. The resulting algorithm is called the Boyer--Moore majority vote algorithm. It is known that any deterministic method needs $\lceil 3n/2\rceil-2$ comparisons in the worst case, and this is tight. However, it is not clear what is the required number of comparisons if we allow randomization. We construct a randomized algorithm which always correctly finds a ball of the majority color (or detects that there is none) using, with high probability, only $7n/6+o(n)$ comparisons. We also prove that the expected number of comparisons used by any such randomized method is at least $1.019n$.
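The Boyer--Moore majority vote algorithm mentioned above can be sketched as follows. This is the classical deterministic two-pass method with roughly $2n$ equality comparisons, not the randomized $7n/6+o(n)$ algorithm of the paper; the function name `majority` is a hypothetical choice.

```python
def majority(balls):
    """Return a color held by more than floor(n/2) balls, or None.

    Classical Boyer-Moore voting: colors are only compared for equality, and
    the bookkeeping is one candidate plus one counter.
    """
    candidate, count = None, 0
    for ball in balls:
        if count == 0:
            candidate, count = ball, 1  # adopt a new candidate
        elif ball == candidate:
            count += 1
        else:
            count -= 1  # pair this ball off against the candidate
    if count == 0:
        return None  # a strict majority would have survived the pairing
    # Second pass: verify that the surviving candidate is a real majority.
    occurrences = sum(1 for ball in balls if ball == candidate)
    return candidate if occurrences > len(balls) // 2 else None
```

The second pass is necessary: on input `["a", "b", "c"]` the first pass ends with candidate `"c"`, which is not a majority color.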

preprint, 2016, arXiv

Sparse Suffix Tree Construction in Optimal Time and Space

Suffix trees (and the closely related suffix arrays) are fundamental structures capturing all substrings of a given text essentially by storing all its suffixes in the lexicographical order. In some applications, we work with a subset of $b$ interesting suffixes, which are stored in the so-called sparse suffix tree. Because the size of this structure is $Θ(b)$, it is natural to seek a construction algorithm using only $O(b)$ words of space assuming read-only random access to the text. We design a linear-time Monte Carlo algorithm for this problem, hence resolving an open question explicitly stated by Bille et al. [TALG 2016]. The best previously known algorithm by I et al. [STACS 2014] works in $O(n\log b)$ time. Our solution proceeds in $n/b$ rounds; in the $r$-th round, we consider all suffixes starting at positions congruent to $r$ modulo $n/b$. By maintaining rolling hashes, we lexicographically sort all interesting suffixes starting at such positions, and then we merge them with the already considered suffixes. For efficient merging, we also need to answer LCE queries in small space. By plugging in the structure of Bille et al. [CPM 2015] we obtain $O(n+b\log b)$ time complexity. We improve this structure, which implies a linear-time sparse suffix tree construction algorithm. We complement our Monte Carlo algorithm with a deterministic verification procedure. The verification takes $O(n\sqrt{\log b})$ time, which improves upon the bound of $O(n\log b)$ obtained by I et al. [STACS 2014]. This is obtained by first observing that the pruning done inside the previous solution has a rather clean description using the notion of graph spanners with small multiplicative stretch. Then, we are able to decrease the verification time by applying difference covers twice. Combined with the Monte Carlo algorithm, this gives us an $O(n\sqrt{\log b})$-time and $O(b)$-space Las Vegas algorithm.

preprint, 2016, arXiv

Sublinear-Space Distance Labeling using Hubs

A distance labeling scheme is an assignment of bit-labels to the vertices of an undirected, unweighted graph such that the distance between any pair of vertices can be decoded solely from their labels. We propose a series of new labeling schemes within the framework of so-called hub labeling (HL, also known as landmark labeling or 2-hop-cover labeling), in which each node $u$ stores its distance to all nodes from an appropriately chosen set of hubs $S(u) \subseteq V$. For a queried pair of nodes $(u,v)$, the length of a shortest $u-v$-path passing through a hub node from $S(u)\cap S(v)$ is then used as an upper bound on the distance between $u$ and $v$. We present a hub labeling which allows us to decode exact distances in sparse graphs using labels of size sublinear in the number of nodes. For graphs with at most $n$ nodes and average degree $Δ$, the tradeoff between label bit size $L$ and query decoding time $T$ for our approach is given by $L = O(n \log \log_ΔT / \log_ΔT)$, for any $T \leq n$. Our simple approach is thus the first sublinear-space distance labeling for sparse graphs that simultaneously admits small decoding time (for constant $Δ$, we can achieve any $T=ω(1)$ while maintaining $L=o(n)$), and it also provides an improvement in terms of label size with respect to previous slower approaches. By using similar techniques, we then present a $2$-additive labeling scheme for general graphs, i.e., one in which the decoder provides a 2-additive-approximation of the distance between any pair of nodes. We achieve almost the same label size-time tradeoff $L = O(n \log^2 \log T / \log T)$, for any $T \leq n$. To our knowledge, this is the first additive scheme with constant absolute error to use labels of sublinear size. The corresponding decoding time is then small (any $T=ω(1)$ is sufficient).
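The decoding step of hub labeling described above can be sketched in a few lines. The representation of a label as a dictionary from hub to distance and the name `decode_distance` are assumptions for illustration; how the hub sets $S(u)$ are chosen, which is the actual contribution of the paper, is not shown.

```python
def decode_distance(label_u, label_v):
    """Upper bound on dist(u, v) from two hub labels.

    Each label maps a hub h in S(u) to dist(u, h). The result is the length
    of the shortest u-v path through a common hub, or infinity if the two
    hub sets are disjoint. It equals the true distance whenever some common
    hub lies on a shortest u-v path.
    """
    common = label_u.keys() & label_v.keys()
    return min((label_u[h] + label_v[h] for h in common), default=float("inf"))
```

For instance, on the path a-b-c-d with labels `{"a": 0, "b": 1}` for a and `{"b": 2, "d": 0}` for d, the only common hub is b and the decoded value is 3, which is the exact distance.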

preprint, 2016, arXiv

Tight tradeoffs for approximating palindromes in streams

We consider computing the longest palindrome in a text of length $n$ in the streaming model, where the characters arrive one-by-one, and we do not have random access to the input. While computing the answer exactly using sublinear memory is not possible in such a setting, one can still hope for a good approximation guarantee. We focus on the two most natural variants, where we aim for either additive or multiplicative approximation of the length of the longest palindrome. We first show that there is no point in considering Las Vegas algorithms in such a setting, as they cannot achieve sublinear space complexity. For Monte Carlo algorithms, we provide a lower bound of $Ω(\frac{n}{E})$ bits for approximating the answer with additive error $E$, and $Ω(\frac{\log n}{\log(1+\varepsilon)})$ bits for approximating the answer with multiplicative error $(1+\varepsilon)$ for the binary alphabet. Then, we construct a generic Monte Carlo algorithm, which by choosing the parameters appropriately achieves space complexity matching up to a logarithmic factor for both variants. This substantially improves the previous results by Berenbrink et al. (STACS 2014) and essentially settles the space complexity.
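
To give a feel for why Monte Carlo algorithms can work in small space here, the following Python sketch keeps a constant number of machine words and reports, after each arriving character, whether the prefix read so far is a palindrome. It is a much simpler task than the one studied in the abstract and is not the paper's algorithm; the class name `PalindromicPrefixDetector` and the fixed `BASE` are assumptions made for this example (a real Monte Carlo algorithm would pick the base at random, and equal fingerprints imply a palindrome only with high probability).

```python
MOD = (1 << 61) - 1   # large prime modulus
BASE = 1_000_003      # hypothetical fixed base; normally chosen at random


class PalindromicPrefixDetector:
    """Streams characters and compares fingerprints of the prefix and its reverse."""

    def __init__(self):
        self.forward = 0    # fingerprint of the prefix read so far
        self.backward = 0   # fingerprint of the reversed prefix
        self.power = 1      # BASE ** (number of characters read) mod MOD

    def push(self, ch):
        """Consume one character; return True if the prefix looks like a palindrome."""
        code = ord(ch)
        # Appending ch shifts the forward fingerprint by one position.
        self.forward = (self.forward * BASE + code) % MOD
        # In the reversed prefix, ch becomes the leading character.
        self.backward = (self.backward + code * self.power) % MOD
        self.power = (self.power * BASE) % MOD
        return self.forward == self.backward


detector = PalindromicPrefixDetector()
flags = [detector.push(ch) for ch in "abacaba"]
```

On the stream `abacaba`, the detector reports a palindromic prefix after the 1st, 3rd and 7th characters (`a`, `aba`, `abacaba`).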

preprint2016arXiv

Tight Tradeoffs for Real-Time Approximation of Longest Palindromes in Streams

We consider computing a longest palindrome in the streaming model, where the symbols arrive one-by-one and we do not have random access to the input. While computing the answer exactly using sublinear space is not possible in such a setting, one can still hope for a good approximation guarantee. Our contribution is twofold. First, we provide lower bounds on the space requirements for randomized approximation algorithms processing inputs of length $n$. We rule out Las Vegas algorithms, as they cannot achieve sublinear space complexity. For Monte Carlo algorithms, we prove a lower bound of $Ω( M \log\min\{|Σ|,M\})$ bits of memory; here $M=n/E$ for approximating the answer with additive error $E$, and $M= \frac{\log n}{\log (1+\varepsilon)}$ for approximating the answer with multiplicative error $(1 + \varepsilon)$. Second, we design three real-time algorithms for this problem. Our Monte Carlo approximation algorithms for both additive and multiplicative versions of the problem use $O(M)$ words of memory. Thus the obtained lower bounds are asymptotically tight up to a logarithmic factor. The third algorithm is deterministic and finds a longest palindrome exactly if it is short. This algorithm can be run in parallel with a Monte Carlo algorithm to obtain better results in practice. Overall, both the time and space complexity of finding a longest palindrome in a stream are essentially settled.

preprint2015arXiv

Approximating LZ77 via Small-Space Multiple-Pattern Matching

We generalize Karp-Rabin string matching to handle multiple patterns in $\mathcal{O}(n \log n + m)$ time and $\mathcal{O}(s)$ space, where $n$ is the length of the text and $m$ is the total length of the $s$ patterns, returning correct answers with high probability. As a prime application of our algorithm, we show how to approximate the LZ77 parse of a string of length $n$. If the optimal parse consists of $z$ phrases, using only $\mathcal{O}(z)$ working space we can return a parse consisting of at most $(1+\varepsilon)z$ phrases in $\mathcal{O}(\varepsilon^{-1}n\log n)$ time, for any $\varepsilon\in (0,1]$. As previous quasilinear-time algorithms for LZ77 use $Ω(n/\textrm{polylog }n)$ space, but $z$ can be exponentially small in $n$, these improvements in space are substantial.
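
For orientation, the classical Karp-Rabin idea that the abstract generalizes can be sketched in Python as follows. This is the textbook variant, not the algorithm of the paper: it runs one rolling-hash pass per distinct pattern length, it keeps the patterns themselves in memory (so it does not achieve the $\mathcal{O}(s)$ space bound), and it verifies every fingerprint hit explicitly. The names `fingerprint` and `find_patterns` and the fixed `BASE` are assumptions made for this example.

```python
MOD = (1 << 61) - 1   # large prime modulus
BASE = 257            # hypothetical fixed base; normally chosen at random


def fingerprint(s):
    """Karp-Rabin fingerprint of the whole string s."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def find_patterns(text, patterns):
    """Return the sorted list of (position, pattern) pairs for all occurrences."""
    # Group non-empty patterns by length, then by fingerprint.
    by_length = {}
    for p in patterns:
        if p:
            by_length.setdefault(len(p), {}).setdefault(fingerprint(p), set()).add(p)

    hits = []
    for m, table in by_length.items():
        if m > len(text):
            continue
        top = pow(BASE, m - 1, MOD)   # weight of the leading character of a window
        h = fingerprint(text[:m])
        for start in range(len(text) - m + 1):
            if h in table:
                window = text[start:start + m]
                if window in table[h]:   # explicit check rules out false positives
                    hits.append((start, window))
            if start + m < len(text):
                # Slide the window: drop text[start], append text[start + m].
                h = ((h - ord(text[start]) * top) * BASE + ord(text[start + m])) % MOD
    return sorted(hits)
```

For instance, `find_patterns("abracadabra", ["abra", "cad"])` returns `[(0, "abra"), (4, "cad"), (7, "abra")]`.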

preprint2015arXiv

Efficiently Finding All Maximal $α$-gapped Repeats

For $α\geq 1$, an $α$-gapped repeat in a word $w$ is a factor $uvu$ of $w$ such that $|uv|\leq α|u|$; the two factors $u$ in such a repeat are called arms, while the factor $v$ is called gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In this paper we show that the number of maximal $α$-gapped repeats that may occur in a word is upper bounded by $18αn$. This allows us to construct an algorithm finding all the maximal $α$-gapped repeats of a word in $O(αn)$ time; this is optimal, in the worst case, as there are words that have $Θ(αn)$ maximal $α$-gapped repeats. Our techniques can be extended to get comparable results in the case of $α$-gapped palindromes, i.e., factors $uvu^\mathrm{T}$ with $|uv|\leq α|u|$.
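
The definition above can be made concrete with a brute-force enumerator in Python. It runs in cubic time in the worst case and is meant only to illustrate the definition, not the $O(αn)$ algorithm of the paper. The function name `maximal_gapped_repeats`, the triple representation (left arm start, right arm start, arm length), and the boundary conventions (arms must not overlap, an empty gap is allowed, and an arm at the edge of the word counts as non-extendable) are assumptions made for this example.

```python
def maximal_gapped_repeats(w, alpha):
    """List (i, j, arm) with w[i:i+arm] == w[j:j+arm], i + arm <= j,
    j - i <= alpha * arm, and no simultaneous extension to the left or right."""
    n = len(w)
    repeats = []
    for i in range(n):
        for j in range(i + 1, n):
            # Both arms could be extended to the left by the same symbol: not maximal.
            if i > 0 and w[i - 1] == w[j - 1]:
                continue
            # Extend both arms to the right as far as the symbols agree.
            arm = 0
            while j + arm < n and w[i + arm] == w[j + arm]:
                arm += 1
            # No match at all, or the fully extended arms would overlap.
            if arm == 0 or arm > j - i:
                continue
            # |uv| = j - i must be at most alpha * |u|.
            if j - i <= alpha * arm:
                repeats.append((i, j, arm))
    return repeats
```

For the word `abcab`, the only candidate is the repeat with arms `ab` and gap `c`, where $|uv| = 3$ and $|u| = 2$: it is reported for $α = 3$ (indeed for any $α \geq 1.5$) and rejected for $α = 1$.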

preprint2015arXiv

Wavelet Trees Meet Suffix Trees

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length $n$ over an alphabet of size $σ\leq n$, our method builds the wavelet tree in $O(n \log σ/ \sqrt{\log{n}})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of $n$ integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string $w$ of length $n$, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of $w$ while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings $x$ and $y$ of $w$ count the number of suffixes of $x$ that are lexicographically smaller than $y$, and for a substring $x$ of $w$ and an integer $k$, find the $k$-th lexicographically smallest suffix of $x$. We further show that wavelet suffix trees allow us to compute a run-length-encoded Burrows-Wheeler transform of a substring $x$ of $w$ in $O(s \log |x|)$ time, where $s$ denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.
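
For readers unfamiliar with the structure, here is a minimal pointer-based wavelet tree over integers in Python. It uses the straightforward $O(n \log σ)$ construction with plain prefix-count arrays, not the faster bit-parallel construction from the abstract; the class name `WaveletTree` and the methods `rank` and `kth_smallest` are assumptions made for this example, and the input is assumed to be non-empty.

```python
class WaveletTree:
    """Wavelet tree over a non-empty list of integers, split by value ranges."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # prefix_left[i] = how many of the first i elements go to the left child.
        self.prefix_left = [0]
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        for x in seq:
            self.prefix_left.append(self.prefix_left[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of value among the first i elements."""
        if value < self.lo or value > self.hi or i == 0:
            return 0
        if self.lo == self.hi:
            return i
        mid = (self.lo + self.hi) // 2
        left_count = self.prefix_left[i]
        if value <= mid:
            return self.left.rank(value, left_count)
        return self.right.rank(value, i - left_count)

    def kth_smallest(self, l, r, k):
        """k-th smallest element (1-based) of seq[l:r]; requires 1 <= k <= r - l."""
        if self.lo == self.hi:
            return self.lo
        in_left = self.prefix_left[r] - self.prefix_left[l]
        if k <= in_left:
            return self.left.kth_smallest(self.prefix_left[l], self.prefix_left[r], k)
        return self.right.kth_smallest(
            l - self.prefix_left[l], r - self.prefix_left[r], k - in_left
        )


tree = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
```

Here `tree.rank(1, 4)` counts the two occurrences of 1 among the first four elements, and `tree.kth_smallest(2, 6, 2)` returns 4, the second smallest element of the subrange `[4, 1, 5, 9]`.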

preprint2014arXiv

Queries on LZ-Bounded Encodings

We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable manner and is both small and fast in practice compared to other data structures supporting such queries.

preprint2013arXiv

Heaviest Induced Ancestors and Longest Common Substrings

Suppose we have two trees on the same set of leaves, in which nodes are weighted such that children are heavier than their parents. We say a node from the first tree and a node from the second tree are induced together if they have a common leaf descendant. In this paper we describe data structures that efficiently support the following heaviest-induced-ancestor query: given a node from the first tree and a node from the second tree, find an induced pair of their ancestors with maximum combined weight. Our solutions are based on a geometric interpretation that enables us to find heaviest induced ancestors using range queries. We then show how to use these results to build an LZ-compressed index with which we can quickly find with high probability a longest substring common to the indexed string and a given pattern.

preprint2013arXiv

Substring Suffix Selection

We study the following substring suffix selection problem: given a substring of a string T of length n, compute its k-th lexicographically smallest suffix. This is a natural generalization of the well-known question of computing the maximal suffix of a string, which is a basic ingredient in many other problems. We first revisit two special cases of the problem, introduced by Babenko, Kolesnichenko and Starikovskaya [CPM'13], in which we are asked to compute the minimal non-empty and the maximal suffixes of a substring. For the maximal suffixes problem, we give a linear-space structure with O(1) query time and linear preprocessing time, i.e., we manage to achieve optimal construction and optimal query time simultaneously. For the minimal suffix problem, we give a linear-space data structure with O(τ) query time and O(n log n / τ) preprocessing time, where 1 ≤ τ ≤ log n is a parameter of the data structure. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of T in O(k τ) time, where k is the number of distinct factors in the decomposition. Finally, we move to the general case of the substring suffix selection problem, where using any combinatorial properties seems more difficult. Nevertheless, we develop a linear-space data structure with O(log^{2+ε} n) query time.
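
To pin down the problem statement, the Python sketch below gives a naive baseline for substring suffix selection together with Duval's standard algorithm for the Lyndon decomposition. Neither is the data structure from the abstract: the baseline sorts all suffixes of the substring on every query, and Duval's algorithm processes the whole input string. The function names `kth_suffix_naive` and `lyndon_factorization` are assumptions made for this example.

```python
def kth_suffix_naive(text, i, j, k):
    """Start position in text of the k-th (1-based) smallest suffix of text[i:j]."""
    starts = sorted(range(i, j), key=lambda p: text[p:j])
    return starts[k - 1]


def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Scan while s[i:j] stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Output the repeated Lyndon word of length j - k as often as it fits.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

On `banana`, `lyndon_factorization` yields `b`, `an`, `an`, `a`, and the last factor `a` coincides with the minimal non-empty suffix found by `kth_suffix_naive("banana", 0, 6, 1)`, which returns position 5.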

preprint2012arXiv

A Faster Grammar-Based Self-Index

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $\mathrm{occ}$ occurrences of $P$ in $S$ in $O(m^2 + \mathrm{occ} \log \log n)$ time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^2)$ term to $O(m \log m)$. All previous self-indexes are larger or slower in the worst case.

preprint2012arXiv

Faster Approximate Pattern Matching in Compressed Repetitive Texts

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an $O(r)$-word data structure that allows us to extract any substring of length $m$ in $O(\log n + m)$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance $k \leq m$, their data structure supports finding all $\mathrm{occ}$ approximate matches to $p$ in $s$ in $O(r (\min (m k, k^4 + m) + \log n) + \mathrm{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building straight-line programs with $O(z \log n)$ rules. In this paper we give a simple $O(z \log n)$-word data structure that takes the same time for substring extraction but only $O(z \min (m k, k^4 + m) + \mathrm{occ})$ time for approximate pattern matching.

preprint2012arXiv

Linear-Space Substring Range Counting over Polylogarithmic Alphabets

Bille and Gørtz (2011) recently introduced the problem of substring range counting, for which we are asked to store compactly a string $S$ of $n$ characters with integer labels in $[0, u]$, such that later, given an interval $[a, b]$ and a pattern $P$ of length $m$, we can quickly count the occurrences of $P$ whose first characters' labels are in $[a, b]$. They showed how to store $S$ in $O(n \log n / \log \log n)$ space and answer queries in $O(m + \log \log u)$ time. We show that, if $S$ is over an alphabet of size $\mathrm{polylog}(n)$, then we can achieve optimal linear space. Moreover, if $u = n\,\mathrm{polylog}(n)$, then we can also reduce the time to $O(m)$. Our results give linear space and time bounds for position-restricted substring counting and the counting versions of indexing substrings with intervals, indexing substrings with gaps and aligned pattern matching.

preprint2011arXiv

On minimising automata with errors

The problem of k-minimisation for a DFA M is the computation of a smallest DFA N (where the size |M| of a DFA M is the size of the domain of the transition function) such that their recognized languages differ only on words of length less than k. The previously best algorithm, which runs in time O(|M| log^2 n) where n is the number of states, is extended to DFAs with partial transition functions. Moreover, a faster O(|M| log n) algorithm for DFAs that recognise finite languages is presented. In comparison to the previous algorithm for total DFAs, the new algorithm is much simpler and allows the calculation of a k-minimal DFA for each k in parallel. Secondly, it is demonstrated that calculating the least number of introduced errors is hard: Given a DFA M and numbers k and m, it is NP-hard to decide whether there exists a k-minimal DFA N differing from DFA M on at most m words. A similar result holds for hyper-minimisation of DFAs in general: Given a DFA M and numbers s and m, it is NP-hard to decide whether there exists a DFA N with at most s states such that DFA M and N differ on at most m words.

35 published item(s)

Cut query algorithms with star contraction

Matching Patterns with Variables Under Edit Distance

The Dynamic k-Mismatch Problem

An Almost Optimal Edit Distance Oracle

Conditional Lower Bounds for Variants of Dynamic LIS

Fault-Tolerant Distance Labeling for Planar Graphs

Strictly In-Place Algorithms for Permuting and Inverting Permutations

A Faster Subquadratic Algorithm for the Longest Common Increasing Subsequence Problem

A Note on a Recent Algorithm for Minimum Cut

Efficient Labeling for Reachability in Digraphs

Existential length universality

Generalised Pattern Matching Revisited

Minimum Cut in $O(m\log^2 n)$ Time

On Two Measures of Distance between Fully-Labelled Trees

Shorter Labels for Routing in Trees

Voronoi diagrams on planar graphs, and computing the diameter in deterministic $\tilde{O}(n^{5/3})$ time

A note on distance labeling in planar graphs

Faster Longest Common Extension Queries in Strings over General Alphabets

Improved Bounds for Shortest Paths in Dense Distance Graphs

Optimal Dynamic Strings

Randomized algorithms for finding a majority element

Sparse Suffix Tree Construction in Optimal Time and Space

Sublinear-Space Distance Labeling using Hubs

Tight tradeoffs for approximating palindromes in streams

Tight Tradeoffs for Real-Time Approximation of Longest Palindromes in Streams

Approximating LZ77 via Small-Space Multiple-Pattern Matching

Efficiently Finding All Maximal $α$-gapped Repeats

Wavelet Trees Meet Suffix Trees

Queries on LZ-Bounded Encodings

Heaviest Induced Ancestors and Longest Common Substrings

Substring Suffix Selection

A Faster Grammar-Based Self-Index

Faster Approximate Pattern Matching in Compressed Repetitive Texts

Linear-Space Substring Range Counting over Polylogarithmic Alphabets

On minimising automata with errors