Source author record

Benjamin Sach

Benjamin Sach appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Computational Complexity

Catalog footprint

What is connected

14works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2015arXiv

Dictionary matching in a stream

We consider the problem of dictionary matching in a stream. Given a set of strings, known as a dictionary, and a stream of characters arriving one at a time, the task is to report each time some string in our dictionary occurs in the stream. We present a randomised algorithm which takes O(log log(k + m)) time per arriving character and uses O(k log m) words of space, where k is the number of strings in the dictionary and m is the length of the longest string in the dictionary.

preprint2015arXiv

The complexity of computation in bit streams

We revisit the complexity of online computation in the cell probe model. We consider a class of problems where we are first given a fixed pattern or vector $F$ of $n$ symbols and then one symbol arrives at a time in a stream. After each symbol has arrived we must output some function of $F$ and the $n$-length suffix of the arriving stream. Cell probe bounds of $Ω(δ\lg{n}/w)$ have previously been shown for both convolution and Hamming distance in this setting, where $δ$ is the size of a symbol in bits and $w\inΩ(\lg{n})$ is the cell size in bits. However, when $δ$ is a constant, as it is in many natural situations, these previous results no longer give us non-trivial bounds. We introduce a new lop-sided information transfer proof technique which enables us to prove meaningful lower bounds even for constant size input alphabets. We use our new framework to prove an amortised cell probe lower bound of $Ω(\lg^2 n/(w\cdot \lg \lg n))$ time per arriving bit for an online version of a well studied problem known as pattern matching with address errors. This is the first non-trivial cell probe lower bound for any online problem on bit streams that still holds when the cell sizes are large. We also show the same bound for online convolution conditioned on a new combinatorial conjecture related to Toeplitz matrices.

preprint2015arXiv

The k-mismatch problem revisited

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2\log{k} / m+n \text{polylog} m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2\text{polylog} {m})$ space in total. 3) Next we give a randomised $(1+ε)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2\text{polylog} m / ε^2)$ space and runs in $O(\text{polylog} m / ε^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2\text{polylog} {m})$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k}\log{k} + \text{polylog} {m})$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).

preprint2014arXiv

Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems

We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $δ$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model. * We first give an $Ω((δ\log n)/(w+\log\log n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit. * We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $Ω(\sqrt{\log n}/(\log\log n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution. * Finally, for the online edit distance problem we show that there is an $O((\log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.

preprint2014arXiv

Time Bounds for Streaming Problems

We give tight cell-probe bounds for the time to compute convolution, multiplication and Hamming distance in a stream. The cell probe model is a particularly strong computational model and subsumes, for example, the popular word RAM model. We first consider online convolution where the task is to output the inner product between a fixed $n$-dimensional vector and a vector of the $n$ most recent values from a stream. One symbol of the stream arrives at a time and the each output must be computed before the next symbols arrives. Next we show bounds for online multiplication where the stream consists of pairs of digits, one from each of two $n$ digit numbers that are to be multiplied. One pair arrives at a time and the task is to output a single new digit from the product before the next pair of digits arrives. Finally we look at the online Hamming distance problem where the Hamming distance is outputted instead of the inner product. For each of these three problems, we give a lower bound of $Ω(\fracδ{w}\log n)$ time on average per output, where $δ$ is the number of bits needed to represent an input symbol and $w$ is the cell or word size. We argue that these bound are in fact tight within the cell probe model.

preprint2013arXiv

Fingerprints in Compressed Strings

The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string $S$ of size $N$ compressed by a context-free grammar of size $n$ that answers fingerprint queries. That is, given indices $i$ and $j$, the answer to a query is the fingerprint of the substring $S[i,j]$. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get $O(\log N)$ query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get $O(\log \log N)$ query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time $O(\log N \log \lce)$ and $O(\log \lce \log\log \lce + \log\log N)$ for SLPs and Linear SLPs, respectively. Here, $\lce$ denotes the length of the LCE.

preprint2013arXiv

Time-Space Trade-Offs for Longest Common Extensions

We revisit the longest common extension (LCE) problem, that is, preprocess a string $T$ into a compact data structure that supports fast LCE queries. An LCE query takes a pair $(i,j)$ of indices in $T$ and returns the length of the longest common prefix of the suffixes of $T$ starting at positions $i$ and $j$. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let $n$ be the length of $T$. Given a parameter $τ$, $1 \leq τ\leq n$, we show how to achieve either $O(\infrac{n}{\sqrtτ})$ space and $O(τ)$ query time, or $O(\infrac{n}τ)$ space and $O(τ\log({|\LCE(i,j)|}/τ))$ query time, where $|\LCE(i,j)|$ denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when $τ=1$ or $τ=n$. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.

preprint2012arXiv

Parameterized Matching in the Streaming Model

We study the problem of parameterized matching in a stream where we want to output matches between a pattern of length m and the last m symbols of the stream before the next symbol arrives. Parameterized matching is a natural generalisation of exact matching where an arbitrary one-to-one relabelling of pattern symbols is allowed. We show how this problem can be solved in constant time per arriving stream symbol and sublinear, near optimal space with high probability. Our results are surprising and important: it has been shown that almost no streaming pattern matching problems can be solved (not even randomised) in less than Theta(m) space, with exact matching as the only known problem to have a sublinear, near optimal space solution. Here we demonstrate that a similar sublinear, near optimal space solution is achievable for an even more challenging problem. The proof is considerably more complex than that for exact matching.

preprint2012arXiv

Pattern Matching in Multiple Streams

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.

preprint2012arXiv

Sparse Suffix Tree Construction with Small Space

We consider the problem of constructing a sparse suffix tree (or suffix array) for $b$ suffixes of a given text $T$ of size $n$, using only $O(b)$ words of space during construction time. Breaking the naive bound of $Ω(nb)$ time for this problem has occupied many algorithmic researchers since a different structure, the (evenly spaced) sparse suffix tree, was introduced by K{ä}rkk{ä}inen and Ukkonen in 1996. While in the evenly spaced sparse suffix tree the suffixes considered must be evenly spaced in $T$, here there is no constraint on the locations of the suffixes. We show that the sparse suffix tree can be constructed in $O(n\log^2b)$ time. To achieve this we develop a technique, which may be of independent interest, that allows to efficiently answer $b$ longest common prefix queries on suffixes of $T$, using only $O(b)$ space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Furthermore, additional tradeoffs between the space usage and the construction time are given.

preprint2012arXiv

Tight Cell-Probe Bounds for Online Hamming Distance Computation

We show tight bounds for online Hamming distance computation in the cell-probe model with word size w. The task is to output the Hamming distance between a fixed string of length n and the last n symbols of a stream. We give a lower bound of Omega((d/w)*log n) time on average per output, where d is the number of bits needed to represent an input symbol. We argue that this bound is tight within the model. The lower bound holds under randomisation and amortisation.

preprint2011arXiv

Pattern Matching under Polynomial Transformation

We consider a class of pattern matching problems where a normalising transformation is applied at every alignment. Normalised pattern matching plays a key role in fields as diverse as image processing and musical information processing where application specific transformations are often applied to the input. By considering the class of polynomial transformations of the input, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any epsilon>0, there cannot exist an O(n m^(1-epsilon)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that need be reported. We give a deterministic O(nk log k) time solution which we then improve by careful use of randomisation to O(n sqrt(k log k) log n) time for sufficiently small k. Our randomised solution outputs the correct answer at every position with high probability.

preprint2011arXiv

Space Lower Bounds for Online Pattern Matching

We present space lower bounds for online pattern matching under a number of different distance measures. Given a pattern of length m and a text that arrives one character at a time, the online pattern matching problem is to report the distance between the pattern and a sliding window of the text as soon as the new character arrives. We require that the correct answer is given at each position with constant probability. We give Omega(m) bit space lower bounds for L_1, L_2, L_\infty, Hamming, edit and swap distances as well as for any algorithm that computes the cross-correlation/convolution. We then show a dichotomy between distance functions that have wildcard-like properties and those that do not. In the former case which includes, as an example, pattern matching with character classes, we give Omega(m) bit space lower bounds. For other distance functions, we show that there exist space bounds of Omega(log m) and O(log^2 m) bits. Finally we discuss space lower bounds for non-binary inputs and show how in some cases they can be improved.

preprint2011arXiv

The Complexity of Flood Filling Games

We study the complexity of the popular one player combinatorial game known as Flood-It. In this game the player is given an n by n board of tiles where each tile is allocated one of c colours. The goal is to make the colours of all tiles equal via the shortest possible sequence of flooding operations. In the standard version, a flooding operation consists of the player choosing a colour k, which then changes the colour of all the tiles in the monochromatic region connected to the top left tile to k. After this operation has been performed, neighbouring regions which are already of the chosen colour k will then also become connected, thereby extending the monochromatic region of the board. We show that finding the minimum number of flooding operations is NP-hard for c>=3 and that this even holds when the player can perform flooding operations from any position on the board. However, we show that this "free" variant is in P for c=2. We also prove that for an unbounded number of colours, Flood-It remains NP-hard for boards of height at least 3, but is in P for boards of height 2. Next we show how a c-1 approximation and a randomised 2c/3 approximation algorithm can be derived, and that no polynomial time constant factor, independent of c, approximation algorithm exists unless P=NP. We then investigate how many moves are required for the "most demanding" n by n boards (those requiring the most moves) and show that the number grows as fast as Theta(n*c^0.5). Finally, we consider boards where the colours of the tiles are chosen at random and show that for c>=2, the number of moves required to flood the whole board is Omega(n) with high probability.

Benjamin Sach

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Dictionary matching in a stream

The complexity of computation in bit streams

The k-mismatch problem revisited

Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems

Time Bounds for Streaming Problems

Fingerprints in Compressed Strings

Time-Space Trade-Offs for Longest Common Extensions

Parameterized Matching in the Streaming Model

Pattern Matching in Multiple Streams

Sparse Suffix Tree Construction with Small Space

Tight Cell-Probe Bounds for Online Hamming Distance Computation

Pattern Matching under Polynomial Transformation

Space Lower Bounds for Online Pattern Matching

The Complexity of Flood Filling Games