Source author record

Tatiana Starikovskaya

Tatiana Starikovskaya appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Data Structures and Algorithms

Catalog footprint

What is connected

15 works

1 topic

4 close collaborators


preprint, 2022, arXiv

An Improved Algorithm for The $k$-Dyck Edit Distance Problem

A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ and a positive integer $k$, and the goal is to compute the Dyck edit distance of $S$ only if the distance is at most $k$, and otherwise report that the distance is larger than $k$. Backurs and Onak [PODS'16] showed that the threshold Dyck edit distance problem can be solved in $O(n+k^{16})$ time. In this work, we design new algorithms for the threshold Dyck edit distance problem that run in $O(n+k^{4.544184})$ time with high probability or $O(n+k^{4.853059})$ time deterministically. Our algorithms combine several new structural properties of the Dyck edit distance problem, a refined algorithm for fast $(\min,+)$ matrix product, and a careful modification of ideas used in Valiant's parsing algorithm.
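To make the problem statement concrete, here is a minimal sketch of the textbook cubic-time interval dynamic program for the Dyck edit distance, wrapped in the threshold interface described above. It is a baseline for illustration only, not the algorithm of the paper. The bracket alphabet `PAIRS` and the function names are assumptions, and the input is assumed to contain only those bracket characters.

```python
from functools import lru_cache

# Hypothetical bracket alphabet; any set of opener/closer pairs would do.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def pair_cost(a: str, b: str) -> int:
    """Substitutions needed so that a (left) and b (right) form a matching pair."""
    if a in PAIRS:
        return 0 if PAIRS[a] == b else 1  # keep opener a, rewrite b if needed
    if b in CLOSERS:
        return 1                          # keep closer b, rewrite a as its opener
    return 2                              # a is a closer and b is an opener


def dyck_edit_distance(s: str) -> int:
    """Cubic-time baseline; recursion depth grows with len(s), so keep inputs short."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Minimum number of edits turning s[i:j] into a Dyck sequence.
        if i >= j:
            return 0
        # Delete s[i]; inserting a partner for s[i] has the same cost.
        best = 1 + solve(i + 1, j)
        # Or pair s[i] with some later symbol s[k].
        for k in range(i + 1, j):
            best = min(best, pair_cost(s[i], s[k]) + solve(i + 1, k) + solve(k + 1, j))
        return best

    return solve(0, len(s))


def threshold_dyck(s: str, k: int):
    """Return the distance if it is at most k, otherwise None."""
    d = dyck_edit_distance(s)
    return d if d <= k else None
```

For example, `threshold_dyck("(]", 1)` returns the distance, while `threshold_dyck("((((", 1)` returns `None` because two substitutions are needed.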

preprint, 2022, arXiv

Pattern matching under DTW distance

In this work, we consider the problem of pattern matching under the dynamic time warping (DTW) distance, motivated by potential applications in the analysis of biological data produced by third-generation sequencing. To measure the DTW distance between two strings, one must "warp" them, that is, double some letters in the strings to obtain two equal-length strings, and then sum the distances between the letters in the corresponding positions. When the distances between letters are integers, we show that for a pattern P with m runs and a text T with n runs: 1. There is an O(m + n)-time algorithm that computes all locations where the DTW distance from P to T is at most 1; 2. There is an O(kmn)-time algorithm that computes all locations where the DTW distance from P to T is at most k. As a corollary of the second result, we also derive an approximation algorithm for general metrics on the alphabet.
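The warping definition can be illustrated with the classic quadratic dynamic program over letters. This is a sketch of the standard recurrence, not the run-based algorithms of the paper. The function names, the default letter distance `abs(a - b)`, and the convention of reporting 0-based end positions are assumptions.

```python
INF = float("inf")


def _dtw_last_row(p, t, dist, free_start):
    # prev[j] holds the DTW cost of the pattern prefix processed so far against
    # a piece of t ending at position j. With free_start the piece may begin
    # anywhere in t; otherwise it must begin at t[0].
    prev = [0 if (free_start or j == 0) else INF for j in range(len(t) + 1)]
    for a in p:
        cur = [INF] * (len(t) + 1)
        for j, b in enumerate(t, start=1):
            # Repeat a pattern letter, repeat a text letter, or advance both.
            cur[j] = dist(a, b) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev


def dtw_distance(p, t, dist=lambda a, b: abs(a - b)):
    """DTW distance between the whole of p and the whole of t."""
    return _dtw_last_row(p, t, dist, free_start=False)[len(t)]


def dtw_match_ends(p, t, k, dist=lambda a, b: abs(a - b)):
    """0-based end positions j such that some substring of t ending at j
    is within DTW distance k of p."""
    row = _dtw_last_row(p, t, dist, free_start=True)
    return [j - 1 for j in range(1, len(t) + 1) if row[j] <= k]
```

With integer letters, `dtw_distance([1, 2, 2, 3], [1, 2, 3])` is 0 because doubling the letter 2 in the second string makes the two equal.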

preprint, 2020, arXiv

Approximating longest common substring with $k$ mismatches: Theory and practice

In the problem of the longest common substring with $k$ mismatches we are given two strings $X, Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X, Y$. In this work, we develop new approximation algorithms for computing $\ell$ that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.
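As a point of reference for the definition of $\ell$, the following sketch computes it exactly in quadratic time by sliding a window along every diagonal of the alignment grid. It is a simple exact baseline, not the approximation algorithms of the paper, and the function name is an assumption.

```python
def lcs_k_mismatches(x: str, y: str, k: int) -> int:
    """Largest length l such that some length-l substrings of x and y
    differ in at most k positions (exact, O(|x| * |y|) time)."""
    best = 0
    n, m = len(x), len(y)
    # Each shift fixes a diagonal on which x[i] is aligned with y[i - shift].
    for shift in range(-(m - 1), n):
        i0 = max(shift, 0)
        j0 = i0 - shift
        length = min(n - i0, m - j0)
        left = 0        # window start, as an offset along the diagonal
        mismatches = 0  # mismatches inside the current window
        for right in range(length):
            if x[i0 + right] != y[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it is feasible again.
            while mismatches > k:
                if x[i0 + left] != y[j0 + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best
```

For `x = "abcde"` and `y = "abxde"` the result is 2 with `k = 0` and 5 with `k = 1`.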

preprint, 2020, arXiv

Generalised Pattern Matching Revisited

In the problem of $\texttt{Generalised Pattern Matching}\ (\texttt{GPM})$ [STOC'94, Muthukrishnan and Palem], we are given a text $T$ of length $n$ over an alphabet $Σ_T$, a pattern $P$ of length $m$ over an alphabet $Σ_P$, and a matching relationship $\subseteq Σ_T \times Σ_P$, and must return all substrings of $T$ that match $P$ (reporting) or the number of mismatches between each substring of $T$ of length $m$ and $P$ (counting). In this work, we improve over all previously known algorithms for this problem for various parameters describing the input instance:
* $\mathcal{D}\,$ being the maximum number of characters that match a fixed character,
* $\mathcal{S}\,$ being the number of pairs of matching characters,
* $\mathcal{I}\,$ being the total number of disjoint intervals of characters that match the $m$ characters of the pattern $P$.
At the heart of our new deterministic upper bounds for $\mathcal{D}\,$ and $\mathcal{S}\,$ lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate first lower bounds for $\texttt{GPM}$. We start by showing that any deterministic or Monte Carlo algorithm for $\texttt{GPM}$ must use $Ω(\mathcal{S})$ time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.
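The reporting and counting variants can be sketched with a direct $O(nm)$ scan. This only illustrates the problem definition and is unrelated to the faster algorithms of the paper. Representing the matching relationship as a Python set of (text character, pattern character) pairs is an assumption, as are the function names.

```python
def gpm_counting(text, pattern, matches):
    """For every length-m window of text, count the positions where the
    text character is NOT related to the pattern character."""
    m = len(pattern)
    return [
        sum((text[i + j], pattern[j]) not in matches for j in range(m))
        for i in range(len(text) - m + 1)
    ]


def gpm_reporting(text, pattern, matches):
    """Start positions of the windows of text that match the pattern."""
    return [i for i, c in enumerate(gpm_counting(text, pattern, matches)) if c == 0]
```

With the relation `{("a", "x"), ("b", "x"), ("b", "y"), ("c", "y")}`, the pattern `"xy"` matches the text `"abca"` at positions 0 and 1. For this relation $\mathcal{S} = 4$, since four pairs of characters match.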

preprint, 2016, arXiv

Approximate Hamming distance in a stream

We consider the problem of computing a $(1+ε)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(ε^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(ε^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for $(1+ε)$-approximate Hamming distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(ε^{-3} \sqrt{n} \log^{2} n)$ space and $O(ε^{-2} \log{n})$ time streaming $(1+ε)$-approximate Hamming distance algorithm. (2) For general input alphabets there is an $O(ε^{-5} \sqrt{n} \log^{4} n)$ space and $O(ε^{-4} \log^3 {n})$ time streaming $(1+ε)$-approximate Hamming distance algorithm.

preprint, 2015, arXiv

Dictionary matching in a stream

We consider the problem of dictionary matching in a stream. Given a set of strings, known as a dictionary, and a stream of characters arriving one at a time, the task is to report each time some string in our dictionary occurs in the stream. We present a randomised algorithm which takes O(log log(k + m)) time per arriving character and uses O(k log m) words of space, where k is the number of strings in the dictionary and m is the length of the longest string in the dictionary.
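The interface of the problem (feed one character, get back the dictionary strings that end there) can be sketched with the classic Aho-Corasick automaton. This is a swapped-in textbook technique: it stores the whole dictionary, so it does not achieve the small-space bound of the paper. The class and method names are assumptions, and dictionary strings are assumed to be non-empty.

```python
from collections import deque


class DictionaryMatcher:
    def __init__(self, words):
        # Trie stored as parallel lists indexed by node id; node 0 is the root.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for w in words:
            node = 0
            for ch in w:
                nxt = self.goto[node].get(ch)
                if nxt is None:
                    nxt = len(self.goto)
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = nxt
                node = nxt
            self.out[node].append(w)
        # Breadth-first pass computing failure links and merged output lists.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)
        self.state = 0

    def feed(self, ch):
        """Consume one stream character; return the words ending at it."""
        while self.state and ch not in self.goto[self.state]:
            self.state = self.fail[self.state]
        self.state = self.goto[self.state].get(ch, 0)
        return list(self.out[self.state])
```

Feeding the stream `"ushers"` to a matcher built from `["he", "she", "his", "hers"]` reports `she` and `he` at the fourth character and `hers` at the last one.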

preprint, 2015, arXiv

On Maximal Unbordered Factors

Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for the alphabet of size $σ\ge 5$ the expected length of the maximal unbordered factor of a string of length $n$ is at least $0.99 n$ (for sufficiently large values of $n$). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.
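For illustration, a quadratic-time baseline follows from the prefix function: a string is unbordered exactly when the last value of its prefix function is 0. The sketch below applies this to every suffix of $S$. It is not the algorithm proposed in the paper, and the function names are assumptions.

```python
def prefix_function(s: str):
    """pi[i] = length of the longest border of s[:i + 1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def maximal_unbordered_factor(s: str) -> str:
    """Leftmost longest factor of s that has no border (O(n^2) time)."""
    best = ""
    for start in range(len(s)):
        if len(s) - start <= len(best):
            break  # no factor starting here can be longer than best
        pi = prefix_function(s[start:])
        # pi[end] == 0 means s[start:start + end + 1] is unbordered.
        # Only lengths greater than len(best) are examined.
        for end in range(len(pi) - 1, len(best) - 1, -1):
            if pi[end] == 0:
                best = s[start:start + end + 1]
                break
    return best
```

For `"abaab"` the whole string has the border `"ab"`, and the sketch returns the factor `"baa"` of length 3.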

preprint, 2015, arXiv

The k-mismatch problem revisited

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2\log{k} / m+n \text{polylog} m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2\text{polylog} {m})$ space in total. 3) Next we give a randomised $(1+ε)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2\text{polylog} m / ε^2)$ space and runs in $O(\text{polylog} m / ε^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2\text{polylog} {m})$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k}\log{k} + \text{polylog} {m})$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
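The required output format can be illustrated with a naive $O(nm)$ scan that stops comparing an alignment as soon as it exceeds $k$ mismatches. This is only a sketch of the problem statement in the offline setting, not any of the algorithms listed above. The function name is an assumption.

```python
def k_mismatch(text: str, pattern: str, k: int):
    """For each alignment, the Hamming distance if it is at most k, else "No"."""
    m = len(pattern)
    result = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # the exact value no longer matters
        result.append(mismatches if mismatches <= k else "No")
    return result
```

For the text `"abcabd"`, the pattern `"abd"` and `k = 1`, the output is `[1, "No", "No", 0]`.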

preprint, 2015, arXiv

Wavelet Trees Meet Suffix Trees

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size $σ\leq n$, our method builds the wavelet tree in $O(n \log σ/ \sqrt{\log{n}})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of n integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to a stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w, count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow computing a run-length-encoded Burrows-Wheeler transform of a substring x of w in $O(s \log |x|)$ time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.
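For readers unfamiliar with the data structure, here is a minimal pointer-based wavelet tree over integers with the plain $O(n \log σ)$-time construction, supporting rank queries and range selection. It is a textbook sketch, not the faster construction of the paper. The class and method names are assumptions, prefix counts are stored in ordinary lists rather than succinct bit vectors, and the root is assumed to be built from a non-empty list.

```python
class WaveletTree:
    def __init__(self, data, lo=None, hi=None):
        if lo is None:
            lo, hi = min(data), max(data)  # root call: data must be non-empty
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not data:
            return  # leaf, or a node that received no values
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i values go to the left child.
        self.prefix = [0]
        for v in data:
            self.prefix.append(self.prefix[-1] + (v <= mid))
        self.left = WaveletTree([v for v in data if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in data if v > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of value among the first i elements."""
        if value < self.lo or value > self.hi:
            return 0
        if self.lo == self.hi:
            return i
        if self.left is None:
            return 0  # node without values
        left_count = self.prefix[i]
        if value <= (self.lo + self.hi) // 2:
            return self.left.rank(value, left_count)
        return self.right.rank(value, i - left_count)

    def kth_smallest(self, l, r, k):
        """k-th smallest (1-based) element of data[l:r]; needs 1 <= k <= r - l."""
        if self.lo == self.hi:
            return self.lo
        in_left = self.prefix[r] - self.prefix[l]
        if k <= in_left:
            return self.left.kth_smallest(self.prefix[l], self.prefix[r], k)
        return self.right.kth_smallest(l - self.prefix[l], r - self.prefix[r], k - in_left)
```

On `[3, 1, 4, 1, 5, 9, 2, 6]`, `rank(1, 4)` counts the two occurrences of 1 among the first four elements, and `kth_smallest(2, 6, 3)` selects the third smallest element of the subrange `[4, 1, 5, 9]`.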

preprint, 2014, arXiv

A Suffix Tree Or Not A Suffix Tree?

In this paper we study the structure of suffix trees. Given an unlabeled tree $τ$ on $n$ nodes and suffix links of its internal nodes, we ask the question "Is $τ$ a suffix tree?", i.e., is there a string $S$ whose suffix tree has the same topological structure as $τ$? We place no restrictions on $S$, in particular we do not require that $S$ ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. We prove that $τ$ is a suffix tree if and only if it is realized by a string $S$ of length $n-1$, and we give a linear-time algorithm for inferring $S$ when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].

preprint, 2014, arXiv

Sublinear Space Algorithms for the Longest Common Substring Problem

Given $m$ documents of total length $n$, we consider the problem of finding a longest string common to at least $d \geq 2$ of the documents. This problem is known as the longest common substring (LCS) problem and has a classic $O(n)$ space and $O(n)$ time solution (Weiner [FOCS'73], Hui [CPM'92]). However, the use of linear space is impractical in many applications. In this paper we show that for any trade-off parameter $1 \leq τ\leq n$, the LCS problem can be solved in $O(τ)$ space and $O(n^2/τ)$ time, thus providing the first smooth deterministic time-space trade-off from constant to linear space. The result uses a new and very simple algorithm, which computes a $τ$-additive approximation to the LCS in $O(n^2/τ)$ time and $O(1)$ space. We also show a time-space trade-off lower bound for deterministic branching programs, which implies that any deterministic RAM algorithm solving the LCS problem on documents from a sufficiently large alphabet in $O(τ)$ space must use $Ω(n\sqrt{\log(n/(τ\log n))/\log\log(n/(τ\log n))})$ time.

preprint, 2013, arXiv

Substring Suffix Selection

We study the following substring suffix selection problem: given a substring of a string T of length n, compute its k-th lexicographically smallest suffix. This is a natural generalization of the well-known question of computing the maximal suffix of a string, which is a basic ingredient in many other problems. We first revisit two special cases of the problem, introduced by Babenko, Kolesnichenko and Starikovskaya [CPM'13], in which we are asked to compute the minimal non-empty and the maximal suffixes of a substring. For the maximal suffixes problem, we give a linear-space structure with O(1) query time and linear preprocessing time, i.e., we manage to achieve optimal construction and optimal query time simultaneously. For the minimal suffix problem, we give a linear-space data structure with O(τ) query time and O(n log n / τ) preprocessing time, where 1 ≤ τ ≤ log n is a parameter of the data structure. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of T in O(k τ) time, where k is the number of distinct factors in the decomposition. Finally, we move to the general case of the substring suffix selection problem, where using any combinatorial properties seems more difficult. Nevertheless, we develop a linear-space data structure with O(log^{2+ε} n) query time.
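Two of the notions above can be made concrete without any preprocessing: the query itself, answered naively by sorting the suffixes of the substring, and the Lyndon decomposition, computed with Duval's classic linear-time algorithm. Both are baselines for illustration, not the data structures of the paper. The function names and the half-open index convention `t[i:j]` are assumptions.

```python
def kth_suffix(t: str, i: int, j: int, k: int) -> str:
    """k-th (1-based) lexicographically smallest non-empty suffix of t[i:j],
    found by sorting all suffixes of the substring."""
    sub = t[i:j]
    return sorted(sub[s:] for s in range(len(sub)))[k - 1]


def lyndon_factorization(s: str):
    """Duval's algorithm: factors are Lyndon words in non-increasing order."""
    factors = []
    i = 0
    while i < len(s):
        j, k = i + 1, i
        # Extend the current run of repetitions of a Lyndon word.
        while j < len(s) and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete repetition of the word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For `"banana"` the decomposition is `b, an, an, a`, and its last factor coincides with the minimal non-empty suffix returned by `kth_suffix("banana", 0, 6, 1)`.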

preprint, 2012, arXiv

Computing Lempel-Ziv Factorization Online

We present an algorithm which computes the Lempel-Ziv factorization of a word $W$ of length $n$ on an alphabet $Σ$ of size $σ$ online in the following sense: it reads $W$ starting from the left, and, after reading each $r = O(\log_σ n)$ characters of $W$, updates the Lempel-Ziv factorization. The algorithm requires $O(n \log σ)$ bits of space and $O(n \log^2 n)$ time. The basis of the algorithm is a sparse suffix tree combined with wavelet trees.
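As an illustration of what is being computed, the sketch below produces a Lempel-Ziv factorization by brute force. It assumes one common variant of the definition: each factor is the longest prefix of the unread part that also starts at an earlier position (overlaps allowed), or a single fresh character. The abstract does not spell out the variant, so this choice and the function name are assumptions, and the quadratic-or-worse running time bears no relation to the algorithm of the paper.

```python
def lz_factorize(w: str):
    factors = []
    i = 0
    n = len(w)
    while i < n:
        best = 0
        # Try every earlier starting position; the copy may overlap position i.
        for start in range(i):
            length = 0
            while i + length < n and w[start + length] == w[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a character with no earlier occurrence is its own factor
        factors.append(w[i:i + step])
        i += step
    return factors
```

Under this variant, `"abababab"` factorizes as `a, b, ababab`.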

preprint, 2012, arXiv

Cross-Document Pattern Matching

We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

preprint, 2011, arXiv

Linear pattern matching on sparse suffix trees

Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a sparse suffix tree [KU-96] with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_σ n$ characters ($σ$ the alphabet size), our index takes $O(n/\log_σ n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
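The packing idea from the first sentence can be sketched as follows. The example assumes a 4-letter DNA alphabet at 2 bits per character and 64-bit words, so $r = 32$ characters fit in one word. These parameters and the function names are illustrative assumptions, and the index itself is not shown.

```python
ALPHABET = "ACGT"                   # hypothetical alphabet, sigma = 4
BITS = 2                            # bits per character, log2(sigma)
WORD_BITS = 64                      # assumed machine word size
PER_WORD = WORD_BITS // BITS        # r = 32 characters per word
CODE = {c: i for i, c in enumerate(ALPHABET)}


def pack(s: str):
    """Pack s into a list of integers, each holding up to PER_WORD characters."""
    words = []
    for start in range(0, len(s), PER_WORD):
        word = 0
        for ch in s[start:start + PER_WORD]:
            word = (word << BITS) | CODE[ch]
        words.append(word)
    return words


def unpack(words, length: int) -> str:
    """Inverse of pack; length is the number of characters originally packed."""
    chars = []
    for idx, word in enumerate(words):
        count = min(PER_WORD, length - idx * PER_WORD)
        # Characters were shifted in left to right, so read the high bits first.
        for pos in range(count - 1, -1, -1):
            chars.append(ALPHABET[(word >> (pos * BITS)) & ((1 << BITS) - 1)])
    return "".join(chars)
```

A 42-character string occupies two words here, and comparing two packed words with a single integer comparison checks up to 32 characters at once.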
