Source author record

Dmitry Kosolobov

Dmitry Kosolobov appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Data Structures and Algorithms; Formal Languages and Automata Theory

Catalog footprint

What is connected

10 works, 2 topics, 4 close collaborators


preprint, 2020, arXiv

Lempel-Ziv-like Parsing in Small Space

Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the $k$th order empirical entropy compression $n H_k + o(n\log\sigma)$ with $k = o(\log_\sigma n)$, where $n$ is the input length and $\sigma$ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be $\Omega(\log n)$ times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below $2.0$ in all tested scenarios, and sometimes below $1.05$, to the size of LZ.
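To make the two parsings named in the abstract concrete, here is a minimal quadratic-time sketch of a greedy LZ parsing and a greedy RLZ parsing. It is a toy illustration and not the paper's implementation: the function names `lz_parse` and `rlz_parse` are hypothetical, the phrase definition (longest previously occurring prefix, or one fresh letter) is one common LZ77 variant, and no index structure or space saving is attempted.

```python
def lz_parse(text):
    """Greedy LZ77-style parsing of a string or list.

    Each phrase is the longest prefix of the remaining input that also
    starts at an earlier position (overlaps allowed), or a single fresh
    symbol when no such prefix exists.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # fall back to a single fresh symbol
        phrases.append(text[i:i + length])
        i += length
    return phrases


def rlz_parse(text, reference):
    """Greedy RLZ parsing of a string against a fixed reference string.

    Each phrase is the longest prefix of the remaining text that occurs
    somewhere in the reference, or a single literal letter otherwise.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Substrings of the reference are closed under taking prefixes,
        # so the match can be grown one letter at a time.
        while i + length < n and text[i:i + length + 1] in reference:
            length += 1
        length = max(length, 1)
        phrases.append(text[i:i + length])
        i += length
    return phrases
```

Because `lz_parse` only compares symbols for equality, it also accepts a list of RLZ phrases, which gives a rough picture of the second-level parsing over metasymbols: `lz_parse(rlz_parse(text, reference))`.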

preprint, 2020, arXiv

Optimal Skeleton Huffman Trees Revisited

A skeleton Huffman tree is a Huffman tree in which all disjoint maximal perfect subtrees are shrunk into leaves. Skeleton Huffman trees, besides saving storage space, are also used for faster decoding and for speeding up Huffman-shaped wavelet trees. In 2017 Klein et al. introduced an optimal skeleton tree: for given symbol frequencies, it has the least number of nodes among all optimal prefix-free code trees (not necessarily Huffman's) with shrunk perfect subtrees. Klein et al. described a simple algorithm that, for fixed codeword lengths, finds a skeleton tree with the least number of nodes; with this algorithm one can process each set of optimal codeword lengths to find an optimal skeleton tree. However, there are exponentially many such sets in the worst case. We describe an $O(n^2\log n)$-time algorithm that, given $n$ symbol frequencies, constructs an optimal skeleton tree and its corresponding optimal code.
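The definition in the first sentence can be illustrated with a short sketch that builds one Huffman tree and counts the nodes left after shrinking every maximal perfect subtree into a leaf. This is not the optimal skeleton tree algorithm of the abstract nor that of Klein et al.; it only measures the skeleton of whichever Huffman tree the tie-breaking happens to produce. The names `huffman_tree` and `skeleton_size` and the nested-tuple tree representation are assumptions made for the example.

```python
import heapq
from itertools import count


def huffman_tree(freqs):
    """Build a Huffman tree as nested pairs; a leaf is the symbol's index."""
    tie = count()  # unique tie-breaker so the heap never compares subtrees
    heap = [(f, next(tie), i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (a, b)))
    return heap[0][2]


def skeleton_size(tree):
    """Return (height, nodes): height is the subtree height if the subtree
    is perfect and None otherwise; nodes is the node count after every
    maximal perfect subtree has been shrunk into a single leaf."""
    if not isinstance(tree, tuple):
        return 0, 1  # a single leaf is a perfect tree of height 0
    left_height, left_nodes = skeleton_size(tree[0])
    right_height, right_nodes = skeleton_size(tree[1])
    if left_height is not None and left_height == right_height:
        return left_height + 1, 1  # perfect subtree, collapses to one leaf
    return None, 1 + left_nodes + right_nodes
```

For example, the frequencies `[1, 2, 4, 8]` give a path-like Huffman tree with 7 nodes in which only the deepest pair of leaves forms a perfect subtree, so the skeleton keeps 5 nodes.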

preprint, 2016, arXiv

Finding the Leftmost Critical Factorization on Unordered Alphabet

We present a linear time and space algorithm computing the leftmost critical factorization of a given string on an unordered alphabet.

preprint, 2015, arXiv

$\mathrm{Pal}^k$ Is Linear Recognizable Online

Given a language $L$ that is online recognizable in linear time and space, we construct a linear time and space online recognition algorithm for the language $L\cdot\mathrm{Pal}$, where $\mathrm{Pal}$ is the language of all nonempty palindromes. Hence for every fixed positive $k$, $\mathrm{Pal}^k$ is online recognizable in linear time and space. Thus we solve an open problem posed by Galil and Seiferas in 1978.
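As a reference point for what membership in $\mathrm{Pal}^k$ means, the following brute-force check decides whether a string splits into exactly $k$ nonempty palindromes. It is offline and takes polynomial rather than linear time, so it is unrelated to the construction in the abstract; the names `is_palindrome` and `in_pal_k` are hypothetical.

```python
def is_palindrome(s):
    """True when s reads the same in both directions."""
    return s == s[::-1]


def in_pal_k(s, k):
    """Decide whether s is a concatenation of exactly k nonempty palindromes."""
    n = len(s)
    # reach holds the prefix lengths that can be split into exactly j
    # nonempty palindromes, for j = 0, 1, ..., k in turn.
    reach = {0}
    for _ in range(k):
        reach = {
            end
            for begin in reach
            for end in range(begin + 1, n + 1)
            if is_palindrome(s[begin:end])
        }
    return n in reach
```

For instance, `in_pal_k("abacc", 2)` holds because of the split `aba` + `cc`, while `in_pal_k("abc", 2)` does not.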

preprint, 2015, arXiv

Computing Runs on a General Alphabet

We describe a RAM algorithm computing all runs (maximal repetitions) of a given string of length $n$ over a general ordered alphabet in $O(n\log^{2/3} n)$ time and linear space. Our algorithm outperforms all known solutions working in $\Theta(n\log\sigma)$ time provided $\sigma = n^{\Omega(1)}$, where $\sigma$ is the alphabet size. We conjecture that there exists a linear time RAM algorithm finding all runs.
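For readers unfamiliar with runs, a brute-force enumeration may help. The sketch below takes quadratic time and is only meant to pin down the definition: a run is reported as a triple (start, end, period) with an exclusive end, where the substring has length at least twice its minimal period and cannot be extended in either direction without breaking that period. The function name `runs` and the output format are assumptions for this example.

```python
def runs(s):
    """Return all runs of s as sorted (start, end, period) triples,
    with end exclusive and period the minimal period of the run."""
    n = len(s)
    found = {}
    for p in range(1, n // 2 + 1):
        t = 0
        while t + p < n:
            if s[t] != s[t + p]:
                t += 1
                continue
            a = t
            # Extend the stretch of positions x with s[x] == s[x + p].
            while t + p < n and s[t] == s[t + p]:
                t += 1
            # s[a:t + p] has period p and is maximal for p; it is a
            # repetition when it covers at least two full periods.
            if t - a >= p:
                # Periods are tried in increasing order, so the first
                # period stored for an interval is its minimal one.
                found.setdefault((a, t + p), p)
    return sorted((a, b, p) for (a, b), p in found.items())
```

On `"mississippi"` this reports the run `ississi` with period 3 together with the three runs of period 1 (`ss`, `ss`, `pp`).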

preprint, 2015, arXiv

Faster Lightweight Lempel-Ziv Parsing

We present an algorithm that computes the Lempel-Ziv decomposition in $O(n(\log\sigma + \log\log n))$ time and $n\log\sigma + \epsilon n$ bits of space, where $\epsilon$ is a constant rational parameter, $n$ is the length of the input string, and $\sigma$ is the alphabet size. The $n\log\sigma$ bits in the space bound are for the input string itself which is treated as read-only.

preprint, 2015, arXiv

Online Detection of Repetitions with Backtracking

In this paper we present two algorithms for the following problem: given a string and a rational $e > 1$, detect in an online fashion the earliest occurrence of a repetition of exponent $\ge e$ in the string.

1. The first algorithm supports the backtrack operation removing the last letter of the input string. This solution runs in $O(n\log m)$ time and $O(m)$ space, where $m$ is the maximal length of a string generated during the execution of a given sequence of $n$ read and backtrack operations.

2. The second algorithm works in $O(n\log\sigma)$ time and $O(n)$ space, where $n$ is the length of the input string and $\sigma$ is the number of distinct letters. This algorithm is relatively simple and requires much less memory than the previously known solution with the same working time and space.

preprint, 2014, arXiv

Lempel-Ziv Factorization May Be Harder Than Computing All Runs

The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over an ordered alphabet. It is known that both these problems can be solved by RAM algorithms in $O(n\log\sigma)$ time, where $n$ is the length of the input string and $\sigma$ is the number of distinct letters in it. We prove an $\Omega(n\log\sigma)$ lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computing runs via the Lempel-Ziv factorization cannot achieve an $o(n\log\sigma)$ time bound. In contrast with this, we exhibit an $O(n)$ decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.

preprint, 2014, arXiv

Online Square Detection

The online square detection problem is to detect the first occurrence of a square in a string whose characters are provided as input one at a time. Recall that a square is a string that is a concatenation of two identical strings. In this paper we present an algorithm solving this problem in $O(n\log\sigma)$ time and linear space on an ordered alphabet, where $\sigma$ is the number of different letters in the input string. Our solution is relatively simple and does not require much memory, unlike the previously known online algorithm with the same working time. We also present an algorithm working in $O(n\log n)$ time and linear space on an unordered alphabet, though this solution does not outperform the previously known result with the same time bound.
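The problem statement can be made concrete with a naive online detector. After each new letter it tests every even-length suffix, which costs quadratic time overall and is therefore far from the bounds in the abstract; it only illustrates what "first occurrence" means in the online setting. The name `first_square_end` is hypothetical.

```python
def first_square_end(stream):
    """Read letters one at a time and return how many letters had been
    read when a square first appeared, or None if none ever appears."""
    seen = []
    for letter in stream:
        seen.append(letter)
        n = len(seen)
        # The earliest square must end at the newest letter, so it is
        # enough to test suffixes of the form uu with |u| = half.
        for half in range(1, n // 2 + 1):
            if seen[n - half:] == seen[n - 2 * half:n - half]:
                return n
    return None
```

For the input `abcbc` the detector stops after the fifth letter, when the suffix `bcbc` becomes a square.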

preprint, 2013, arXiv

Finding Distinct Subpalindromes Online

We exhibit an online algorithm finding all distinct palindromes inside a given string in time $\Theta(n\log|\Sigma|)$ over an ordered alphabet and in time $\Theta(n|\Sigma|)$ over an unordered alphabet. Using a reduction from a dictionary-like data structure, we prove the optimality of this algorithm in the comparison-based computation model.
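As a baseline for the task described above, the distinct palindromic substrings can be collected offline by expanding around every center. This sketch is neither online nor within the stated time bounds, and the name `distinct_palindromes` is an assumption; it is intended only as a reference against which a faster implementation could be checked.

```python
def distinct_palindromes(s):
    """Return the set of distinct nonempty palindromic substrings of s."""
    found = set()
    n = len(s)
    # Centers 0, 2, 4, ... sit on letters (odd-length palindromes);
    # centers 1, 3, 5, ... sit between letters (even-length palindromes).
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            found.add(s[lo:hi + 1])
            lo -= 1
            hi += 1
    return found
```

For `"abaab"` the result is the five palindromes `a`, `b`, `aa`, `aba`, and `baab`.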
