Source author record

Hideo Bannai

Hideo Bannai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms, Discrete Mathematics, math.CO, Databases, Formal Languages and Automata Theory

Catalog footprint

What is connected

34 works

5 topics

4 close collaborators

preprint 2022 arXiv

Combinatorics of minimal absent words for a sliding window

A string $w$ is called a minimal absent word (MAW) for another string $T$ if $w$ does not occur in $T$ but the proper substrings of $w$ occur in $T$. For example, let $Σ= \{\mathtt{a, b, c}\}$ be the alphabet. Then, the set of MAWs for string $w = \mathtt{abaab}$ is $\{\mathtt{aaa, aaba, bab, bb, c}\}$. In this paper, we study combinatorial properties of MAWs in the sliding window model, namely, how the set of MAWs changes when a sliding window of fixed length $d$ is shifted over the input string $T$ of length $n$, where $1 \leq d < n$. We present \emph{tight} upper and lower bounds on the maximum number of changes in the set of MAWs for a sliding window over $T$, both in the cases of general alphabets and binary alphabets. Our bounds improve on the previously known best bounds [Crochemore et al., 2020].
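The definition and the $\mathtt{abaab}$ example above can be checked with a small brute-force sketch. This is not the paper's method: it enumerates all substrings, so it is only suitable for short strings. The function names `minimal_absent_words` and `maw_changes` are illustrative assumptions.

```python
def minimal_absent_words(text, alphabet):
    """Brute-force MAWs of `text` over `alphabet` (short strings only)."""
    n = len(text)
    # All substrings of text, including the empty string.
    occurring = {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for u in occurring:
        for c in alphabet:
            w = u + c
            # w[:-1] == u occurs by construction. Every proper substring of w
            # lies inside w[:-1] or w[1:], so checking w[1:] is sufficient.
            if w not in occurring and w[1:] in occurring:
                maws.add(w)
    return maws


def maw_changes(text, d, alphabet):
    """Number of MAWs gained or lost at each shift of a window of length d."""
    sets = [minimal_absent_words(text[i:i + d], alphabet)
            for i in range(len(text) - d + 1)]
    return [len(a ^ b) for a, b in zip(sets, sets[1:])]


print(sorted(minimal_absent_words("abaab", "abc")))
# ['aaa', 'aaba', 'bab', 'bb', 'c']
```

`maw_changes` recomputes each window from scratch, which makes the quantity studied in the abstract (the number of changes per shift) easy to inspect on small inputs.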

preprint 2022 arXiv

Computing Longest (Common) Lyndon Subsequences

Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 σ)$ time using $O(n^3)$ space.
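To make the problem statement concrete, here is an exhaustive-search sketch. It is exponential in $n$ and is not the $O(n^3)$ algorithm from the abstract; it only fixes the definition (a Lyndon word is strictly smaller than each of its proper suffixes). Function names are illustrative.

```python
from itertools import combinations


def is_lyndon(w):
    """A nonempty word is Lyndon if it is strictly smaller than every proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def longest_lyndon_subsequence(t):
    """Try subsequences from longest to shortest; exponential time."""
    for k in range(len(t), 0, -1):
        for idx in combinations(range(len(t)), k):
            cand = "".join(t[i] for i in idx)
            if is_lyndon(cand):
                return cand
    return ""


print(longest_lyndon_subsequence("abab"))  # abb
```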

preprint 2022 arXiv

Computing NP-hard Repetitiveness Measures via MAX-SAT

Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors.
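For orientation, a string attractor is a set of positions such that every substring has at least one occurrence covering a position in the set. The sketch below finds a smallest attractor by exhaustive search rather than by a MAX-SAT formulation, so it only works for very short strings; it is meant to show why exact computation is costly, not to reproduce the paper's implementation. Positions are 0-based and the function names are assumptions.

```python
from itertools import combinations


def is_attractor(text, positions):
    """Check that every distinct substring has an occurrence covering a position."""
    n = len(text)
    pos = set(positions)
    all_subs, covered = set(), set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            all_subs.add(s)
            # The occurrence text[i:j] spans positions i .. j-1.
            if any(p in pos for p in range(i, j)):
                covered.add(s)
    return covered == all_subs


def smallest_attractor(text):
    """Smallest attractor by trying all position sets of increasing size."""
    n = len(text)
    for k in range(n + 1):
        for cand in combinations(range(n), k):
            if is_attractor(text, cand):
                return list(cand)


print(smallest_attractor("abab"))  # [0, 1]
```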

preprint 2022 arXiv

Longest (Sub-)Periodic Subsequence

We present an algorithm computing the longest periodic subsequence of a string of length $n$ in $O(n^7)$ time with $O(n^4)$ words of space. We obtain improvements when restricting the exponents or extending the search allowing the reported subsequence to be subperiodic down to $O(n^3)$ time and $O(n^2)$ words of space.

preprint 2021 arXiv

Computing longest palindromic substring after single-character or block-wise edits

Palindromes are important objects in strings which have been extensively studied from combinatorial, algorithmic, and bioinformatics points of view. It is known that the length of the longest palindromic substrings (LPSs) of a given string $T$ of length $n$ can be computed in $O(n)$ time by Manacher's algorithm [J. ACM '75]. In this paper, we consider the problem of finding the LPS after the string is edited. We present an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\log (\min \{σ, \log n\}))$ time after a single character substitution, insertion, or deletion, where $σ$ denotes the number of distinct characters appearing in $T$. We also propose an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\ell + \log \log n)$ time, after an existing substring in $T$ is replaced by a string of arbitrary length $\ell$.
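As a baseline for the edit queries described above, the following sketch implements Manacher's algorithm and answers a substitution query by recomputing from scratch in linear time. The paper's data structure avoids this recomputation; the helper `lps_after_substitution` is a hypothetical name used only to show what the query returns.

```python
def lps_length(t):
    """Length of a longest palindromic substring of t via Manacher's algorithm."""
    # Interleave separators so odd and even palindromes are handled uniformly.
    s = "#" + "#".join(t) + "#"
    n = len(s)
    radius = [0] * n
    center = right = 0
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position inside the known palindrome.
            radius[i] = min(right - i, radius[2 * center - i])
        while (i - radius[i] - 1 >= 0 and i + radius[i] + 1 < n
               and s[i - radius[i] - 1] == s[i + radius[i] + 1]):
            radius[i] += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # A radius in s equals the palindrome length in t.
    return max(radius)


def lps_after_substitution(t, i, c):
    """Naive query: substitute t[i] by c and recompute (0-based index)."""
    return lps_length(t[:i] + c + t[i + 1:])


print(lps_length("abaab"))                      # 4
print(lps_after_substitution("abaab", 1, "a"))  # 4
```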

preprint 2021 arXiv

The Parameterized Suffix Tray

Let $Σ$ and $Π$ be disjoint alphabets, respectively called the static alphabet and the parameterized alphabet. Two strings $x$ and $y$ over $Σ\cup Π$ of equal length are said to parameterized match (p-match) if there exists a renaming bijection $f$ on $Σ$ and $Π$ which is identity on $Σ$ and maps the characters of $x$ to those of $y$ so that the two strings become identical. The indexing version of the problem of finding p-matching occurrences of a given pattern in the text is a well-studied topic in string matching. In this paper, we present a state-of-the-art indexing structure for p-matching called the parameterized suffix tray of an input text $T$, denoted by $\mathsf{PSTray}(T)$. We show that $\mathsf{PSTray}(T)$ occupies $O(n)$ space and supports pattern matching queries in $O(m + \log (σ+π) + \mathit{occ})$ time, where $n$ is the length of $T$, $m$ is the length of a query pattern $P$, $π$ is the number of distinct symbols of $|Π|$ in $T$, $σ$ is the number of distinct symbols of $|Σ|$ in $T$ and $\mathit{occ}$ is the number of p-matching occurrences of $P$ in $T$. We also present how to build $\mathsf{PSTray}(T)$ in $O(n)$ time from the parameterized suffix tree of $T$.
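The p-match relation defined above can be tested directly with two dictionaries that enforce a bijection on parameterized symbols. The sketch below also lists p-matching occurrences by checking every window, which takes $O(nm)$ time; it illustrates the query only and is unrelated to how $\mathsf{PSTray}(T)$ answers it. Any symbol not in `static_alphabet` is assumed to belong to $Π$.

```python
def p_match(x, y, static_alphabet):
    """True if x and y p-match: static symbols agree, others map bijectively."""
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        if a in static_alphabet or b in static_alphabet:
            if a != b:
                return False
            continue
        # setdefault records the first pairing and exposes any later conflict.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def p_occurrences(text, pattern, static_alphabet):
    """Start positions of all windows of text that p-match pattern (naive)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(pattern, text[i:i + m], static_alphabet)]


print(p_match("axbyx", "aybzy", {"a", "b"}))  # True
print(p_occurrences("xyxzx", "xyx", set()))   # [0, 2]
```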

preprint 2020 arXiv

Detecting $k$-(Sub-)Cadences and Equidistant Subsequence Occurrences

The equidistant subsequence pattern matching problem is considered. Given a pattern string $P$ and a text string $T$, we say that $P$ is an \emph{equidistant subsequence} of $T$ if $P$ is a subsequence of the text such that consecutive symbols of $P$ in the occurrence are equally spaced. We can consider the problem of equidistant subsequences as generalizations of (sub-)cadences. We give bit-parallel algorithms that yield $o(n^2)$ time algorithms for finding $k$-(sub-)cadences and equidistant subsequences. Furthermore, $O(n\log^2 n)$ and $O(n\log n)$ time algorithms, respectively for equidistant and Abelian equidistant matching for the case $|P| = 3$, are shown. The algorithms make use of a technique that was recently introduced which can efficiently compute convolutions with linear constraints.
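A direct check of the definition, for reference: the sketch below enumerates every start position and gap, which is the quadratic baseline that the bit-parallel and convolution-based algorithms in the abstract improve on. It assumes $|P| \geq 2$ (the gap is undefined otherwise) and 0-based positions.

```python
def equidistant_occurrences(text, pattern):
    """All (start, gap) with text[start + k*gap] == pattern[k] for every k."""
    n, m = len(text), len(pattern)
    if m < 2:
        raise ValueError("pattern must have length at least 2")
    hits = []
    # The largest feasible gap satisfies (m - 1) * gap <= n - 1.
    for gap in range(1, (n - 1) // (m - 1) + 1):
        for start in range(n - (m - 1) * gap):
            if all(text[start + k * gap] == pattern[k] for k in range(m)):
                hits.append((start, gap))
    return hits


print(equidistant_occurrences("abcabc", "ac"))  # [(0, 2), (3, 2), (0, 5)]
```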

preprint 2020 arXiv

Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings

For a string $S$, a palindromic substring $S[i..j]$ is said to be a \emph{shortest unique palindromic substring} ($\mathit{SUPS}$) for an interval $[s, t]$ in $S$, if $S[i..j]$ occurs exactly once in $S$, the interval $[i, j]$ contains $[s, t]$, and every palindromic substring containing $[s, t]$ which is shorter than $S[i..j]$ occurs at least twice in $S$. In this paper, we study the problem of answering $\mathit{SUPS}$ queries on run-length encoded strings. We show how to preprocess a given run-length encoded string $\mathit{RLE}_{S}$ of size $m$ in $O(m)$ space and $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time so that all $\mathit{SUPSs}$ for any subsequent query interval can be answered in $O(\sqrt{\log m / \log\log m} + α)$ time, where $α$ is the number of outputs, and $σ_{\mathit{RLE}_{S}}$ is the number of distinct runs of $\mathit{RLE}_{S}$. Additionally, we consider a variant of the SUPS problem where a query interval is also given in a run-length encoded form. For this variant of the problem, we present two alternative algorithms with faster queries. The first one answers queries in $O(\sqrt{\log\log m /\log\log\log m} + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time, and the second one answers queries in $O(\log \log m + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}})$ time. Both of these data structures require $O(m)$ space.

preprint 2020 arXiv

Faster STR-EC-LCS Computation

The longest common subsequence (LCS) problem is a central problem in stringology that finds the longest common subsequence of two given strings $A$ and $B$. More recently, a set of four constrained LCS problems (called generalized constrained LCS problem) were proposed by Chen and Chao [J. Comb. Optim, 2011]. In this paper, we consider the substring-excluding constrained LCS (STR-EC-LCS) problem. A string $Z$ is said to be an STR-EC-LCS of two given strings $A$ and $B$ excluding $P$ if $Z$ is one of the longest common subsequences of $A$ and $B$ that does not contain $P$ as a substring. Wang et al. proposed a dynamic programming solution which computes an STR-EC-LCS in $O(mnr)$ time and space where $m = |A|, n = |B|, r = |P|$ [Inf. Process. Lett., 2013]. In this paper, we show a new solution for the STR-EC-LCS problem. Our algorithm computes an STR-EC-LCS in $O(n|Σ| + (L+1)(m-L+1)r)$ time where $Σ$ denotes the set of distinct characters occurring in both $A$ and $B$, so that $|Σ| \leq \min\{m, n\}$, and $L$ is the length of the STR-EC-LCS. This algorithm is faster than the $O(mnr)$-time algorithm for short/long STR-EC-LCS (namely, $L \in O(1)$ or $m-L \in O(1)$), and is at least as efficient as the $O(mnr)$-time algorithm for all cases.
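The $O(mnr)$-state dynamic program mentioned above can be sketched as follows. This is a textbook-style formulation, assumed here for illustration rather than taken from either paper: a state $(i, j, k)$ stores the best length of a common subsequence of $A[..i]$ and $B[..j]$ whose longest suffix matching a prefix of $P$ has length $k < r$. The transition function `step` is computed naively, so the sketch is slower than $O(mnr)$ overall.

```python
def str_ec_lcs_length(a, b, p):
    """Length of a longest common subsequence of a and b avoiding p as a substring."""
    m, n, r = len(a), len(b), len(p)
    if r == 0:
        raise ValueError("pattern must be nonempty")

    def step(k, c):
        # Longest prefix of p that is a suffix of p[:k] + c (naive KMP transition).
        s = p[:k] + c
        for t in range(min(r, len(s)), -1, -1):
            if s.endswith(p[:t]):
                return t

    NEG = -1  # marks unreachable states
    f = [[[NEG] * r for _ in range(n + 1)] for _ in range(m + 1)]
    f[0][0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            for k in range(r):
                v = f[i][j][k]
                if v < 0:
                    continue
                if i < m:  # skip a[i]
                    f[i + 1][j][k] = max(f[i + 1][j][k], v)
                if j < n:  # skip b[j]
                    f[i][j + 1][k] = max(f[i][j + 1][k], v)
                if i < m and j < n and a[i] == b[j]:
                    k2 = step(k, a[i])
                    if k2 < r:  # reaching r would complete an occurrence of p
                        f[i + 1][j + 1][k2] = max(f[i + 1][j + 1][k2], v + 1)
    return max(f[m][n])


print(str_ec_lcs_length("abab", "abab", "ab"))  # 2
```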

preprint 2020 arXiv

Grammar-compressed Self-index with Lyndon Words

We introduce a new class of straight-line programs (SLPs), named the Lyndon SLP, inspired by the Lyndon trees (Barcelo, 1990). Based on this SLP, we propose a self-index data structure of $O(g)$ words of space that can be built from a string $T$ in $O(n \lg n)$ expected time, retrieving the starting positions of all occurrences of a pattern $P$ of length $m$ in $O(m + \lg m \lg n + occ \lg g)$ time, where $n$ is the length of $T$, $g$ is the size of the Lyndon SLP for $T$, and $occ$ is the number of occurrences of $P$ in $T$.

preprint 2020 arXiv

Longest Square Subsequence Problem Revisited

The longest square subsequence (LSS) problem consists of computing a longest subsequence of a given string $S$ that is a square, i.e., a longest subsequence of form $XX$ appearing in $S$. It is known that an LSS of a string $S$ of length $n$ can be computed using $O(n^2)$ time [Kosowski 2004], or with (model-dependent) polylogarithmic speed-ups using $O(n^2 (\log \log n)^2 / \log^2 n)$ time [Tiskin 2013]. We present the first algorithm for LSS whose running time depends on other parameters, i.e., we show that an LSS of $S$ can be computed in $O(r \min\{n, M\}\log \frac{n}{r} + n + M \log n)$ time with $O(M)$ space, where $r$ is the length of an LSS of $S$ and $M$ is the number of matching points on $S$.
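A simple way to see the problem structure: any square subsequence $XX$ places its first copy of $X$ left of some split point and its second copy right of it, so the answer is twice the best LCS over all splits. The sketch below does exactly that in $O(n^3)$ time; it is a naive baseline, not the $O(n^2)$ algorithm of Kosowski or the parameterized algorithm from the abstract.

```python
def lcs_length(x, y):
    """Standard LCS length using two rolling rows."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_square_subsequence_length(s):
    """Twice the maximum LCS of s[:i] and s[i:] over all split points i."""
    return 2 * max((lcs_length(s[:i], s[i:]) for i in range(1, len(s))), default=0)


print(longest_square_subsequence_length("abcab"))  # 4
```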

preprint 2020 arXiv

Lyndon Words, the Three Squares Lemma, and Primitive Squares

We revisit the so-called "Three Squares Lemma" by Crochemore and Rytter [Algorithmica 1995] and, using arguments based on Lyndon words, derive a more general variant which considers three overlapping squares which do not necessarily share a common prefix. We also give an improved upper bound of $n\log_2 n$ on the maximum number of (occurrences of) primitively rooted squares in a string of length $n$, also using arguments based on Lyndon words. To the best of our knowledge, the only known upper bound was $n \log_ϕ n \approx 1.441n\log_2 n$, where $ϕ$ is the golden ratio, reported by Fraenkel and Simpson [TCS 1999] obtained via the Three Squares Lemma.

preprint 2020 arXiv

On repetitiveness measures of Thue-Morse words

We show that the size $γ(t_n)$ of the smallest string attractor of the $n$th Thue-Morse word $t_n$ is 4 for any $n\geq 4$, disproving the conjecture by Mantaci et al. [ICTCS 2019] that it is $n$. We also show that $δ(t_n) = \frac{10}{3+2^{4-n}}$ for $n \geq 3$, where $δ(w)$ is the maximum over all $k = 1,\ldots,|w|$, the number of distinct substrings of length $k$ in $w$ divided by $k$, which is a measure of repetitiveness recently studied by Kociumaka et al. [LATIN 2020]. Furthermore, we show that the number $z(t_n)$ of factors in the self-referencing Lempel-Ziv factorization of $t_n$ is exactly $2n$.
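The closed form for $δ(t_n)$ can be compared against a direct computation on small $n$. The sketch assumes the convention $t_0 = \mathtt{a}$ and $t_n = t_{n-1}\overline{t_{n-1}}$ (so $|t_n| = 2^n$), which is an assumption about the paper's indexing; under it, the brute-force values for $n = 3$ and $n = 4$ agree with the formula.

```python
from fractions import Fraction


def thue_morse(n):
    """t_0 = 'a'; t_n is t_{n-1} followed by its complement (length 2**n)."""
    t = "a"
    swap = str.maketrans("ab", "ba")
    for _ in range(n):
        t += t.translate(swap)
    return t


def delta(w):
    """max over k of (number of distinct length-k substrings of w) / k, exactly."""
    return max(
        Fraction(len({w[i:i + k] for i in range(len(w) - k + 1)}), k)
        for k in range(1, len(w) + 1)
    )


def delta_formula(n):
    """The closed form 10 / (3 + 2^(4-n)) stated in the abstract for n >= 3."""
    return Fraction(10) / (3 + Fraction(2) ** (4 - n))


print(delta(thue_morse(3)), delta_formula(3))  # 2 2
print(delta(thue_morse(4)), delta_formula(4))  # 5/2 5/2
```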

preprint 2020 arXiv

Space-Efficient Algorithms for Computing Minimal/Shortest Unique Substrings

Given a string $T$ of length $n$, a substring $u = T[i..j]$ of $T$ is called a shortest unique substring (SUS) for an interval $[s,t]$ if (a) $u$ occurs exactly once in $T$, (b) $u$ contains the interval $[s,t]$ (i.e. $i \leq s \leq t \leq j$), and (c) every substring $v$ of $T$ with $|v| < |u|$ containing $[s,t]$ occurs at least twice in $T$. Given a query interval $[s, t] \subset [1, n]$, the interval SUS problem is to output all the SUSs for the interval $[s,t]$. In this article, we propose a $4n + o(n)$ bits data structure answering an interval SUS query in output-sensitive $O(\mathit{occ})$ time, where $\mathit{occ}$ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for $s = t$. Here, we propose a $\lceil (\log_2{3} + 1)n \rceil + o(n)$ bits data structure answering a point SUS query in the same output-sensitive time. We also propose space-efficient algorithms for computing the minimal unique substrings of $T$.

preprint 2020 arXiv

Towards Efficient Interactive Computation of Dynamic Time Warping Distance

The dynamic time warping (DTW) is a widely-used method that allows us to efficiently compare two time series that can vary in speed. Given two strings $A$ and $B$ of respective lengths $m$ and $n$, there is a fundamental dynamic programming algorithm that computes the DTW distance for $A$ and $B$ together with an optimal alignment in $Θ(mn)$ time and space. In this paper, we tackle the problem of interactive computation of the DTW distance for dynamic strings, denoted $\mathrm{D^2TW}$, where character-wise edit operation (insertion, deletion, substitution) can be performed at an arbitrary position of the strings. Let $M$ and $N$ be the sizes of the run-length encoding (RLE) of $A$ and $B$, respectively. We present an algorithm for $\mathrm{D^2TW}$ that occupies $Θ(mN+nM)$ space and uses $O(m+n+\#_{\mathrm{chg}}) \subseteq O(mN + nM)$ time to update a compact differential representation $\mathit{DS}$ of the DP table per edit operation, where $\#_{\mathrm{chg}}$ denotes the number of cells in $\mathit{DS}$ whose values change after the edit operation. Our method is at least as efficient as the algorithm recently proposed by Froese et al. running in $Θ(mN + nM)$ time, and is faster when $\#_{\mathrm{chg}}$ is smaller than $O(mN + nM)$ which, as our preliminary experiments suggest, is likely to be the case in the majority of instances.
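The fundamental $Θ(mn)$ dynamic program referred to above looks roughly as follows. The local cost $|x - y|$ on numeric sequences is an assumption made for the example; the abstract does not fix a cost function. This is the static table computation only, without the differential representation $\mathit{DS}$ or any support for edits.

```python
def dtw_distance(a, b):
    """Classic DTW table; the cost of aligning x with y is assumed to be |x - y|."""
    inf = float("inf")
    m, n = len(a), len(b)
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Each cell extends the cheapest of the three neighbouring alignments.
            d[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]
            )
    return d[m][n]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0
print(dtw_distance([1, 3], [1, 2, 3]))        # 1
```

Because a run of equal symbols can be absorbed at no extra cost, inputs with long runs leave large regions of the table determined by their borders, which is the intuition behind working with run-length encodings of sizes $M$ and $N$.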

preprint 2016 arXiv

Deterministic sub-linear space LCE data structures with efficient construction

Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq τ \leq n$, their best deterministic solution is a data structure of size $O(n/τ)$ which allows LCE queries to be answered in $O(τ)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(τ \min\{\log τ, \log\frac{n}{τ}\})$ query time using $O(n/τ)$ space, but significantly improves the construction time to $O(nτ)$.
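For reference, the query itself can be answered with no preprocessing by direct character comparison, which corresponds to the extreme end of the trade-off (constant extra space, time linear in the answer). The sketch uses 0-based suffix indices, unlike the 1-based convention suggested by the abstract.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (0-based, naive)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


print(lce("abababb", 0, 2))  # 4
```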

preprint 2016 arXiv

Dynamic index and LZ factorization in compressed space

In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space.

preprint 2016 arXiv

Dynamic index, LZ factorization, and LCE queries in compressed space

In this paper, we present the following results: (1) We propose a new \emph{dynamic compressed index} of $O(w)$ space, that supports searching for a pattern $P$ in the current text in $O(|P| f(M,w) + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $N$ is the length of the current text, $M$ is the maximum length of the dynamic text, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of the current text, $f(a,b) = O(\min \{ \frac{\log\log a \log\log b}{\log\log\log a}, \sqrt{\frac{\log b}{\log\log b}} \})$ and $w = O(z \log N \log^*M)$. (2) We propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f(N,w') + z \log w' \log^3 N (\log^* N)^2)$ time with $O(w')$ working space, where $w' =O(z \log N \log^* N)$. (3) We propose a data structure of $O(w)$ space which supports longest common extension (LCE) queries on the text in $O(\log N + \log \ell \log^* N)$ time, where $\ell$ is the output LCE length. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

preprint 2016 arXiv

Fully dynamic data structure for LCE queries in compressed space

A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, has a capability to support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: After processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y+ \log N\log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

preprint 2015 arXiv

Constructing LZ78 Tries and Position Heaps in Linear Time for Large Alphabets

We present the first worst-case linear-time algorithm to compute the Lempel-Ziv 78 factorization of a given string over an integer alphabet. Our algorithm is based on nearest marked ancestor queries on the suffix tree of the given string. We also show that the same technique can be used to construct the position heap of a set of strings in worst-case linear time, when the set of strings is given as a trie.
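For comparison, the conventional way to compute the LZ78 factorization grows the LZ78 trie one character at a time. The sketch below stores the trie in a hash map, so its linear running time holds only in expectation; it does not use the suffix tree or nearest marked ancestor queries described in the abstract.

```python
def lz78_factorize(text):
    """LZ78 factors of text, built with a dictionary-backed trie."""
    trie = {}  # (parent node, character) -> child node; node 0 is the root
    factors = []
    node, start = 0, 0
    for i, c in enumerate(text):
        nxt = trie.get((node, c))
        if nxt is None:
            # Longest previous factor plus one new character: close the factor.
            trie[(node, c)] = len(trie) + 1
            factors.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = nxt
    if start < len(text):
        # The final factor may coincide with an earlier one.
        factors.append(text[start:])
    return factors


print(lz78_factorize("abababab"))  # ['a', 'b', 'ab', 'aba', 'b']
```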

preprint 2013 arXiv

Computing convolution on grammar-compressed text

The convolution between a text string $S$ of length $N$ and a pattern string $P$ of length $m$ can be computed in $O(N \log m)$ time by FFT. It is known that various types of approximate string matching problems are reducible to convolution. In this paper, we assume that the input text string is given in a compressed form, as a \emph{straight-line program (SLP)}, which is a context free grammar in the Chomsky normal form that derives a single string. Given an SLP $\mathcal{S}$ of size $n$ describing a text $S$ of length $N$, and an uncompressed pattern $P$ of length $m$, we present a simple $O(nm \log m)$-time algorithm to compute the convolution between $S$ and $P$. We then show that this can be improved to $O(\min\{nm, N-α\} \log m)$ time, where $α\geq 0$ is a value that represents the amount of redundancy that the SLP captures with respect to the length-$m$ substrings. The key of the improvement is our new algorithm that computes the convolution between a trie of size $r$ and a pattern string $P$ of length $m$ in $O(r \log m)$ time.

preprint 2013 arXiv

Detecting regularities on grammar-compressed strings

We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size $n$ that represents a string $s$ of length $N$, our algorithm computes all runs and squares in $s$ in $O(n^3h)$ time and $O(n^2)$ space, where $h$ is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped-palindromes in $O(n^3h + gnh\log N)$ time and $O(n^2)$ space, where $g$ is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in $O(n^2 h)$ time and $O(nh(n+\log^2 N))$ time, respectively.

preprint 2013 arXiv

Efficient Lyndon factorization of grammar compressed text

We present an algorithm for computing the Lyndon factorization of a string that is given in grammar compressed form, namely, a Straight Line Program (SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $m$ is the size of the Lyndon factorization, $n$ is the size of the SLP, and $h$ is the height of the derivation tree of the SLP. Since the length of the decompressed string can be exponentially large w.r.t. $n, m$ and $h$, our result is the first polynomial time solution when the string is given as SLP.
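On an uncompressed string, the Lyndon factorization is usually computed with Duval's linear-time algorithm, sketched below. It is included to show what the output looks like; it works on the decompressed string and is therefore not the SLP-based algorithm from the abstract.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    n, i = len(s), 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j] remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the repeated Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```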

preprint 2013 arXiv

Faster Compact On-Line Lempel-Ziv Factorization

We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\logσ)$ bits of working space, where $N$ is the length of the string and $σ$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.

preprint 2013 arXiv

On the existence of tight relative 2-designs on binary Hamming association schemes

It is known that there is a close analogy between "Euclidean t-designs vs. spherical t-designs" and "Relative t-designs in binary Hamming association schemes vs. combinatorial t-designs". In this paper, we want to prove how much we can develop a similar theory in the latter situation, imitating the theory in the former one. We first prove that the weight function is constant on each shell for tight relative t-designs on p shells on a wide class of Q-polynomial association schemes, including Hamming association schemes. In the theory of Euclidean t-designs on 2 concentric spheres (shells), it is known that the structure of coherent configurations is naturally attached. However, it seems difficult to prove this claim in a general context. In the case of tight 2-designs in combinatorial 2-designs, there are great many tight 2-designs, i.e., symmetric 2-designs, while there are very few tight 2e-designs for e no less than 2. So, as a starting point, we concentrate our study to the existence problem of tight relative 2-designs, in particular on 2 shells, in binary Hamming association schemes H(n,2). We prove that every tight relative 2-designs on 2 shells in H(n,2) has the structure of coherent configuration. We determined all the possible parameters of coherent configurations attached to such tight relative 2-designs for n at most 30. Moreover for each of them we determined whether there exists such a tight relative 2-design or not, either by constructing them from symmetric 2-designs or Hadamard matrices, or theoretically showing the non-existence. In particular, we show that for n congruent to 6 (mod 8), there exist such tight relative 2-designs whose weight functions are not constant. These are the first examples of those with non-constant weight.

preprint 2013 arXiv

Simpler and Faster Lempel Ziv Factorization

We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77) factorization of a string in linear time, based on suffix arrays. Computational experiments on various data sets show that our approach constantly outperforms the currently fastest algorithm LZ OG (Ohlebusch and Gog 2011), and can be up to 2 to 3 times faster in the processing after obtaining the suffix array, while requiring the same or a little more space.

preprint 2013 arXiv

Space Efficient Linear Time Lempel-Ziv Factorization on Constant Size Alphabets

We present a new algorithm for computing the Lempel-Ziv Factorization (LZ77) of a given string of length $N$ in linear time, that utilizes only $N\log N + O(1)$ bits of working space, i.e., a single integer array, for constant size integer alphabets. This greatly improves the previous best space requirement for linear time LZ77 factorization (Kärkkäinen et al. CPM 2013), which requires two integer arrays of length $N$. Computational experiments show that despite the added complexity of the algorithm, the speed of the algorithm is only around twice as slow as previous fastest linear time algorithms.

preprint 2013 arXiv

Time and Space Efficient Lempel-Ziv Factorization based on Run Length Encoding

We propose a new approach for calculating the Lempel-Ziv factorization of a string, based on run length encoding (RLE). We present a conceptually simple off-line algorithm based on a variant of suffix arrays, as well as an on-line algorithm based on a variant of directed acyclic word graphs (DAWGs). Both algorithms run in $O(N+n\log n)$ time and O(n) extra space, where N is the size of the string, $n\leq N$ is the number of RLE factors. The time dependency on N is only in the conversion of the string to RLE, which can be computed very efficiently in O(N) time and O(1) extra space (excluding the output). When the string is compressible via RLE, i.e., $n = o(N)$, our algorithms are, to the best of our knowledge, the first algorithms which require only o(N) extra space while running in $o(N\log N)$ time.
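The conversion step mentioned above, from a string of length $N$ to its $n$ RLE factors, is a single left-to-right scan. A minimal sketch using the standard library follows; it materializes each run, so it does not meet the $O(1)$ extra space bound stated in the abstract, and the `(character, run length)` pair representation is an assumption.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the RLE of s as a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


print(run_length_encode("aaabccdd"))
# [('a', 3), ('b', 1), ('c', 2), ('d', 2)]
```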

preprint 2012 arXiv

Efficient LZ78 factorization of grammar compressed text

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - α)$ where $α\geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_σN)$ where $σ$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $σ$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small.

preprint 2012 arXiv

Speeding-up $q$-gram mining on grammar-based compressed texts

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = Ω(|T|/n)$.

preprint 2011 arXiv

Computing q-gram Frequencies on Collage Systems

Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on compressed string represented as a collage system, and present an $O((q+h\log n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all $q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size and height of the collage system.

preprint 2011 arXiv

Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the {\em non-overlapping frequencies} of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2n)$ time and $O(qn)$ space where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for $q=2$ in $O(n^4\log n)$ time and $O(n^3)$ space.

preprint 2011 arXiv

Fast $q$-gram Mining on SLP Compressed Strings

We present simple and efficient algorithms for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, we present an $O(qn)$ time and space algorithm that computes the occurrence frequencies of $q$-grams in $T$. Computational experiments show that our algorithm and its variation are practical for small $q$, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.
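One way to obtain an $O(qn)$ bound, sketched here as an illustration and not necessarily matching the paper's algorithm in detail, is to note that for $q \geq 2$ every $q$-gram occurrence crosses the boundary of exactly one rule $X \to YZ$ in the derivation tree. It then suffices to scan, for each rule, the length-$(q-1)$ suffix of $Y$ joined with the length-$(q-1)$ prefix of $Z$, weighted by how often $X$ appears in the derivation tree. The rule encoding (a list whose entries are single characters or pairs of earlier indices, with the last entry as start symbol) is an assumption.

```python
from collections import Counter


def qgram_frequencies_slp(rules, q):
    """q-gram counts of the string derived by the last rule, without expanding it."""
    n = len(rules)
    pre, suf = [""] * n, [""] * n  # prefix/suffix of length min(|X|, q - 1)
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule if q > 1 else ""
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:q - 1]
            joined = suf[l] + suf[r]
            suf[i] = joined[max(0, len(joined) - (q - 1)):]
    # vocc[i]: number of nodes labelled with rule i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]
    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # rule is unreachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
        else:
            l, r = rule
            t = suf[l] + pre[r]  # every length-q window of t crosses the boundary
            for j in range(len(t) - q + 1):
                counts[t[j:j + q]] += vocc[i]
    return counts


# Rules 0..4 derive a, b, ab, abab, ababab.
rules = ["a", "b", (0, 1), (2, 2), (3, 2)]
print(dict(qgram_frequencies_slp(rules, 2)))  # {'ab': 3, 'ba': 2}
```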

preprint 2011 arXiv

Restructuring Compressed Texts without Explicit Decompression

We consider the problem of {\em restructuring} compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length encoding, and some grammar based compression algorithms. These are the first algorithms that achieve running times polynomial in the size of the compressed input and output representations of $T$. Since most of the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case, than any algorithm which first decompresses the string for the conversion.
