Source author record

Shunsuke Inenaga

Shunsuke Inenaga appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms · Discrete Mathematics · Databases · math.CO · Formal Languages and Automata Theory

Catalog footprint

What is connected

37 works

5 topics

4 close collaborators

preprint 2025 arXiv

Faster and Simpler Online Computation of String Net Frequency

An occurrence of a repeated substring $u$ in a string $S$ is called a net occurrence if extending the occurrence to the left or to the right decreases the number of occurrences to 1. The net frequency (NF) of a repeated substring $u$ in a string $S$ is the number of net occurrences of $u$ in $S$. Very recently, Guo et al. [SPIRE 2024] proposed an online $O(n \log σ)$-time algorithm that maintains a data structure of $O(n)$ space which answers Single-NF queries in $O(m\log σ+ σ^2)$ time and reports all answers of the All-NF problem in $O(nσ^2)$ time. Here, $n$ is the length of the input string $S$, $m$ is the query pattern length, and $σ$ is the alphabet size. The $σ^2$ term is a major drawback of their method since computing string net frequencies was originally motivated by Chinese language processing, where $σ$ can be in the thousands. This paper presents an improved online $O(n \log σ)$-time algorithm, which answers Single-NF queries in $O(m \log σ)$ time and reports all answers to the All-NF problem in output-optimal $O(|\mathsf{NF}^+(S)|)$ time, where $\mathsf{NF}^+(S)$ is the set of substrings of $S$ paired with their positive NF values. We note that $|\mathsf{NF}^+(S)| = O(n)$ always holds. In contrast to Guo et al.'s algorithm that is based on Ukkonen's suffix tree construction, our algorithm is based on Weiner's suffix tree construction.
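
To make the definition of net frequency concrete, here is a brute-force Python sketch that counts net occurrences directly. It is not the online algorithm of the paper and runs in polynomial rather than near-linear time. The function names are hypothetical, and the sketch assumes that an occurrence touching either end of $S$ is treated as unique on that side, a boundary convention the abstract does not state.

```python
def count_occurrences(s: str, u: str) -> int:
    """Number of (possibly overlapping) occurrences of u in s."""
    return sum(1 for i in range(len(s) - len(u) + 1) if s.startswith(u, i))


def net_frequency(s: str, u: str) -> int:
    """Brute-force net frequency of u in s (illustration only).

    Assumption: an occurrence at the very start (end) of s is treated as
    unique after a left (right) extension, since no extension exists there.
    """
    if not u or count_occurrences(s, u) < 2:
        return 0  # only repeated substrings have net occurrences
    m = len(u)
    nf = 0
    for i in range(len(s) - m + 1):
        if not s.startswith(u, i):
            continue
        # Left extension s[i-1 .. i+m-1] must occur exactly once.
        left_unique = i == 0 or count_occurrences(s, s[i - 1:i + m]) == 1
        # Right extension s[i .. i+m] must occur exactly once.
        right_unique = i + m == len(s) or count_occurrences(s, s[i:i + m + 1]) == 1
        if left_unique and right_unique:
            nf += 1
    return nf
```

For example, in `"abcabd"` both occurrences of `"ab"` are net occurrences, while `"a"` has none because its right extension `"ab"` is repeated.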

preprint 2025 arXiv

Subsequence Matching and LCS under Cartesian-Tree Equivalence

Two strings of the same length are said to Cartesian-tree match (CT-match) if their Cartesian trees are isomorphic [Park et al., TCS 2020]. Cartesian-tree matching is a natural model that allows for capturing similarities of numerical sequences. Oizumi et al. [CPM 2022] showed that subsequence pattern matching under the CT-matching model (CT-MSeq) can be solved in $O(nm \log \log n)$ time, where $n$ and $m$ are text and pattern lengths, respectively. This article follows this line of research, and gives the following new results: (1) An $O(nm)$-time CT-MSeq algorithm for binary alphabets; (2) An $O((nm)^{1-ε})$-time conditional lower bound for the CT-MSeq problem on alphabets of size 4, for any constant $ε> 0$, under the Orthogonal Vectors Hypothesis (OVH). Further, we introduce the new problem of longest common subsequence under CT-matching (CT-LCS) for two given strings $S$ and $T$ of length $n$, and present the following results: (3) An $O(n^6)$-time CT-LCS algorithm for general ordered alphabets; (4) An $O(n^2 / \log n)$-time CT-LCS algorithm for binary alphabets; (5) An $O(n^{2-ε})$-time conditional lower bound for the CT-LCS problem on alphabets of size 5, for any constant $ε> 0$, under OVH.

preprint 2022 arXiv

A faster reduction of the dynamic time warping distance to the longest increasing subsequence length

The similarity between a pair of time series, i.e., sequences of indexed values in time order, is often estimated by the dynamic time warping (DTW) distance, instead of any in the well-studied family of measures including the longest common subsequence (LCS) length and the edit distance. Although it may seem as if the DTW and the LCS(-like) measures are essentially different, we reveal that the DTW distance can be represented by the longest increasing subsequence (LIS) length of a sequence of integers, which is the LCS length between the integer sequence and itself sorted. For a given pair of time series of length $n$ such that the dissimilarity between any elements is an integer between zero and $c$, we propose an integer sequence that represents any substring-substring DTW distance as its band-substring LIS length. The length of the produced integer sequence is $O(c n^2)$, which can be translated to $O(n^2)$ for constant dissimilarity functions. To demonstrate that techniques developed under the LCS(-like) measures are directly applicable to analysis of time series via our reduction of DTW to LIS, we present time-efficient algorithms for DTW-related problems utilizing the semi-local sequence comparison technique developed for LCS-related problems.

preprint 2022 arXiv

Cartesian Tree Subsequence Matching

Park et al. [TCS 2020] observed that the similarity between two (numerical) strings can be captured by the Cartesian trees: The Cartesian tree of a string is a binary tree recursively constructed by picking the smallest value of the string as the root of the tree. Two strings of equal length are said to Cartesian-tree match if their Cartesian trees are isomorphic. Park et al. [TCS 2020] introduced the following Cartesian tree substring matching (CTMStr) problem: Given a text string $T$ of length $n$ and a pattern string $P$ of length $m$, find every consecutive substring $S = T[i..j]$ of $T$ such that $S$ and $P$ Cartesian-tree match. They showed how to solve this problem in $\tilde{O}(n+m)$ time. In this paper, we introduce the Cartesian tree subsequence matching (CTMSeq) problem, that asks to find every minimal substring $S = T[i..j]$ of $T$ such that $S$ contains a subsequence $S'$ which Cartesian-tree matches $P$. We prove that the CTMSeq problem can be solved efficiently, in $O(m n p(n))$ time, where $p(n)$ denotes the update/query time for dynamic predecessor queries. By using a suitable dynamic predecessor data structure, we obtain an $O(mn \log \log n)$-time and $O(n \log m)$-space solution for CTMSeq. This contrasts CTMSeq with closely related order-preserving subsequence matching (OPMSeq) which was shown to be NP-hard by Bose et al. [IPL 1998].
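
The recursive definition of the Cartesian tree can be sketched in a few lines of Python. This is a naive quadratic-time illustration of Cartesian-tree matching for two whole sequences, not the subsequence matching algorithm of the paper. The names `ct_shape` and `ct_match` are hypothetical, and ties between equal minima are assumed to be broken in favour of the leftmost position.

```python
def ct_shape(seq):
    """Shape of the Cartesian tree of seq as nested (left, right) tuples.

    The root is the smallest value (assumption: leftmost on ties); the left
    and right subtrees are built recursively from the two sides. An empty
    sequence is represented by None.
    """
    if not seq:
        return None
    root = min(range(len(seq)), key=lambda i: seq[i])  # leftmost minimum
    return (ct_shape(seq[:root]), ct_shape(seq[root + 1:]))


def ct_match(x, y) -> bool:
    """True if x and y have equal length and identical Cartesian-tree shapes."""
    return len(x) == len(y) and ct_shape(x) == ct_shape(y)
```

For instance, `[1, 3, 2]` and `[5, 9, 7]` match because in both the minimum comes first and the smaller of the remaining two values comes last, whereas `[1, 2, 3]` and `[3, 2, 1]` do not.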

preprint 2022 arXiv

Combinatorics of minimal absent words for a sliding window

A string $w$ is called a minimal absent word (MAW) for another string $T$ if $w$ does not occur in $T$ but the proper substrings of $w$ occur in $T$. For example, let $Σ= \{\mathtt{a, b, c}\}$ be the alphabet. Then, the set of MAWs for string $w = \mathtt{abaab}$ is $\{\mathtt{aaa, aaba, bab, bb, c}\}$. In this paper, we study combinatorial properties of MAWs in the sliding window model, namely, how the set of MAWs changes when a sliding window of fixed length $d$ is shifted over the input string $T$ of length $n$, where $1 \leq d < n$. We present \emph{tight} upper and lower bounds on the maximum number of changes in the set of MAWs for a sliding window over $T$, both in the cases of general alphabets and binary alphabets. Our bounds improve on the previously known best bounds [Crochemore et al., 2020].
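
The example in the abstract can be reproduced with a small brute-force sketch in Python. It enumerates all substrings explicitly, so it is only suitable for short strings and has nothing to do with the sliding-window bounds studied in the paper. The function name `minimal_absent_words` is hypothetical.

```python
def minimal_absent_words(t: str, alphabet: str) -> set:
    """Brute-force set of minimal absent words of t over the given alphabet.

    A word w = left + c is a MAW when w does not occur in t while both
    w[:-1] (= left) and w[1:] occur in t; every proper substring of w is a
    substring of one of these two, so checking them suffices.
    """
    # All substrings of t, including the empty string.
    subs = {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}
    maws = set()
    for left in subs:
        for c in alphabet:
            w = left + c
            if w not in subs and w[1:] in subs:
                maws.add(w)
    return maws


# The example from the abstract: alphabet {a, b, c}, string abaab.
print(sorted(minimal_absent_words("abaab", "abc")))
```

The printed set agrees with the one given in the abstract, namely aaa, aaba, bab, bb and c.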

preprint 2022 arXiv

Minimal Absent Words on Run-Length Encoded Strings

A string $w$ is called a minimal absent word (MAW) for another string $T$ if $w$ does not occur (as a substring) in $T$ and any proper substring of $w$ occurs in $T$. State-of-the-art data structures for reporting the set $\mathsf{MAW}(T)$ of MAWs from a given string $T$ of length $n$ require $O(n)$ space, can be built in $O(n)$ time, and can report all MAWs in $O(|\mathsf{MAW}(T)|)$ time upon a query. This paper initiates the problem of computing MAWs from a compressed representation of a string. In particular, we focus on the most basic compressed representation of a string, run-length encoding (RLE), which represents each maximal run of the same characters $a$ by $a^p$ where $p$ is the length of the run. Let $m$ be the RLE-size of string $T$. After categorizing the MAWs into five disjoint sets $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$, $\mathcal{M}_4$, $\mathcal{M}_5$ using RLE, we present matching upper and lower bounds for the number of MAWs in $\mathcal{M}_i$ for $i = 1,2,4,5$ in terms of RLE-size $m$, except for $\mathcal{M}_3$ whose size is unbounded by $m$. We then present a compact $O(m)$-space data structure that can report all MAWs in optimal $O(|\mathsf{MAW}(T)|)$ time.

preprint 2022 arXiv

RePair Grammars are the Smallest Grammars for Fibonacci Words

Grammar-based compression is a lossless data compression scheme that represents a given string $w$ by a context-free grammar that generates only $w$. While computing the smallest grammar which generates a given string $w$ is NP-hard in general, a number of polynomial-time grammar-based compressors which work well in practice have been proposed. RePair, proposed by Larsson and Moffat in 1999, is a grammar-based compressor which recursively replaces all possible occurrences of a most frequently occurring bigram in the string. Since there can be multiple choices of the most frequent bigrams to replace, different implementations of RePair can result in different grammars. In this paper, we show that the smallest grammars generating the Fibonacci words $F_k$ can be completely characterized by RePair, where $F_k$ denotes the $k$-th Fibonacci word. Namely, all grammars for $F_k$ generated by any implementation of RePair are the smallest grammars for $F_k$, and no other grammars can be the smallest for $F_k$. To the best of our knowledge, Fibonacci words are the first non-trivial infinite family of strings for which RePair is optimal.

preprint 2022 arXiv

Towards a complete perspective on labeled tree indexing: new size bounds, efficient constructions, and beyond

A labeled tree (or a trie) is a natural generalization of a string, which can also be seen as a compact representation of a set of strings. This paper considers the labeled tree indexing problem, and provides a number of new results on space bound analysis, and on algorithms for efficient construction and pattern matching queries. Kosaraju [FOCS 1989] was the first to consider the labeled tree indexing problem, and he proposed the suffix tree for a backward trie, where the strings in the trie are read in the leaf-to-root direction. In contrast to a backward trie, we call a usual trie a forward trie. Despite a few follow-up works after Kosaraju's paper, indexing forward/backward tries is not well understood yet. In this paper, we show a full perspective on the sizes of indexing structures such as suffix trees, DAWGs, CDAWGs, suffix arrays, affix trees, affix arrays for forward and backward tries. Some of them take $O(n)$ space in the size $n$ of the input trie, while the others can occupy $O(n^2)$ space in the worst case. In particular, we show that the size of the DAWG for a forward trie with $n$ nodes is $Ω(σn)$, where $σ$ is the number of distinct characters in the trie. This becomes $Ω(n^2)$ for an alphabet of size $σ= Θ(n)$. Still, we show that there is a compact $O(n)$-space implicit representation of the DAWG for a forward trie, whose space requirement is independent of the alphabet size. This compact representation allows for simulating each DAWG edge traversal in $O(\log σ)$ time, and can be constructed in $O(n)$ time and space over any integer alphabet of size $O(n)$. In addition, this readily extends to the first indexing structure that permits bidirectional pattern searches over a trie within linear space in the input trie size.

preprint 2021 arXiv

Computing longest palindromic substring after single-character or block-wise edits

Palindromes are important objects in strings which have been extensively studied from combinatorial, algorithmic, and bioinformatics points of view. It is known that the length of the longest palindromic substrings (LPSs) of a given string $T$ of length $n$ can be computed in $O(n)$ time by Manacher's algorithm [J. ACM '75]. In this paper, we consider the problem of finding the LPS after the string is edited. We present an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\log (\min \{σ, \log n\}))$ time after a single character substitution, insertion, or deletion, where $σ$ denotes the number of distinct characters appearing in $T$. We also propose an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\ell + \log \log n)$ time, after an existing substring in $T$ is replaced by a string of arbitrary length $\ell$.
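
For reference, the static baseline mentioned above, Manacher's algorithm, can be sketched in Python as follows. This covers only the $O(n)$-time computation for an unedited string, not the edit-query data structures of the paper. The function name is hypothetical, and the sketch assumes the separator character `#` does not occur in the input.

```python
def lps_length(s: str) -> int:
    """Length of a longest palindromic substring of s (Manacher's algorithm).

    Assumption: '#' does not occur in s. Separators are interleaved so that
    odd- and even-length palindromes are handled uniformly.
    """
    t = "#" + "#".join(s) + "#"
    n = len(t)
    radius = [0] * n      # radius[i]: palindrome radius around centre i in t
    centre = right = 0    # centre and right end of the rightmost palindrome
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position, clipped at `right`.
            radius[i] = min(right - i, radius[2 * centre - i])
        # Extend the palindrome around i as far as possible.
        while (i - radius[i] - 1 >= 0 and i + radius[i] + 1 < n
               and t[i - radius[i] - 1] == t[i + radius[i] + 1]):
            radius[i] += 1
        if i + radius[i] > right:
            centre, right = i, i + radius[i]
    # A radius in t equals the palindrome length in the original string.
    return max(radius)
```

On `"banana"` the longest palindromic substring is `"anana"`, so the function returns 5.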

preprint 2021 arXiv

The Parameterized Suffix Tray

Let $Σ$ and $Π$ be disjoint alphabets, respectively called the static alphabet and the parameterized alphabet. Two strings $x$ and $y$ over $Σ\cup Π$ of equal length are said to parameterized match (p-match) if there exists a renaming bijection $f$ on $Σ$ and $Π$ which is identity on $Σ$ and maps the characters of $x$ to those of $y$ so that the two strings become identical. The indexing version of the problem of finding p-matching occurrences of a given pattern in the text is a well-studied topic in string matching. In this paper, we present a state-of-the-art indexing structure for p-matching called the parameterized suffix tray of an input text $T$, denoted by $\mathsf{PSTray}(T)$. We show that $\mathsf{PSTray}(T)$ occupies $O(n)$ space and supports pattern matching queries in $O(m + \log (σ+π) + \mathit{occ})$ time, where $n$ is the length of $T$, $m$ is the length of a query pattern $P$, $π$ is the number of distinct symbols of $Π$ in $T$, $σ$ is the number of distinct symbols of $Σ$ in $T$ and $\mathit{occ}$ is the number of p-matching occurrences of $P$ in $T$. We also present how to build $\mathsf{PSTray}(T)$ in $O(n)$ time from the parameterized suffix tree of $T$.
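
A common way to test whether two strings p-match is to compare their prev-encodings, in which each parameterized character is replaced by the distance to its previous occurrence (or 0 if there is none) while static characters are kept. The Python sketch below illustrates this test only; it is not the parameterized suffix tray. The function names and the way the static alphabet is passed are assumptions made for the example.

```python
def prev_encode(x: str, static: str) -> list:
    """Prev-encoding of x: static characters are kept as they are, and each
    parameterized character becomes the distance to its previous occurrence
    (0 for a first occurrence)."""
    last_seen = {}
    encoded = []
    for i, c in enumerate(x):
        if c in static:
            encoded.append(c)
        else:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
    return encoded


def p_match(x: str, y: str, static: str) -> bool:
    """True if x and y p-match, decided by comparing prev-encodings."""
    return prev_encode(x, static) == prev_encode(y, static)
```

With static alphabet `"ab"`, the strings `"aXbYX"` and `"aYbXY"` p-match (swap `X` and `Y`), whereas `"aXbXX"` and `"aXbYX"` do not.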

preprint 2020 arXiv

Detecting $k$-(Sub-)Cadences and Equidistant Subsequence Occurrences

The equidistant subsequence pattern matching problem is considered. Given a pattern string $P$ and a text string $T$, we say that $P$ is an \emph{equidistant subsequence} of $T$ if $P$ is a subsequence of the text such that consecutive symbols of $P$ in the occurrence are equally spaced. We can consider the problem of equidistant subsequences as generalizations of (sub-)cadences. We give bit-parallel algorithms that yield $o(n^2)$ time algorithms for finding $k$-(sub-)cadences and equidistant subsequences. Furthermore, $O(n\log^2 n)$ and $O(n\log n)$ time algorithms, respectively for equidistant and Abelian equidistant matching for the case $|P| = 3$, are shown. The algorithms make use of a technique that was recently introduced which can efficiently compute convolutions with linear constraints.

preprint 2020 arXiv

Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings

For a string $S$, a palindromic substring $S[i..j]$ is said to be a \emph{shortest unique palindromic substring} ($\mathit{SUPS}$) for an interval $[s, t]$ in $S$, if $S[i..j]$ occurs exactly once in $S$, the interval $[i, j]$ contains $[s, t]$, and every palindromic substring containing $[s, t]$ which is shorter than $S[i..j]$ occurs at least twice in $S$. In this paper, we study the problem of answering $\mathit{SUPS}$ queries on run-length encoded strings. We show how to preprocess a given run-length encoded string $\mathit{RLE}_{S}$ of size $m$ in $O(m)$ space and $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time so that all $\mathit{SUPSs}$ for any subsequent query interval can be answered in $O(\sqrt{\log m / \log\log m} + α)$ time, where $α$ is the number of outputs, and $σ_{\mathit{RLE}_{S}}$ is the number of distinct runs of $\mathit{RLE}_{S}$. Additionally, we consider a variant of the SUPS problem where a query interval is also given in a run-length encoded form. For this variant of the problem, we present two alternative algorithms with faster queries. The first one answers queries in $O(\sqrt{\log\log m /\log\log\log m} + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time, and the second one answers queries in $O(\log \log m + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}})$ time. Both of these data structures require $O(m)$ space.

preprint 2020 arXiv

Faster STR-EC-LCS Computation

The longest common subsequence (LCS) problem is a central problem in stringology that finds the longest common subsequence of two given strings $A$ and $B$. More recently, a set of four constrained LCS problems (called generalized constrained LCS problem) were proposed by Chen and Chao [J. Comb. Optim, 2011]. In this paper, we consider the substring-excluding constrained LCS (STR-EC-LCS) problem. A string $Z$ is said to be an STR-EC-LCS of two given strings $A$ and $B$ excluding $P$ if $Z$ is one of the longest common subsequences of $A$ and $B$ that does not contain $P$ as a substring. Wang et al. proposed a dynamic programming solution which computes an STR-EC-LCS in $O(mnr)$ time and space where $m = |A|, n = |B|, r = |P|$ [Inf. Process. Lett., 2013]. In this paper, we show a new solution for the STR-EC-LCS problem. Our algorithm computes an STR-EC-LCS in $O(n|Σ| + (L+1)(m-L+1)r)$ time where $Σ$ denotes the set of distinct characters occurring in both $A$ and $B$, with $|Σ| \leq \min\{m, n\}$, and $L$ is the length of the STR-EC-LCS. This algorithm is faster than the $O(mnr)$-time algorithm for short/long STR-EC-LCS (namely, $L \in O(1)$ or $m-L \in O(1)$), and is at least as efficient as the $O(mnr)$-time algorithm for all cases.

preprint 2020 arXiv

Grammar-compressed Self-index with Lyndon Words

We introduce a new class of straight-line programs (SLPs), named the Lyndon SLP, inspired by the Lyndon trees (Barcelo, 1990). Based on this SLP, we propose a self-index data structure of $O(g)$ words of space that can be built from a string $T$ in $O(n \lg n)$ expected time, retrieving the starting positions of all occurrences of a pattern $P$ of length $m$ in $O(m + \lg m \lg n + occ \lg g)$ time, where $n$ is the length of $T$, $g$ is the size of the Lyndon SLP for $T$, and $occ$ is the number of occurrences of $P$ in $T$.

preprint 2020 arXiv

Longest Square Subsequence Problem Revisited

The longest square subsequence (LSS) problem consists of computing a longest subsequence of a given string $S$ that is a square, i.e., a longest subsequence of form $XX$ appearing in $S$. It is known that an LSS of a string $S$ of length $n$ can be computed using $O(n^2)$ time [Kosowski 2004], or with (model-dependent) polylogarithmic speed-ups using $O(n^2 (\log \log n)^2 / \log^2 n)$ time [Tiskin 2013]. We present the first algorithm for LSS whose running time depends on other parameters, i.e., we show that an LSS of $S$ can be computed in $O(r \min\{n, M\}\log \frac{n}{r} + n + M \log n)$ time with $O(M)$ space, where $r$ is the length of an LSS of $S$ and $M$ is the number of matching points on $S$.
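
As a point of comparison, the LSS length can be computed naively by trying every split point of $S$ and taking the LCS of the two sides, since the two copies of $X$ in a square subsequence $XX$ can always be separated by some split. The Python sketch below does exactly that in $O(n^3)$ time; it is neither the $O(n^2)$ method cited above nor the parameterized algorithm of the paper, and the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (textbook DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lss_length(s: str) -> int:
    """Length of a longest square subsequence XX of s, by naive splitting."""
    best = max((lcs_length(s[:i], s[i:]) for i in range(1, len(s))), default=0)
    return 2 * best
```

For `"abcab"` the split `ab | cab` yields the common subsequence `ab`, so the longest square subsequence `abab` has length 4.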

preprint 2020 arXiv

On repetitiveness measures of Thue-Morse words

We show that the size $γ(t_n)$ of the smallest string attractor of the $n$th Thue-Morse word $t_n$ is 4 for any $n\geq 4$, disproving the conjecture by Mantaci et al. [ICTCS 2019] that it is $n$. We also show that $δ(t_n) = \frac{10}{3+2^{4-n}}$ for $n \geq 3$, where $δ(w)$ is the maximum over all $k = 1,\ldots,|w|$, the number of distinct substrings of length $k$ in $w$ divided by $k$, which is a measure of repetitiveness recently studied by Kociumaka et al. [LATIN 2020]. Furthermore, we show that the number $z(t_n)$ of factors in the self-referencing Lempel-Ziv factorization of $t_n$ is exactly $2n$.
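
The closed form for $δ(t_n)$ can be checked on small cases with a brute-force Python sketch using exact rational arithmetic. The sketch assumes the indexing convention $t_0 = \mathtt{a}$ and $t_n = t_{n-1}\overline{t_{n-1}}$, so that $|t_n| = 2^n$; this convention is not stated in the abstract. The function names are hypothetical.

```python
from fractions import Fraction


def thue_morse(n: int) -> str:
    """n-th Thue-Morse word, assuming t_0 = 'a' and t_n = t_{n-1} + complement."""
    w = "a"
    swap = str.maketrans("ab", "ba")
    for _ in range(n):
        w += w.translate(swap)
    return w


def delta(w: str) -> Fraction:
    """Maximum over k of (number of distinct length-k substrings of w) / k."""
    return max(
        Fraction(len({w[i:i + k] for i in range(len(w) - k + 1)}), k)
        for k in range(1, len(w) + 1)
    )


def delta_closed_form(n: int) -> Fraction:
    """The value 10 / (3 + 2^(4-n)) stated in the abstract for n >= 3."""
    return Fraction(10) / (3 + Fraction(2) ** (4 - n))


for n in (3, 4):
    print(n, delta(thue_morse(n)), delta_closed_form(n))
```

Under this convention the brute-force value and the closed form agree for $n = 3$ (both equal 2) and $n = 4$ (both equal $5/2$).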

preprint 2020 arXiv

Pointer-Machine Algorithms for Fully-Online Construction of Suffix Trees and DAWGs on Multiple Strings

We deal with the problem of maintaining the suffix tree indexing structure for a fully-online collection of multiple strings, where a new character can be prepended to any string in the collection at any time. The only previously known algorithm for the problem, recently proposed by Takagi et al. [Algorithmica 82(5): 1346-1377 (2020)], runs in $O(N \log σ)$ time and $O(N)$ space on the word RAM model, where $N$ denotes the total length of the strings and $σ$ denotes the alphabet size. Their algorithm makes heavy use of the nearest marked ancestor (NMA) data structure on semi-dynamic trees, that can answer queries and supports insertion of nodes in $O(1)$ amortized time on the word RAM model. In this paper, we present a simpler fully-online right-to-left algorithm that builds the suffix tree for a given string collection in $O(N (\log σ+ \log d))$ time and $O(N)$ space, where $d$ is the maximum number of in-coming Weiner links to a node of the suffix tree. We note that $d$ is bounded by the height of the suffix tree, which is further bounded by the length of the longest string in the collection. The advantage of this new algorithm is that it works on the pointer machine model, namely, it does not use the complicated NMA data structures that involve table look-ups. As a byproduct, we also obtain a pointer-machine algorithm for building the directed acyclic word graph (DAWG) for a fully-online left-to-right collection of multiple strings, which runs in $O(N (\log σ+ \log d))$ time and $O(N)$ space again without the aid of the NMA data structures.

preprint 2020 arXiv

Space-Efficient Algorithms for Computing Minimal/Shortest Unique Substrings

Given a string $T$ of length $n$, a substring $u = T[i..j]$ of $T$ is called a shortest unique substring (SUS) for an interval $[s,t]$ if (a) $u$ occurs exactly once in $T$, (b) $u$ contains the interval $[s,t]$ (i.e. $i \leq s \leq t \leq j$), and (c) every substring $v$ of $T$ with $|v| < |u|$ containing $[s,t]$ occurs at least twice in $T$. Given a query interval $[s, t] \subset [1, n]$, the interval SUS problem is to output all the SUSs for the interval $[s,t]$. In this article, we propose a $4n + o(n)$ bits data structure answering an interval SUS query in output-sensitive $O(\mathit{occ})$ time, where $\mathit{occ}$ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for $s = t$. Here, we propose a $\lceil (\log_2{3} + 1)n \rceil + o(n)$ bits data structure answering a point SUS query in the same output-sensitive time. We also propose space-efficient algorithms for computing the minimal unique substrings of $T$.

preprint 2020 arXiv

Towards Efficient Interactive Computation of Dynamic Time Warping Distance

The dynamic time warping (DTW) is a widely-used method that allows us to efficiently compare two time series that can vary in speed. Given two strings $A$ and $B$ of respective lengths $m$ and $n$, there is a fundamental dynamic programming algorithm that computes the DTW distance for $A$ and $B$ together with an optimal alignment in $Θ(mn)$ time and space. In this paper, we tackle the problem of interactive computation of the DTW distance for dynamic strings, denoted $\mathrm{D^2TW}$, where character-wise edit operation (insertion, deletion, substitution) can be performed at an arbitrary position of the strings. Let $M$ and $N$ be the sizes of the run-length encoding (RLE) of $A$ and $B$, respectively. We present an algorithm for $\mathrm{D^2TW}$ that occupies $Θ(mN+nM)$ space and uses $O(m+n+\#_{\mathrm{chg}}) \subseteq O(mN + nM)$ time to update a compact differential representation $\mathit{DS}$ of the DP table per edit operation, where $\#_{\mathrm{chg}}$ denotes the number of cells in $\mathit{DS}$ whose values change after the edit operation. Our method is at least as efficient as the algorithm recently proposed by Froese et al. running in $Θ(mN + nM)$ time, and is faster when $\#_{\mathrm{chg}}$ is smaller than $O(mN + nM)$ which, as our preliminary experiments suggest, is likely to be the case in the majority of instances.
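
The fundamental $Θ(mn)$ dynamic program mentioned above can be sketched in Python as follows. This is only the static textbook recurrence, not the dynamic $\mathrm{D^2TW}$ algorithm of the paper. The sketch assumes numeric sequences with the absolute difference as the element-wise cost, and the function name is hypothetical.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two non-empty numeric sequences.

    Assumption: the cost of aligning a[i] with b[j] is |a[i] - b[j]|.
    d[i][j] holds the DTW distance between a[:i] and b[:j].
    """
    m, n = len(a), len(b)
    d = [[math.inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[m][n]
```

For example, `[1, 2, 3]` and `[1, 2, 2, 3]` have DTW distance 0 because the repeated 2 can be absorbed by warping.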

preprint 2016 arXiv

A hardness result and new algorithm for the longest common palindromic subsequence problem

The 2-LCPS problem, first introduced by Chowdhury et al. [Fundam. Inform., 129(4):329-340, 2014], asks one to compute (the length of) a longest palindromic common subsequence between two given strings $A$ and $B$. We show that the 2-LCPS problem is at least as hard as the well-studied longest common subsequence problem for four strings (the 4-LCS problem). Then, we present a new algorithm which solves the 2-LCPS problem in $O(σM^2 + n)$ time, where $n$ denotes the length of $A$ and $B$, $M$ denotes the number of matching positions between $A$ and $B$, and $σ$ denotes the number of distinct characters occurring in both $A$ and $B$. Our new algorithm is faster than Chowdhury et al.'s sparse algorithm when $σ= o(\log^2n \log\log n)$.

preprint 2016 arXiv

Deterministic sub-linear space LCE data structures with efficient construction

Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq τ\leq n$, their best deterministic solution is a data structure of size $O(n/τ)$ which allows LCE queries to be answered in $O(τ)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(τ\min\{\logτ,\log\frac{n}τ\})$ query time using $O(n/τ)$ space, but significantly improves the construction time to $O(nτ)$.

preprint 2016 arXiv

Dynamic index and LZ factorization in compressed space

In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space.

preprint 2016 arXiv

Dynamic index, LZ factorization, and LCE queries in compressed space

In this paper, we present the following results: (1) We propose a new \emph{dynamic compressed index} of $O(w)$ space, that supports searching for a pattern $P$ in the current text in $O(|P| f(M,w) + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $N$ is the length of the current text, $M$ is the maximum length of the dynamic text, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of the current text, $f(a,b) = O(\min \{ \frac{\log\log a \log\log b}{\log\log\log a}, \sqrt{\frac{\log b}{\log\log b}} \})$ and $w = O(z \log N \log^*M)$. (2) We propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f(N,w') + z \log w' \log^3 N (\log^* N)^2)$ time with $O(w')$ working space, where $w' =O(z \log N \log^* N)$. (3) We propose a data structure of $O(w)$ space which supports longest common extension (LCE) queries on the text in $O(\log N + \log \ell \log^* N)$ time, where $\ell$ is the output LCE length. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

preprint 2016 arXiv

Fully dynamic data structure for LCE queries in compressed space

A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of suffixes starting at two given positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, can support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: After processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y+ \log N\log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

preprint 2015 arXiv

Constructing LZ78 Tries and Position Heaps in Linear Time for Large Alphabets

We present the first worst-case linear-time algorithm to compute the Lempel-Ziv 78 factorization of a given string over an integer alphabet. Our algorithm is based on nearest marked ancestor queries on the suffix tree of the given string. We also show that the same technique can be used to construct the position heap of a set of strings in worst-case linear time, when the set of strings is given as a trie.
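
For readers unfamiliar with the LZ78 factorization itself, the following Python sketch computes it with a trie stored in a hash map. It relies on hashing, so it is not the worst-case linear-time suffix-tree-based algorithm of the paper. The function name and the representation of trie nodes as integer ids are choices made for this example.

```python
def lz78_factorize(s: str) -> list:
    """LZ78 factorization of s as a list of factor strings.

    Each factor is the longest previous factor that matches at the current
    position, extended by one character. The final factor may be an existing
    factor if the string ends before a new character is read.
    """
    trie = {}        # (node id, character) -> child node id; root has id 0
    next_id = 1
    factors = []
    node, start = 0, 0
    for i, c in enumerate(s):
        child = trie.get((node, c))
        if child is None:
            # No matching edge: create a new trie node and close the factor.
            trie[(node, c)] = next_id
            next_id += 1
            factors.append(s[start:i + 1])
            node, start = 0, i + 1
        else:
            node = child
    if start < len(s):
        factors.append(s[start:])  # trailing, possibly repeated, factor
    return factors
```

For example, `"abababab"` is factorized as `a`, `b`, `ab`, `aba`, `b`.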

preprint 2015 arXiv

Efficiently Finding All Maximal $α$-gapped Repeats

For $α\geq 1$, an $α$-gapped repeat in a word $w$ is a factor $uvu$ of $w$ such that $|uv|\leq α|u|$; the two factors $u$ in such a repeat are called arms, while the factor $v$ is called the gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In this paper we show that the number of maximal $α$-gapped repeats that may occur in a word is upper bounded by $18αn$. This allows us to construct an algorithm finding all the maximal $α$-gapped repeats of a word in $O(αn)$ time; this is optimal, in the worst case, as there are words that have $Θ(αn)$ maximal $α$-gapped repeats. Our techniques can be extended to get comparable results in the case of $α$-gapped palindromes, i.e., factors $uvu^\mathrm{T}$ with $|uv|\leq α|u|$.

preprint 2013 arXiv

Computing convolution on grammar-compressed text

The convolution between a text string $S$ of length $N$ and a pattern string $P$ of length $m$ can be computed in $O(N \log m)$ time by FFT. It is known that various types of approximate string matching problems are reducible to convolution. In this paper, we assume that the input text string is given in a compressed form, as a \emph{straight-line program (SLP)}, which is a context free grammar in the Chomsky normal form that derives a single string. Given an SLP $\mathcal{S}$ of size $n$ describing a text $S$ of length $N$, and an uncompressed pattern $P$ of length $m$, we present a simple $O(nm \log m)$-time algorithm to compute the convolution between $S$ and $P$. We then show that this can be improved to $O(\min\{nm, N-α\} \log m)$ time, where $α\geq 0$ is a value that represents the amount of redundancy that the SLP captures with respect to the length-$m$ substrings. The key of the improvement is our new algorithm that computes the convolution between a trie of size $r$ and a pattern string $P$ of length $m$ in $O(r \log m)$ time.

preprint2013arXiv

Detecting regularities on grammar-compressed strings

We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size $n$ that represents a string $s$ of length $N$, our algorithm computes all runs and squares in $s$ in $O(n^3h)$ time and $O(n^2)$ space, where $h$ is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped-palindromes in $O(n^3h + gnh\log N)$ time and $O(n^2)$ space, where $g$ is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in $O(n^2 h)$ time and $O(nh(n+\log^2 N))$ time, respectively.

preprint2013arXiv

Efficient Lyndon factorization of grammar compressed text

We present an algorithm for computing the Lyndon factorization of a string that is given in grammar compressed form, namely, a Straight Line Program (SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $m$ is the size of the Lyndon factorization, $n$ is the size of the SLP, and $h$ is the height of the derivation tree of the SLP. Since the length of the decompressed string can be exponentially large w.r.t. $n, m$ and $h$, our result is the first polynomial time solution when the string is given as SLP.
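For reference, on an uncompressed string the Lyndon factorization can be computed in linear time with Duval's algorithm. The sketch below is that standard uncompressed baseline, not the SLP-based algorithm of the paper; the function name is an assumption.

```python
def lyndon_factorize(s):
    """Duval's algorithm: split s into a non-increasing sequence
    of Lyndon words."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j] remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # still a single Lyndon word, restart comparison
            else:
                k += 1         # periodic continuation
            j += 1
        # Emit each full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

On `banana` this yields the factors `b`, `an`, `an`, `a`.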

preprint2013arXiv

Faster Compact On-Line Lempel-Ziv Factorization

We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\logσ)$ bits of working space, where $N$ is the length of the string and $σ$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.
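Several variants of the Lempel-Ziv factorization exist. The brute-force sketch below assumes the variant in which each factor is the longest prefix of the remaining suffix that also occurs starting at an earlier position (overlap allowed), or a single character if there is none. It is only meant to make the output concrete and uses neither DAWGs nor compact space; the function name is an assumption.

```python
def lz_factorize(s):
    """Naive Lempel-Ziv factorization (self-referencing variant)."""
    n = len(s)
    factors = []
    i = 0
    while i < n:
        best = 0
        for j in range(i):                 # candidate earlier start positions
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)                # a fresh character forms its own factor
        factors.append(s[i:i + step])
        i += step
    return factors
```

On `abababab` this produces `a`, `b`, `ababab`: the third factor copies from position 0 and overlaps itself.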

preprint2013arXiv

Time and Space Efficient Lempel-Ziv Factorization based on Run Length Encoding

We propose a new approach for calculating the Lempel-Ziv factorization of a string, based on run length encoding (RLE). We present a conceptually simple off-line algorithm based on a variant of suffix arrays, as well as an on-line algorithm based on a variant of directed acyclic word graphs (DAWGs). Both algorithms run in $O(N+n\log n)$ time and $O(n)$ extra space, where $N$ is the size of the string and $n\leq N$ is the number of RLE factors. The time dependency on $N$ is only in the conversion of the string to RLE, which can be computed very efficiently in $O(N)$ time and $O(1)$ extra space (excluding the output). When the string is compressible via RLE, i.e., $n = o(N)$, our algorithms are, to the best of our knowledge, the first algorithms which require only $o(N)$ extra space while running in $o(N\log N)$ time.
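The RLE conversion step mentioned in the abstract is a single left-to-right scan. A minimal sketch, with the function name and the (character, run length) output format chosen here as assumptions:

```python
from itertools import groupby

def run_length_encode(s):
    """Return the RLE of s as a list of (character, run_length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]
```

For `aaabbc` the result has $n = 3$ RLE factors for a string of length $N = 6$.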

preprint2012arXiv

Efficient LZ78 factorization of grammar compressed text

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - α)$ where $α\geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_σN)$ where $σ$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $σ$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small.
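As a point of comparison, the LZ78 factorization of an uncompressed string can be computed with a trie of previous factors: each new factor is the longest previous factor that prefixes the remaining text, extended by one character. The sketch below is that textbook baseline, not the SLP algorithm of the paper; the function name and the dictionary-based trie encoding are assumptions.

```python
def lz78_factorize(s):
    """LZ78 factorization using a trie stored as {(node, char): child}."""
    trie = {}
    factors = []
    node = 0          # 0 is the trie root (the empty factor)
    next_id = 1
    start = 0         # start of the factor currently being read
    for i, ch in enumerate(s):
        key = (node, ch)
        if key in trie:
            node = trie[key]            # keep following an existing factor
        else:
            trie[key] = next_id         # new factor = old factor + ch
            next_id += 1
            factors.append(s[start:i + 1])
            start = i + 1
            node = 0
    if start < len(s):
        factors.append(s[start:])       # last factor may repeat an earlier one
    return factors
```

On `abababab` the factors are `a`, `b`, `ab`, `aba`, `b`, so $m = 5$.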

preprint2012arXiv

Speeding-up $q$-gram mining on grammar-based compressed texts

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = Ω(|T|/n)$.

preprint2011arXiv

Computing q-gram Frequencies on Collage Systems

Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on compressed string represented as a collage system, and present an $O((q+h\log n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all $q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size and height of the collage system.

preprint2011arXiv

Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2n)$ time and $O(qn)$ space where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for $q=2$ in $O(n^4\log n)$ time and $O(n^3)$ space.

preprint2011arXiv

Fast $q$-gram Mining on SLP Compressed Strings

We present simple and efficient algorithms for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, we present an $O(qn)$ time and space algorithm that computes the occurrence frequencies of $q$-grams in $T$. Computational experiments show that our algorithm and its variation are practical for small $q$, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.
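One way to count $q$-grams without decompressing is to note that every occurrence is either a single terminal (when $q = 1$) or crosses the boundary between the two children of exactly one lowest variable in the derivation tree. The sketch below follows that idea, keeping a length-$(q-1)$ prefix and suffix per variable and weighting boundary $q$-grams by how often the variable occurs in the derivation. It is an illustration under these assumptions and not necessarily the authors' exact procedure; the rule encoding and function names are hypothetical.

```python
from collections import Counter

def slp_qgram_frequencies(rules, q):
    """rules[i] is a single character, or a pair (left, right) of earlier
    rule indices; the last rule derives the whole text."""
    if not rules:
        return Counter()
    k, n = q - 1, len(rules)
    pre, suf = [""] * n, [""] * n      # first/last min(k, length) characters
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule[:k]
        else:
            left, right = rule
            pre[i] = (pre[left] + pre[right])[:k]
            joined = suf[left] + suf[right]
            suf[i] = joined[max(0, len(joined) - k):]
    # vocc[i]: number of occurrences of variable i in the derivation tree
    vocc = [0] * n
    vocc[n - 1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            left, right = rules[i]
            vocc[left] += vocc[i]
            vocc[right] += vocc[i]
    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
        else:
            left, right = rule
            window = suf[left] + pre[right]   # every q-gram here crosses the boundary
            for j in range(len(window) - q + 1):
                counts[window[j:j + q]] += vocc[i]
    return counts

def expand(rules):
    """Decompress the SLP (used only to cross-check small examples)."""
    text = []
    for rule in rules:
        text.append(rule if isinstance(rule, str) else text[rule[0]] + text[rule[1]])
    return text[-1]
```

With `rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]`, which derives `abaababa`, the 2-gram counts are `ab`: 3, `ba`: 3, `aa`: 1, matching a direct scan of the expanded text.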

preprint2011arXiv

Restructuring Compressed Texts without Explicit Decompression

We consider the problem of restructuring compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length encoding, and some grammar based compression algorithms. These are the first algorithms that achieve running times polynomial in the size of the compressed input and output representations of $T$. Since most of the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case than any algorithm which first decompresses the string for the conversion.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Data Structures and Algorithms, Discrete Mathematics, Databases, math.CO, Formal Languages and Automata Theory

Source provenance

Where this author record came from

arxiv, confidence 95%

external id: arxiv:2410.06837:author:1:shunsuke-inenaga

Imported May 21, 2026; synced May 21, 2026

arxiv, confidence 95%

external id: arxiv:2402.19146:author:6:shunsuke-inenaga

Imported May 21, 2026; synced May 21, 2026

27 works

Hideo Bannai

26 works

Masayuki Takeda

14 works

Yuto Nakashima

9 works

Tomohiro I

37 published items

Faster and Simpler Online Computation of String Net Frequency

Subsequence Matching and LCS under Cartesian-Tree Equivalence

A faster reduction of the dynamic time warping distance to the longest increasing subsequence length

Cartesian Tree Subsequence Matching

Combinatorics of minimal absent words for a sliding window

Minimal Absent Words on Run-Length Encoded Strings

RePair Grammars are the Smallest Grammars for Fibonacci Words

Towards a complete perspective on labeled tree indexing: new size bounds, efficient constructions, and beyond

Computing longest palindromic substring after single-character or block-wise edits

The Parameterized Suffix Tray

Detecting $k$-(Sub-)Cadences and Equidistant Subsequence Occurrences

Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings

Faster STR-EC-LCS Computation

Grammar-compressed Self-index with Lyndon Words

Longest Square Subsequence Problem Revisited

On repetitiveness measures of Thue-Morse words

Pointer-Machine Algorithms for Fully-Online Construction of Suffix Trees and DAWGs on Multiple Strings

Space-Efficient Algorithms for Computing Minimal/Shortest Unique Substrings

Towards Efficient Interactive Computation of Dynamic Time Warping Distance

A hardness result and new algorithm for the longest common palindromic subsequence problem

Deterministic sub-linear space LCE data structures with efficient construction

Dynamic index and LZ factorization in compressed space

Dynamic index, LZ factorization, and LCE queries in compressed space

Fully dynamic data structure for LCE queries in compressed space

Constructing LZ78 Tries and Position Heaps in Linear Time for Large Alphabets

Efficiently Finding All Maximal $α$-gapped Repeats

Computing convolution on grammar-compressed text

Detecting regularities on grammar-compressed strings

Efficient Lyndon factorization of grammar compressed text

Faster Compact On-Line Lempel-Ziv Factorization

Time and Space Efficient Lempel-Ziv Factorization based on Run Length Encoding

Efficient LZ78 factorization of grammar compressed text

Speeding-up $q$-gram mining on grammar-based compressed texts

Computing q-gram Frequencies on Collage Systems

Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Fast $q$-gram Mining on SLP Compressed Strings

Restructuring Compressed Texts without Explicit Decompression