Source author record

Keisuke Goto

Keisuke Goto appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Databases

Catalog footprint

What is connected

10works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Computing NP-hard Repetitiveness Measures via MAX-SAT

Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors.

preprint2021arXiv

In-Place Initializable Arrays

An initializable array is an array that supports the read and write operations for any element and the initialization of the entire array. This paper proposes a simple in-place algorithm to implement an initializable array of length $N$ containing $\ell \in O(w)$ bits entries in $N \ell +1$ bits on the word RAM model with $w$ bits word size, i.e., the proposed array requires only 1 extra bit on top of a normal array of length $N$ containing $\ell$ bits entries. Our algorithm supports the all three operations in constant worst-case time, that is, it runs in-place using at most constant number of words $O(w)$ bits during each operation. The time and space complexities are optimal since it was already proven that there is no implementation of an initializable array with no extra bit supporting all the operations in constant worst-case time [Hagerup and Kammer, ISAAC 2017]. Our algorithm significantly improves upon the best algorithm presented in the earlier studies [Navarro, CSUR 2014] which uses $N + o(N)$ extra bits to support all the operations in constant worst-case time.

preprint2020arXiv

Efficient Constrained Pattern Mining Using Dynamic Item Ordering for Explainable Classification

Learning of interpretable classification models has been attracting much attention for the last few years. Discovery of succinct and contrasting patterns that can highlight the differences between the two classes is very important. Such patterns are useful for human experts, and can be used to construct powerful classifiers. In this paper, we consider mining of minimal emerging patterns from high-dimensional data sets under a variety of constraints in a supervised setting. We focus on an extension in which patterns can contain negative items that designate the absence of an item. In such a case, a database becomes highly dense, and it makes mining more challenging since popular pattern mining techniques such as fp-tree and occurrence deliver do not efficiently work. To cope with this difficulty, we present an efficient algorithm for mining minimal emerging patterns by combining two techniques: dynamic variable-ordering during pattern search for enhancing pruning effect, and the use of a pointer-based dynamic data structure, called dancing links, for efficiently maintaining occurrence lists. Experiments on benchmark data sets showed that our algorithm achieves significant speed-ups over emerging pattern mining approach based on LCM, a very fast depth-first frequent itemset miner using static variable-ordering.

preprint2013arXiv

Simpler and Faster Lempel Ziv Factorization

We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77) factorization of a string in linear time, based on suffix arrays. Computational experiments on various data sets show that our approach constantly outperforms the currently fastest algorithm LZ OG (Ohlebusch and Gog 2011), and can be up to 2 to 3 times faster in the processing after obtaining the suffix array, while requiring the same or a little more space.

preprint2013arXiv

Space Efficient Linear Time Lempel-Ziv Factorization on Constant~Size~Alphabets

We present a new algorithm for computing the Lempel-Ziv Factorization (LZ77) of a given string of length $N$ in linear time, that utilizes only $N\log N + O(1)$ bits of working space, i.e., a single integer array, for constant size integer alphabets. This greatly improves the previous best space requirement for linear time LZ77 factorization (Kärkkäinen et al. CPM 2013), which requires two integer arrays of length $N$. Computational experiments show that despite the added complexity of the algorithm, the speed of the algorithm is only around twice as slow as previous fastest linear time algorithms.

preprint2012arXiv

Speeding-up $q$-gram mining on grammar-based compressed texts

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = Ω(|T|/n)$.

preprint2011arXiv

Computing q-gram Frequencies on Collage Systems

Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on compressed string represented as a collage system, and present an $O((q+h\log n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all $q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size and height of the collage system.

preprint2011arXiv

Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the {\em non-overlapping frequencies} of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2n)$ time and $O(qn)$ space where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for $q=2$ in $O(n^4\log n)$ time and $O(n^3)$ space.

preprint2011arXiv

Fast $q$-gram Mining on SLP Compressed Strings

We present simple and efficient algorithms for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, we present an $O(qn)$ time and space algorithm that computes the occurrence frequencies of $q$-grams in $T$. Computational experiments show that our algorithm and its variation are practical for small $q$, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.

preprint2011arXiv

Restructuring Compressed Texts without Explicit Decompression

We consider the problem of {\em restructuring} compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length encoding, and some grammar based compression algorithms. These are the first algorithms that achieve running times polynomial in the size of the compressed input and output representations of $T$. Since most of the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case, than any algorithm which first decompresses the string for the conversion.

Keisuke Goto

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Computing NP-hard Repetitiveness Measures via MAX-SAT

In-Place Initializable Arrays

Efficient Constrained Pattern Mining Using Dynamic Item Ordering for Explainable Classification

Simpler and Faster Lempel Ziv Factorization

Space Efficient Linear Time Lempel-Ziv Factorization on Constant~Size~Alphabets

Speeding-up $q$-gram mining on grammar-based compressed texts

Computing q-gram Frequencies on Collage Systems

Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Fast $q$-gram Mining on SLP Compressed Strings

Restructuring Compressed Texts without Explicit Decompression