Source author record

Ryo Yoshinaka

Ryo Yoshinaka appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms; Formal Languages and Automata Theory; Information Retrieval; Machine Learning

Catalog footprint

What is connected

7 works

4 topics

4 close collaborators


Preprint · 2026 · arXiv

Learning Deterministic Finite-State Machines from the Prefixes of a Single String is NP-Complete

It is well known that computing a minimum DFA consistent with a given set of positive and negative examples is NP-hard. Previous work has identified conditions on the input sample under which the problem becomes tractable or remains hard. In this paper, we study the computational complexity of the case where the input sample is prefix-closed. This formulation is equivalent to computing a minimum Moore machine consistent with observations along its runs. We show that the problem is NP-hard to approximate when the sample set consists of all prefixes of binary strings. Furthermore, we show that the problem remains NP-hard as a decision problem even when the sample set consists of the prefixes of a single binary string. Our argument also extends to the corresponding problem for Mealy machines.

Preprint · 2022 · arXiv

Computing the Parameterized Burrows--Wheeler Transform Online

Parameterized strings are a generalization of strings in that their characters are drawn from two different alphabets, where one is considered to be the alphabet of static characters and the other to be the alphabet of parameter characters. Two parameterized strings are a parameterized match if there is a bijection over all characters such that the bijection transforms one string to the other while keeping the static characters (i.e., it behaves as the identity on the static alphabet). Ganguly et al. [SODA 2017] proposed the parameterized Burrows--Wheeler transform (pBWT) as a variant of the Burrows--Wheeler transform for space-efficient parameterized pattern matching. In this paper, we propose an algorithm for computing the pBWT online by reading the characters of a given input string one-by-one from right to left. Our algorithm works in $O(|Π| \log n / \log \log n)$ amortized time for each input character, where $n$ and $Π$ denote the size of the input string and the alphabet of the parameter characters, respectively.
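The definition of a parameterized match above can be illustrated with a small check. This is only a sketch of the matching relation itself, not the pBWT or the online algorithm from the paper; the function name `is_parameterized_match` and the `static_alphabet` argument are hypothetical names introduced for illustration.

```python
def is_parameterized_match(x, y, static_alphabet):
    """Return True if x and y are a parameterized match.

    Characters in static_alphabet must be identical at each position;
    all other characters are treated as parameter characters and must
    be related by a consistent one-to-one renaming.
    """
    if len(x) != len(y):
        return False
    forward = {}   # parameter character of x -> parameter character of y
    backward = {}  # parameter character of y -> parameter character of x
    for a, b in zip(x, y):
        if a in static_alphabet or b in static_alphabet:
            # Static characters must be preserved exactly.
            if a != b:
                return False
            continue
        # Parameter characters must map consistently in both directions,
        # so that the renaming can be extended to a bijection.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


static = set("ab")
print(is_parameterized_match("xaybx", "yazby", static))  # x->y, y->z is consistent
print(is_parameterized_match("xay", "zaz", static))      # x and y both map to z
```

The two dictionaries enforce injectivity in both directions, which is enough because a partial one-to-one renaming over a finite parameter alphabet can always be completed to a bijection.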

Preprint · 2022 · arXiv

Parallel algorithm for pattern matching problems under substring consistent equivalence relations

Given a text and a pattern over an alphabet, the pattern matching problem searches for all occurrences of the pattern in the text. An equivalence relation $\approx$ is called a substring consistent equivalence relation (SCER) if, for two strings $X$ and $Y$, $X \approx Y$ implies $|X| = |Y|$ and $X[i:j] \approx Y[i:j]$ for all $1 \le i \le j \le |X|$. In this paper, we propose an efficient parallel algorithm for pattern matching under any SCER using the "duel-and-sweep" paradigm. For a pattern of length $m$ and a text of length $n$, our algorithm runs in $O(ξ_m^\mathrm{t} \log^2 m)$ time and $O(ξ_m^\mathrm{w} \cdot n \log^2 m)$ work, with $O(τ_n^\mathrm{t} + ξ_m^\mathrm{t} \log^2 m)$ time and $O(τ_n^\mathrm{w} + ξ_m^\mathrm{w} \cdot m \log^2 m)$ work preprocessing on the Priority Concurrent Read Concurrent Write Parallel Random-Access Machine (P-CRCW PRAM).

Preprint · 2020 · arXiv

Computing Covers under Substring Consistent Equivalence Relations

Covers are a kind of quasiperiodicity in strings. A string $C$ is a cover of another string $T$ if every position of $T$ is inside some occurrence of $C$ in $T$. The shortest and longest cover arrays of $T$ store the lengths of the shortest and longest covers of each prefix of $T$, respectively. The literature has proposed linear-time algorithms that compute the longest and shortest cover arrays taking border arrays as input. An equivalence relation $\approx$ over strings is called a substring consistent equivalence relation (SCER) iff $X \approx Y$ implies (1) $|X| = |Y|$ and (2) $X[i:j] \approx Y[i:j]$ for all $1 \le i \le j \le |X|$. In this paper, we generalize the notion of covers to SCERs and prove that the existing algorithms for computing the shortest cover array and the longest cover array of a string $T$ under the identity relation also work for any SCER when given the correspondingly generalized border arrays.
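As a concrete reading of the cover definition, the following brute-force sketch checks covers and builds a shortest cover array under the identity relation only. It is not one of the linear-time, border-array-based algorithms mentioned in the abstract, and the function names `is_cover` and `shortest_cover_array` are hypothetical.

```python
def is_cover(c, t):
    """True if every position of t lies inside some occurrence of c in t."""
    m, n = len(c), len(t)
    if m == 0 or m > n:
        return False
    covered_up_to = 0  # positions t[0:covered_up_to] are covered so far
    for i in range(n - m + 1):
        if t[i:i + m] == c:
            if i > covered_up_to:
                return False  # a gap before this occurrence stays uncovered
            covered_up_to = i + m
    return covered_up_to == n


def shortest_cover_array(t):
    """For each prefix of t, the length of its shortest cover (brute force)."""
    result = []
    for k in range(1, len(t) + 1):
        prefix = t[:k]
        # A cover must start at position 0, so only prefixes are candidates.
        # The prefix itself is always a cover, so the search terminates.
        result.append(next(l for l in range(1, k + 1)
                           if is_cover(prefix[:l], prefix)))
    return result


print(shortest_cover_array("ababa"))  # [1, 2, 3, 2, 3]
```

For "ababa", the prefix "abab" is covered by "ab" and the whole string by "aba", which gives the last two entries of the array.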

Preprint · 2020 · arXiv

Fast and linear-time string matching algorithms based on the distances of $q$-gram occurrences

Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the string matching problem is a task to find all occurrences of $P$ in $T$. In this study, we propose an algorithm that solves this problem in $O((n + m)q)$ time considering the distance between two adjacent occurrences of the same $q$-gram contained in $P$. We also propose a theoretical improvement of it which runs in $O(n + m)$ time, though it is not necessarily faster in practice. We compare the execution times of our and existing algorithms on various kinds of real and artificial datasets such as an English text, a genome sequence and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in a text.

Preprint · 2020 · arXiv

Grammar compression with probabilistic context-free grammar

We propose a new approach for universal lossless text compression, based on grammar compression. In the literature, a target string $T$ has been compressed as a context-free grammar $G$ in Chomsky normal form satisfying $L(G) = \{T\}$. Such a grammar is often called a \emph{straight-line program} (SLP). In this paper, we consider a probabilistic grammar $G$ that generates $T$, but not necessarily as a unique element of $L(G)$. In order to recover the original text $T$ unambiguously, we keep both the grammar $G$ and the derivation tree of $T$ from the start symbol in $G$, in compressed form. We show some simple evidence that our proposal is indeed more efficient than SLPs for certain texts, both from theoretical and practical points of view.
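To make the notion of a straight-line program concrete, here is a minimal sketch that expands a grammar in Chomsky normal form generating exactly one string. It illustrates only the classical SLP setting described above, not the probabilistic grammar proposed in the paper; the dictionary encoding of rules and the name `expand_slp` are assumptions made for this example.

```python
def expand_slp(rules, start):
    """Expand a straight-line program into the single string it generates.

    rules maps each nonterminal to either a one-character terminal string
    or a pair (left, right) of nonterminals, as in Chomsky normal form.
    """
    cache = {}  # nonterminal -> expanded string, so shared rules expand once

    def expand(symbol):
        if symbol not in cache:
            rhs = rules[symbol]
            if isinstance(rhs, str):
                cache[symbol] = rhs
            else:
                left, right = rhs
                cache[symbol] = expand(left) + expand(right)
        return cache[symbol]

    return expand(start)


# Hypothetical example grammar: X3 -> X2 X1, X2 -> X1 X1, X1 -> A B.
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X1"),
}
print(expand_slp(rules, "X3"))  # ababab
```

Because each nonterminal has exactly one rule, the derivation is deterministic and the grammar alone identifies $T$. When $L(G)$ may contain other strings, the grammar no longer pins down $T$ by itself, which is why the abstract also keeps the derivation tree.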

Preprint · 2016 · arXiv

Micro-Clustering: Finding Small Clusters in Large Diversity

We address an unsupervised soft-clustering problem called micro-clustering. The aim is to enumerate all groups of records that are strongly related to each other, whereas standard clustering methods separate records at sparse parts. Formulating micro-clustering is non-trivial. Clique mining in a similarity graph is a typical approach, but it yields a huge number of cliques, many of which are similar to one another. We propose a new concept, data polishing. The huge number of solutions can be attributed to groups that are not clear in the data; that is, the boundaries of the groups are blurred by noise, uncertainty, false or missing entries, and so on. Data polishing clarifies the groups by perturbing the data. Specifically, dense subgraphs that would correspond to clusters are replaced by cliques. The clusters then appear as maximal cliques, so the number of maximal cliques is drastically reduced. We also propose an efficient algorithm applicable even to large-scale data. Computational experiments showed the efficiency of our algorithm: the number of solutions is small (e.g., 1,000), the members of each group are deeply related, and the computation time is short.
