Source author record

Solon P. Pissis

Solon P. Pissis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms · Artificial Intelligence · Formal Languages and Automata Theory · Neural and Evolutionary Computing

Catalog footprint

What is connected

17 works

4 topics

4 close collaborators

preprint · 2026 · arXiv

Text Indexing and Pattern Matching with Ephemeral Edits

A sequence $e_0,e_1,\ldots$ of edit operations in a string $T$ is called ephemeral if operation $e_i$ constructing string $T^i$, for all $i=2k$ with $k\in\mathbb{N}$, is reverted by operation $e_{i+1}$ that reconstructs $T$. Such a sequence arises when processing a stream of independent edits or testing hypothetical edits. We introduce text indexing with ephemeral substring edits, a new version of text indexing. Our goal is to design a data structure over a given text that supports subsequent pattern matching queries with ephemeral substring insertions, deletions, or substitutions in the text; we require insertions and substitutions to be of constant length. In particular, we preprocess a text $T=T[0\mathinner{.\,.} n)$ over an integer alphabet $Σ=[0,σ)$ with $σ=n^{\mathcal{O}(1)}$ in $\mathcal{O}(n)$ time. Then, we can preprocess any arbitrary pattern $P=P[0\mathinner{.\,.} m)$ given online in $\mathcal{O}(m\log\log m)$ time and $\mathcal{O}(m)$ space and allow any ephemeral sequence of edit operations in $T$. Before reverting the $i$th operation, we report all Occ occurrences of $P$ in $T^i$ in $\mathcal{O}(\log\log n + \text{Occ})$ time. We also introduce pattern matching with ephemeral edits. In particular, we preprocess two strings $T$ and $P$, each of length at most $n$, over an integer alphabet $Σ=[0,σ)$ with $σ=n^{\mathcal{O}(1)}$ in $\mathcal{O}(n)$ time. Then, we allow any ephemeral sequence of edit operations in $T$. Before reverting the $i$th operation, we report all Occ occurrences of $P$ in $T^i$ in the optimal $\mathcal{O}(\text{Occ})$ time. Along our way to this result, we also give an optimal solution for pattern matching with ephemeral block deletions.
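
The query model in this abstract can be illustrated with a naive baseline. The sketch below is not the indexed solution from the paper: it rebuilds the edited string and rescans it for every edit, so each query costs time linear in the text. The function names and the `(start, deleted_len, inserted)` edit encoding are assumptions made for this illustration.

```python
from typing import List, Tuple

def occurrences(text: str, pattern: str) -> List[int]:
    """Return all starting positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def apply_substring_edit(text: str, start: int, deleted_len: int, inserted: str) -> str:
    """Replace text[start:start+deleted_len] with `inserted`.

    deleted_len == 0 models an insertion, inserted == "" models a deletion,
    and anything else models a substitution.
    """
    return text[:start] + inserted + text[start + deleted_len:]

def ephemeral_queries(text: str, pattern: str,
                      edits: List[Tuple[int, int, str]]) -> List[List[int]]:
    """For each edit, report the occurrences of pattern in the edited text.

    The original text is never modified, so every edit is implicitly
    reverted before the next one is applied (the "ephemeral" property).
    """
    results = []
    for start, deleted_len, inserted in edits:
        edited = apply_substring_edit(text, start, deleted_len, inserted)
        results.append(occurrences(edited, pattern))
    return results
```

Because the original string is left untouched, the edits are independent of one another, which is the situation the abstract describes as testing hypothetical edits.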

preprint · 2022 · arXiv

Elastic-Degenerate String Matching with 1 Error

An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{ω-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $ω$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing algorithms run in $Ω(mN)$ time in the worst case. In this paper we show that $1$-EDSM can be solved in $\mathcal{O}((nm^2 + N)\log m)$ or $\mathcal{O}(nm^3 + N)$ time under edit distance. For the decision version, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm. We also show that $1$-EDSM can be solved in $\mathcal{O}(nm^2 + N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from $1$-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004].
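
To make the problem statement concrete, here is a brute-force sketch of the decision version of $1$-EDSM under Hamming distance. It assumes that an occurrence means the pattern matches, with at most one mismatch, a substring of some string obtained by picking one string from each set and concatenating them. It expands every such choice, so it is exponential in the number of sets and is in no way the algorithm from the paper. The function names and the list-of-lists encoding are assumptions for this illustration.

```python
from itertools import product
from typing import List

def hamming_at_most_one(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in at most one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def ed_contains_with_one_mismatch(ed_text: List[List[str]], pattern: str) -> bool:
    """Decide whether pattern occurs with <= 1 mismatch in the ED text.

    ed_text is a list of finite sets of strings (encoded as lists);
    the empty string is allowed as a set member.
    """
    m = len(pattern)
    for choice in product(*ed_text):          # one string from each set
        expanded = "".join(choice)
        for i in range(len(expanded) - m + 1):
            if hamming_at_most_one(expanded[i:i + m], pattern):
                return True
    return False
```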

preprint · 2022 · arXiv

Symbolic Regression is NP-hard

Symbolic regression (SR) is the task of learning a model of data in the form of a mathematical expression. By their nature, SR models have the potential to be accurate and human-interpretable at the same time. Unfortunately, finding such models, i.e., performing SR, appears to be a computationally intensive task. Historically, SR has been tackled with heuristics such as greedy or genetic algorithms and, while some works have hinted at the possible hardness of SR, no proof has yet been given that SR is, in fact, NP-hard. This begs the question: Is there an exact polynomial-time algorithm to compute SR models? We provide evidence suggesting that the answer is probably negative by showing that SR is NP-hard.

preprint · 2020 · arXiv

Circular Pattern Matching with $k$ Mismatches

The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic rotation of $P$. This is the circular pattern matching with $k$ mismatches ($k$-CPM) problem. A multitude of papers have been devoted to solving this problem but, to the best of our knowledge, only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for the $k$-CPM problem. Specifically, we show an $O(nk)$-time algorithm and an $O(n+\frac{n}{m}\,k^4)$-time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the $k$-mismatch problem [Bringmann et al., SODA 2019]. A preliminary version of this work appeared at FCT 2019. In this version we improve the time complexity of the main algorithm from $O(n+\frac{n}{m}\,k^5)$ to $O(n+\frac{n}{m}\,k^4)$.
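
The $k$-CPM problem defined above can be stated as a short brute-force routine. This sketch compares every length-$m$ window of the text against every rotation of the pattern, which takes $O(nm^2)$ time; it only illustrates the definition and is far from the $O(nk)$ and $O(n+\frac{n}{m}\,k^4)$ bounds of the paper. The function names are assumptions, and the pattern is assumed to be non-empty.

```python
from typing import Dict

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def circular_k_mismatch(text: str, pattern: str, k: int) -> Dict[int, int]:
    """Map each window start i to the minimal Hamming distance between
    text[i:i+m] and any cyclic rotation of pattern, keeping only
    windows whose minimal distance is at most k.
    """
    m = len(pattern)
    rotations = [pattern[r:] + pattern[:r] for r in range(m)]
    result = {}
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        best = min(hamming(window, rot) for rot in rotations)
        if best <= k:
            result[i] = best
    return result
```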

preprint · 2019 · arXiv

Combinatorial Algorithms for String Sanitization

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge. In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by "reversing" the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.

preprint · 2016 · arXiv

Average-Case Optimal Approximate Circular String Matching

Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time O(n(k + log m)/m). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using x and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.

preprint · 2016 · arXiv

Efficient Index for Weighted Sequences

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.
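
The matching rule with threshold $1/z$ can be checked directly, without any index. The sketch below multiplies the letter probabilities position by position and stops early once the product drops below $1/z$. It is a plain $O(n|P|)$ scan, not the $O(nz)$-sized index described in the abstract; the function name and the list-of-dictionaries encoding of the weighted text are assumptions for this illustration.

```python
from typing import Dict, List

def weighted_occurrences(weighted_text: List[Dict[str, float]],
                         pattern: str, z: float) -> List[int]:
    """Return positions i where pattern matches the weighted text,
    i.e. the product of the letter probabilities at positions
    i, ..., i+|P|-1 is at least 1/z.

    weighted_text[p] maps each letter to its probability at position p;
    letters missing from the dictionary have probability 0.
    """
    threshold = 1.0 / z
    m = len(pattern)
    hits = []
    for i in range(len(weighted_text) - m + 1):
        prob = 1.0
        for j, letter in enumerate(pattern):
            prob *= weighted_text[i + j].get(letter, 0.0)
            if prob < threshold:
                break  # the product can only shrink from here
        if prob >= threshold:
            hits.append(i)
    return hits
```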

preprint · 2016 · arXiv

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n α(n))$ time, which yields an $O(n α(n))$-time algorithm for computing runs.
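
For readers unfamiliar with the term, an LCE query asks for the length of the longest common prefix of two suffixes of the same string. The sketch below answers one query by direct letter comparison, in time proportional to the answer. It relies only on equality tests between letters, so it works over any alphabet, but it does not use the non-crossing property or achieve the $O(n α(n))$ total time discussed above. The function name is an assumption.

```python
from typing import Sequence

def lce(s: Sequence, i: int, j: int) -> int:
    """Length of the longest common prefix of s[i:] and s[j:],
    computed by comparing letters one at a time.
    """
    k = 0
    n = len(s)
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k
```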

preprint · 2016 · arXiv

Optimal Computation of Avoided Words

The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k>2$ is a $ρ$-avoided word in $x$ if $std(w) \leq ρ$, for a given threshold $ρ< 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $ρ$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a time-optimal $O(σn)$-time and $O(σn)$-space algorithm to compute all $ρ$-avoided words (of any length) in a sequence of length $n$ over an alphabet of size $σ$. Furthermore, we provide a tight asymptotic upper bound for the number of $ρ$-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.

preprint · 2016 · arXiv

Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem.

preprint · 2015 · arXiv

Fast Average-Case Pattern Matching on Weighted Sequences

A weighted string over an alphabet of size $σ$ is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices or uncertain sequences, naturally arise in many contexts. In this article, we study the problem of weighted string matching with a special focus on average-case analysis. Given a weighted pattern string $x$ of length $m$, a text string $y$ of length $n>m$, and a cumulative weight threshold $1/z$, defined as the minimal probability of occurrence of factors in a weighted string, we present an algorithm requiring average-case search time $o(n)$ for pattern matching for weight ratio $\frac{z}{m} < \min\{\frac{1}{\log z},\frac{\log σ}{\log z (\log m + \log \log σ)}\}$. For a pattern string $x$ of length $m$, a weighted text string $y$ of length $n>m$, and a cumulative weight threshold $1/z$, we present an algorithm requiring average-case search time $o(σn)$ for the same weight ratio. The importance of these results lies in the fact that these algorithms work in average-case sublinear search time in the size of the text, and in linear preprocessing time and space in the size of the pattern, for these ratios.

preprint · 2015 · arXiv

Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as $q$-gram distance, are usually computed in time linear with respect to the length of the sequences. In this article, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences.

preprint · 2015 · arXiv

Linear-Time Superbubble Identification Algorithm for Genome Assembly

DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose an O(n+m)-time algorithm to detect all superbubbles in a directed acyclic graph with n nodes and m (directed) edges, improving the best-known O(m log m)-time algorithm by Sung et al.

preprint · 2014 · arXiv

Linear-time Computation of Minimal Absent Words Using Suffix Array

An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. In this article, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays. Experimental results using real and synthetic data show that the respective implementation outperforms the one by Pinho et al.
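
The definition of a minimal absent word can be turned into a small brute-force enumerator. The sketch below stores every factor of y in a set, which takes quadratic space, so it is neither the suffix-automaton nor the suffix-array algorithm mentioned above. It uses the observation that all proper factors of a word w occur in y exactly when its longest proper prefix and longest proper suffix both occur. The function name and the explicit `alphabet` parameter are assumptions for this illustration.

```python
from typing import Iterable, List

def minimal_absent_words(y: str, alphabet: Iterable[str]) -> List[str]:
    """Enumerate all minimal absent words of y over the given alphabet.

    A word w = a + v is a minimal absent word when w does not occur in y
    while both v (its longest proper suffix) and w[:-1] (its longest
    proper prefix) do occur in y.
    """
    n = len(y)
    # All factors of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    letters = list(alphabet)
    maws = set()
    for v in factors:                 # v occurs in y by construction
        for a in letters:
            w = a + v
            if w not in factors and w[:-1] in factors:
                maws.add(w)
    return sorted(maws)
```

With this convention a letter of the alphabet that never appears in y counts as a minimal absent word of length 1, since its only proper factor is the empty word.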

preprint · 2013 · arXiv

Fast Algorithm for Partial Covers in Words

A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $α$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $α$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time which we apply to compute all shortest $α$-partial covers for a given $α$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $α$-partial cover for each $α=1,2,\ldots,n$.
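
The notion of an $α$-partial cover can be illustrated by brute force: for each factor, mark the positions covered by its occurrences and count them. The sketch below tries factor lengths in increasing order and returns the first non-empty batch. Its running time is polynomial but far from the $O(n\log n)$ bound in the abstract, and it does not build the data structure described there. The function names are assumptions.

```python
from typing import List

def covered_positions(w: str, u: str) -> int:
    """Number of positions of w lying within some occurrence of u."""
    covered = [False] * len(w)
    start = w.find(u)
    while start != -1:
        for p in range(start, start + len(u)):
            covered[p] = True
        start = w.find(u, start + 1)   # allow overlapping occurrences
    return sum(covered)

def shortest_partial_covers(w: str, alpha: int) -> List[str]:
    """All shortest factors of w that cover at least alpha positions."""
    n = len(w)
    for length in range(1, n + 1):
        factors = {w[i:i + length] for i in range(n - length + 1)}
        hits = sorted(u for u in factors if covered_positions(w, u) >= alpha)
        if hits:
            return hits
    return []
```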

preprint · 2013 · arXiv

Order-Preserving Suffix Trees and Their Algorithmic Applications

Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for a polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$-time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.
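
The idea of two sequences having the same "shape" can be made concrete by replacing each value with its rank among the distinct values of its window. The sketch below compares rank sequences window by window. It is a naive baseline, not the linear-time matcher or the suffix tree construction mentioned above, and its handling of equal values (equal values receive equal ranks) is an assumption of this illustration, as are the function names.

```python
from typing import List, Sequence, Tuple

def shape(seq: Sequence[float]) -> Tuple[int, ...]:
    """Replace every value by its rank among the distinct values of seq."""
    ranks = {v: r for r, v in enumerate(sorted(set(seq)))}
    return tuple(ranks[v] for v in seq)

def order_preserving_occurrences(text: Sequence[float],
                                 pattern: Sequence[float]) -> List[int]:
    """Start positions of windows of text whose relative order of
    values equals that of pattern.
    """
    m = len(pattern)
    target = shape(pattern)
    return [i for i in range(len(text) - m + 1)
            if shape(text[i:i + m]) == target]
```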

preprint · 2011 · arXiv

Efficient Seeds Computation Revisited

The notion of the cover is a generalization of a period of a string, and there are linear-time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity: it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, and no linear-time algorithm is known. In the paper we give linear-time algorithms for some of its versions: computing the shortest left-seed array, computing the longest left-seed array, and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We also describe a simpler alternative algorithm that efficiently computes the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$-time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).
