Source author record

Emanuele Giaquinta

Emanuele Giaquinta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Formal Languages and Automata Theory

Catalog footprint

What is connected

10works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2016arXiv

Technology Beats Algorithms (in Exact String Matching)

More than 120 algorithms have been developed for exact string matching within the last 40 years. We show by experiments that the \naive{} algorithm exploiting SIMD instructions of modern CPUs (with symbols compared in a special order) is the fastest one for patterns of length up to about 50 symbols and extremely good for longer patterns and small alphabets. The algorithm compares 16 or 32 characters in parallel by applying SSE2 or AVX2 instructions, respectively. Moreover, it uses loop peeling to further speed up the searching phase. We tried several orders for comparisons of pattern symbols and the increasing order of their probabilities in the text was the best.

preprint2015arXiv

Longest common substrings with k mismatches

The longest common substring with $k$-mismatches problem is to find, given two strings $S_1$ and $S_2$, a longest substring $A_1$ of $S_1$ and $A_2$ of $S_2$ such that the Hamming distance between $A_1$ and $A_2$ is $\le k$. We introduce a practical $O(nm)$ time and $O(1)$ space solution for this problem, where $n$ and $m$ are the lengths of $S_1$ and $S_2$, respectively. This algorithm can also be used to compute the matching statistics with $k$-mismatches of $S_1$ and $S_2$ in $O(nm)$ time and $O(m)$ space. Moreover, we also present a theoretical solution for the $k = 1$ case which runs in $O(n \log m)$ time, assuming $m\le n$, and uses $O(m)$ space, improving over the existing $O(nm)$ time and $O(m)$ space bound of Babenko and Starikovskaya.

preprint2014arXiv

Alternative Algorithms for Lyndon Factorization

We present two variations of Duval's algorithm for computing the Lyndon factorization of a word. The first algorithm is designed for the case of small alphabets and is able to skip a significant portion of the characters of the string, for strings containing runs of the smallest character in the alphabet. Experimental results show that it is faster than Duval's original algorithm, more than ten times in the case of long DNA strings. The second algorithm computes, given a run-length encoded string $R$ of length $ρ$, the Lyndon factorization of $R$ in $O(ρ)$ time and constant space.

preprint2014arXiv

Motif matching using gapped patterns

We present new algorithms for the problem of multiple string matching of gapped patterns, where a gapped pattern is a sequence of strings such that there is a gap of fixed length between each two consecutive strings. The problem has applications in the discovery of transcription factor binding sites in DNA sequences when using generalized versions of the Position Weight Matrix model to describe transcription factor specificities. In these models a motif can be matched as a set of gapped patterns with unit-length keywords. The existing algorithms for matching a set of gapped patterns are worst-case efficient but not practical, or vice versa, in this particular case. The novel algorithms that we present are based on dynamic programming and bit-parallelism, and lie in a middle-ground among the existing algorithms. In fact, their time complexity is close to the best existing bound and, yet, they are also practical. We also provide experimental results which show that the presented algorithms are fast in practice, and preferable if all the strings in the patterns have unit-length.

preprint2014arXiv

Run-Length Encoded Nondeterministic KMP and Suffix Automata

We present a novel bit-parallel representation, based on the run-length encoding, of the nondeterministic KMP and suffix automata for a string $P$ with at least two distinct symbols. Our method is targeted to the case of long strings over small alphabets and complements the method of Cantone et al. (2012), which is effective for long strings over large alphabets. Our encoding requires $O((σ+ m)\lceil ρ/ w\rceil)$ space and allows one to simulate the automata on a string in time $O(\lceil ρ/ w\rceil)$ per transition, where $σ$ is the alphabet size, $m$ is the length of $P$, $ρ$ is the length of the run-length encoding of $P$ and $w$ is the machine word size in bits. The input string can be given in either unencoded or run-length encoded form.

preprint2013arXiv

Approximate pattern matching with k-mismatches in packed text

Given strings $P$ of length $m$ and $T$ of length $n$ over an alphabet of size $σ$, the string matching with $k$-mismatches problem is to find the positions of all the substrings in $T$ that are at Hamming distance at most $k$ from $P$. If $T$ can be read only one character at the time the best known bounds are $O(n\sqrt{k\log k})$ and $O(n + n\sqrt{k/w}\log k)$ in the word-RAM model with word length $w$. In the RAM models (including $AC^0$ and word-RAM) it is possible to read up to $\floor{w / \log σ}$ characters in constant time if the characters of $T$ are encoded using $\ceil{\log σ}$ bits. The only solution for $k$-mismatches in packed text works in $O((n \logσ/\log n)\ceil{m \log (k + \log n / \logσ) / w} + n^{\varepsilon})$ time, for any $\varepsilon > 0$. We present an algorithm that runs in time $O(\frac{n}{\floor{w/(m\logσ)}} (1 + \log \min(k,σ) \log m / \logσ))$ in the $AC^0$ model if $m=O(w / \logσ)$ and $T$ is given packed. We also describe a simpler variant that runs in time $O(\frac{n}{\floor{w/(m\logσ)}}\log \min(m, \log w / \logσ))$ in the word-RAM model. The algorithms improve the existing bound for $w = Ω(\log^{1+ε}n)$, for any $ε> 0$. Based on the introduced technique, we present algorithms for several other approximate matching problems.

preprint2013arXiv

Indexes for Jumbled Pattern Matching in Strings, Trees and Graphs

We consider how to index strings, trees and graphs for jumbled pattern matching when we are asked to return a match if one exists. For example, we show how, given a tree containing two colours, we can build a quadratic-space index with which we can find a match in time proportional to the size of the match. We also show how we need only linear space if we are content with approximate matches.

preprint2013arXiv

New algorithms for binary jumbled pattern matching

Given a pattern $P$ and a text $T$, both strings over a binary alphabet, the binary jumbled string matching problem consists in telling whether any permutation of $P$ occurs in $T$. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are $O(n^2/\log^2 n)$ (with O(n) space and O(1) query time) and $O(r^2\log r)$ (with O(|L|) space and $O(\log|L|)$ query time), where $r$ is the length of the run-length encoding of $T$ and $|L| = O(n)$ is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. that obtains a trade-off between the space and the time complexity. It has $O(r^2\log k + n/k)$ complexity to build the index, $O(\log k)$ query time, and uses $O(n/k + |L|)$ space, where $k$ is a parameter. The second result is an $O(n^2 \log^2 w / w)$ algorithm (with O(n) space and O(1) query time), based on word-level parallelism where $w$ is the word size in bits.

preprint2013arXiv

On a compact encoding of the swap automaton

Given a string $P$ of length $m$ over an alphabet $Σ$ of size $σ$, a swapped version of $P$ is a string derived from $P$ by a series of local swaps, i.e., swaps of adjacent symbols, such that each symbol can participate in at most one swap. We present a theoretical analysis of the nondeterministic finite automaton for the language $\bigcup_{P'\inΠ_P}Σ^*P'$ (swap automaton for short), where $Π_P$ is the set of swapped versions of $P$. Our study is based on the bit-parallel simulation of the same automaton due to Fredriksson, and reveals an interesting combinatorial property that links the automaton to the one for the language $Σ^*P$. By exploiting this property and the method presented by Cantone et al. (2010), we obtain a bit-parallel encoding of the swap automaton which takes $O(σ^2\ceil{k/w})$ space and allows one to simulate the automaton on a string of length $n$ in time $O(n\ceil{k/w})$, where $\ceil{m/σ}\le k\le m$.

preprint2010arXiv

String Matching with Inversions and Translocations in Linear Average Time (Most of the Time)

We present an efficient algorithm for finding all approximate occurrences of a given pattern $p$ of length $m$ in a text $t$ of length $n$ allowing for translocations of equal length adjacent factors and inversions of factors. The algorithm is based on an efficient filtering method and has an $\bigO(nm\max(α, β))$-time complexity in the worst case and $\bigO(\max(α, β))$-space complexity, where $α$ and $β$ are respectively the maximum length of the factors involved in any translocation and inversion. Moreover we show that under the assumptions of equiprobability and independence of characters our algorithm has a $\bigO(n)$ average time complexity, whenever $σ= Ω(\log m / \log\log^{1-ε} m)$, where $ε> 0$ and $σ$ is the dimension of the alphabet. Experiments show that the new proposed algorithm achieves very good results in practical cases.

Emanuele Giaquinta

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Technology Beats Algorithms (in Exact String Matching)

Longest common substrings with k mismatches

Alternative Algorithms for Lyndon Factorization

Motif matching using gapped patterns

Run-Length Encoded Nondeterministic KMP and Suffix Automata

Approximate pattern matching with k-mismatches in packed text

Indexes for Jumbled Pattern Matching in Strings, Trees and Graphs

New algorithms for binary jumbled pattern matching

On a compact encoding of the swap automaton

String Matching with Inversions and Translocations in Linear Average Time (Most of the Time)