Researcher profile

Carl Barton

Carl Barton contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - Baseline
5works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2016arXiv

Average-Case Optimal Approximate Circular String Matching

Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time O(n(k + log m)/m). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using x and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.

preprint2016arXiv

Efficient Index for Weighted Sequences

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.

preprint2016arXiv

On the Average-case Complexity of Pattern Matching with Wildcards

Pattern matching with wildcards is the problem of finding all factors of a text $t$ of length $n$ that match a pattern $x$ of length $m$, where wildcards (characters that match everything) may be present. In this paper we present a number of fast average-case algorithms for pattern matching where wildcards are restricted to either the pattern or the text, however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the \textit{average-case} complexity of these algorithms and show the first non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which, as a by product, provide with first provable separation in complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.

preprint2015arXiv

Fast Average-Case Pattern Matching on Weighted Sequences

A weighted string over an alphabet of size $σ$ is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices or uncertain sequences, naturally arise in many contexts. In this article, we study the problem of weighted string matching with a special focus on average-case analysis. Given a weighted pattern string $x$ of length $m$, a text string $y$ of length $n>m$, and a cumulative weight threshold $1/z$, defined as the minimal probability of occurrence of factors in a weighted string, we present an algorithm requiring average-case search time $o(n)$ for pattern matching for weight ratio $\frac{z}{m} < \min\{\frac{1}{\log z},\frac{\log σ}{\log z (\log m + \log \log σ)}\}$. For a pattern string $x$ of length $m$, a weighted text string $y$ of length $n>m$, and a cumulative weight threshold $1/z$, we present an algorithm requiring average-case search time $o(σn)$ for the same weight ratio. The importance of these results lies on the fact that these algorithms work in average-case sublinear search time in the size of the text, and in linear preprocessing time and space in the size of the pattern, for these ratios.

preprint2014arXiv

Linear-time Computation of Minimal Absent Words Using Suffix Array

An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. In this article, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays. Experimental results using real and synthetic data show that the respective implementation outperforms the one by Pinho et al.