Source author record

Djamal Belazzougui

Djamal Belazzougui appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher. Unclaimed source record.

Data Structures and Algorithms; Computational Complexity; Discrete Mathematics; Information Retrieval; Information Theory; math.IT

Catalog footprint

What is connected

27 works, 6 topics, 4 close collaborators


preprint, 2020, arXiv

Efficient tree-structured categorical retrieval

We study a document retrieval problem in a new framework where $D$ text documents are organized in a \emph{category tree} with a pre-defined number $h$ of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\logσ(1+o(1))+\log D+O(h)) + O(Δ)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $σ$ the size of the alphabet used in the documents and $Δ$ is the total number of nodes in the category tree. Another solution uses $n(\logσ(1+o(1))+O(\log D))+O(Δ)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

preprint, 2016, arXiv

Edit Distance: Sketching, Streaming and Document Exchange

We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

preprint, 2016, arXiv

Fully Dynamic de Bruijn Graphs

We present a space- and time-efficient fully dynamic implementation of de Bruijn graphs, which can also support fixed-length jumbled pattern matching.

preprint, 2016, arXiv

Indexing and querying color sets of images

We aim to study the set of color sets of continuous regions of an image given as a matrix of $m$ rows over $n\geq m$ columns where each element in the matrix is an integer from $[1,σ]$ named a \emph{color}. The set of distinct colors in a region is called its fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions, named rectangles. The set of all such fingerprints is denoted by ${\cal F}$. A rectangle is \emph{maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}$. We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2σ)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2σ\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+|\mathcal{L}|)$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|{\cal F}|)$ while the query time becomes $O(|f|+\log\log n)$. We eventually consider the special case of square regions (squares).

preprint, 2016, arXiv

Linear time construction of compressed text indices in compact space

We show that the compressed suffix array and the compressed suffix tree for a string of length $n$ over an integer alphabet of size $σ\leq n$ can both be built in $O(n)$ (randomized) time using only $O(n\logσ)$ bits of working space. The previously fastest construction algorithms that used $O(n\logσ)$ bits of space took times $O(n\log\logσ)$ and $O(n\log^εn)$ respectively (where $ε$ is any positive constant smaller than $1$). In passing, we show that the Burrows-Wheeler transform of a string of length $n$ over an alphabet of size $σ$ can be built in deterministic $O(n)$ time and space $O(n\logσ)$. We also show that within the same time and space, we can carry out many sequence analysis tasks and construct some variants of the compressed suffix array and compressed suffix tree.

preprint, 2016, arXiv

Linear-time string indexing and analysis in small space

The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,σ\}^n$ can be built in deterministic $O(n)$ time using just $O(n\logσ)$ bits of space, where $σ\leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\logσ)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\logσ)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\logσ)$ bits of space, took $O(n\log{\logσ})$ time for the first two structures, and $O(n\log^εn)$ time for the third, where $ε$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.

preprint, 2016, arXiv

Practical combinations of repetition-aware data structures

Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors. Such variants use, respectively, the RLBWT of a string and the RLBWT of its reverse, or just one RLBWT inside a bidirectional index, or just one RLBWT with support for unidirectional extraction. We also study the practical advantages of combining RLBWT with the compact directed acyclic word graph of a string, a data structure that takes space proportional to the number of one-character extensions of maximal repeats. Our approaches are easy to implement, and provide competitive tradeoffs on significant datasets.
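To make the notion of BWT runs concrete, here is a minimal Python sketch that builds the Burrows-Wheeler transform by sorting rotations and then run-length encodes it. It is a quadratic-space textbook construction, not the structure described in the paper; the function names `bwt` and `run_length_encode` and the `$` sentinel (assumed smaller than every character of the input) are illustrative assumptions.

```python
from itertools import groupby


def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustrative only).

    Assumes `sentinel` does not occur in `s` and sorts before all its characters.
    """
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def run_length_encode(b):
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(b)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
print(transformed)  # annb$aa
print(runs)         # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

The length of `runs` is the number of BWT runs, the repetition measure that governs the size of an RLBWT.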

preprint, 2016, arXiv

Range Majorities and Minorities in Arrays

Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks us to preprocess a string of length $n$ such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold $τ$. Subsequent authors have reduced their time and space bounds such that, when $τ$ is fixed at preprocessing time, we need either $O(n \log (1 / τ))$ space and optimal $O(1 / τ)$ query time or linear space and $O((1 / τ) \log \log σ)$ query time, where $σ$ is the alphabet size. In this paper we give the first linear-space solution with optimal $O(1 / τ)$ query time, even with variable $τ$ (i.e., specified with the query). For the case when $σ$ is polynomial in the computer word size, our space is optimally compressed according to the symbol frequencies in the string. Otherwise, either the compressed space is increased by an arbitrarily small constant factor or the time rises to any function in $(1/τ)\cdotω(1)$. We obtain the same results on the complementary problem of parameterized range minority introduced by Chan et al. (2015), who had achieved linear space and $O(1 / τ)$ query time with variable $τ$.
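The query itself can be stated as a brute-force baseline. The following Python sketch scans the range and counts, so it takes time linear in the range length rather than the $O(1/τ)$ of the paper; the function name `range_majorities` and the 0-based inclusive endpoints are assumptions made for illustration.

```python
from collections import Counter


def range_majorities(a, lo, hi, tau):
    """Return the distinct elements of a[lo..hi] (inclusive, 0-based)
    whose relative frequency in that range is strictly more than tau."""
    window = a[lo:hi + 1]
    threshold = tau * len(window)
    counts = Counter(window)
    return sorted(x for x, c in counts.items() if c > threshold)


a = [1, 2, 1, 3, 1, 2, 2]
print(range_majorities(a, 0, 4, 0.5))  # [1]
print(range_majorities(a, 0, 6, 0.4))  # [1, 2]
```

Since every reported element occupies more than a $τ$ fraction of the range, fewer than $1/τ$ elements can be reported, which is why $O(1/τ)$ query time is the natural target.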

preprint, 2015, arXiv

A framework for space-efficient string kernels

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.
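For orientation, a $k$-mer kernel can be sketched with explicit count tables. The Python below assumes the unnormalized spectrum form (the inner product of the two $k$-mer count vectors), which may differ from the exact normalization used in the paper, and it uses hash tables rather than the BWT-based approach the abstract describes. The names `kmer_counts` and `kmer_kernel` are hypothetical.

```python
from collections import Counter


def kmer_counts(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t (unnormalized)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(c * ct[w] for w, c in cs.items())


print(kmer_kernel("ACGTACG", "ACGA", 3))  # 2
```

Here `ACG` occurs twice in the first string and once in the second, and no other 3-mer is shared, so the kernel value is 2.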

preprint, 2015, arXiv

Composite repetition-aware data structures

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.

preprint, 2015, arXiv

Efficient Deterministic Single Round Document Exchange for Edit Distance

Suppose that we have two parties that each possess a binary string. Suppose that the length of the first string (document) is $n$ and that the two strings (documents) have edit distance (minimal number of deletes, inserts and substitutions needed to transform one string into the other) at most $k$. The problem we want to solve is to devise an efficient protocol in which the first party sends a single message that allows the second party to recover the first party's string. In this paper we show an efficient deterministic protocol for this problem. The protocol runs in time $O(n\cdot \mathtt{polylog}(n))$ and has message size $O(k^2+k\log^2n)$ bits. To the best of our knowledge, ours is the first efficient deterministic protocol for this problem, if efficiency is measured in both the message size and the running time. As an immediate application of our new protocol, we show a new error correcting code that is efficient even for large numbers of (adversarial) edit errors.
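The edit distance used in this definition can be computed with the standard dynamic program. The Python sketch below is the classical quadratic-time algorithm, shown only to pin down the measure; it is not part of the exchange protocol, and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x, y):
    """Minimal number of deletes, inserts and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        cur = [i]  # deleting the first i characters of x
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


print(edit_distance("0110", "0010"))  # 1
```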

preprint, 2015, arXiv

Optimal Las Vegas reduction from one-way set reconciliation to error correction

Suppose we have two players $A$ and $C$, where player $A$ has a string $s[0..u-1]$ and player $C$ has a string $t[0..u-1]$ and neither player knows the other's string. Assume that $s$ and $t$ are both over an integer alphabet $[σ]$, where the first string contains $n$ non-zero entries. We wish to answer the following basic question. Assuming that $s$ and $t$ differ in at most $k$ positions, how many bits does player $A$ need to send to player $C$ so that $C$ can recover $s$ with certainty? Further, how much time does player $A$ need to spend to compute the sent bits and how much time does player $C$ need to recover the string $s$? This problem has a certain number of applications, for example in databases, where each of the two parties possesses a set of $n$ key-value pairs, where keys are from the universe $[u]$ and values are from $[σ]$ and usually $n\ll u$. In this paper, we show a time and message-size optimal Las Vegas reduction from this problem to the problem of systematic error correction of $k$ errors for strings of length $Θ(n)$ over an alphabet of size $2^{Θ(\logσ+\log (u/n))}$. The additional running time incurred by the reduction is linear randomized for player $A$ and linear deterministic for player $C$, but the correction works with certainty. When using the popular Reed-Solomon codes, the reduction gives a protocol that transmits $O(k(\log u+\logσ))$ bits and runs in time $O(n\cdot\mathrm{polylog}(n)(\log u+\logσ))$ for all values of $k$. The time is randomized for player $A$ (encoding time) and deterministic for player $C$ (decoding time). The space is optimal whenever $k\leq (uσ)^{1-Ω(1)}$.

preprint, 2015, arXiv

Range Predecessor and Lempel-Ziv Parsing

The Lempel-Ziv parsing of a string (LZ77 for short) is one of the most important and widely-used algorithmic tools in data compression and string processing. We show that the Lempel-Ziv parsing of a string of length $n$ on an alphabet of size $σ$ can be computed in $O(n\log\logσ)$ time ($O(n)$ time if we allow randomization) using $O(n\logσ)$ bits of working space; that is, using space proportional to that of the input string in bits. The previous fastest algorithm using $O(n\logσ)$ space takes $O(n(\logσ+\log\log n))$ time. We also consider the important rightmost variant of the problem, where the goal is to associate with each phrase of the parsing its most recent occurrence in the input string. We solve this problem in $O(n(1 + \logσ/\sqrt{\log n}))$ time, using the same working space as above. The previous best solution for rightmost parsing uses $O(n(1+\logσ/\log\log n))$ time and $O(n\log n)$ space. As a bonus, in our solution for rightmost parsing we provide a faster construction method for efficient 2D orthogonal range reporting, which is of independent interest.
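As a reference point for what is being computed, here is a brute-force Python sketch of a greedy LZ77 parsing that also records the rightmost earlier source of each phrase. It assumes one common variant (each phrase is the longest prefix of the remaining text that also starts at an earlier position, with self-overlap allowed, or a single fresh character), which may differ in detail from the paper's definition. It runs in quadratic time or worse, and the name `lz77_parse` is hypothetical.

```python
def lz77_parse(s):
    """Greedy LZ77 parsing; returns (phrase, source, length) triples.

    `source` is the 0-based start of the rightmost earlier occurrence of the
    phrase, or None when the phrase is a character not seen before.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, None
        for src in range(i):
            length = 0
            while i + length < n and s[src + length] == s[i + length]:
                length += 1
            # ">=" keeps the rightmost source among the longest matches
            if length > 0 and length >= best_len:
                best_len, best_src = length, src
        if best_len == 0:
            phrases.append((s[i], None, 1))
            i += 1
        else:
            phrases.append((s[i:i + best_len], best_src, best_len))
            i += best_len
    return phrases


for phrase in lz77_parse("abracadabra"):
    print(phrase)
```

On `abracadabra` this yields eight phrases, the last being `abra` copied from position 0.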

preprint, 2015, arXiv

Space-efficient detection of unusual words

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(σ^2\log^2 n)$ bits, where $n$ is the length of the string and $σ$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $σ$. We further improve the algorithm by removing its time dependency on $σ$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that $\textit{do not occur}$ in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

preprint, 2014, arXiv

Better Space Bounds for Parameterized Range Majority and Minority

Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks to preprocess a string of length $n$ such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold $τ$. Subsequent authors have reduced their time and space bounds such that, when $τ$ is given at preprocessing time, we need either $O(n \log (1 / τ))$ space and optimal $O(1 / τ)$ query time or linear space and $O((1 / τ) \log \log σ)$ query time, where $σ$ is the alphabet size. In this paper we give the first linear-space solution with optimal $O(1 / τ)$ query time. For the case when $τ$ is given at query time, we significantly improve previous bounds, achieving either $O(n \log \log σ)$ space and optimal $O(1 / τ)$ query time or compressed space and $O((1 / τ) \log \frac{\log (1 / τ)}{\log w})$ query time. Along the way, we consider the complementary problem of parameterized range minority that was recently introduced by Chan et al. (2012), who achieved linear space and $O(1 / τ)$ query time even for variable $τ$. We improve their solution to use either nearly optimally compressed space with no slowdown, or optimally compressed space with nearly no slowdown. Some of our intermediate results, such as density-sensitive query time for one-dimensional range counting, may be of independent interest.

preprint, 2014, arXiv

Faster construction of asymptotically good unit-cost error correcting codes in the RAM model

Assuming we are in a Word-RAM model with word size $w$, we show that we can construct in $o(w)$ time an error correcting code with a constant relative positive distance that maps numbers of $w$ bits into $Θ(w)$-bit numbers, and such that the application of the error-correcting code on any given number $x\in[0,2^w-1]$ takes constant time. Our result improves on a previously proposed error-correcting code with the same properties whose construction time was exponential in $w$.

preprint, 2014, arXiv

Improved space-time tradeoffs for approximate full-text indexing with one edit error

In this paper we are interested in indexing texts for substring matching queries with one edit error. That is, given a text $T$ of $n$ characters over an alphabet of size $σ$, we are asked to build a data structure that answers the following query: find all the $occ$ substrings of the text that are at edit distance at most $1$ from a given string $q$ of length $m$. In this paper we show two new results for this problem. The first result, suitable for an unbounded alphabet, uses $O(n\log^εn)$ (where $ε$ is any constant such that $0<ε<1$) words of space and answers queries in time $O(m+occ)$. This improves simultaneously in space and time over the result of Cole et al. The second result, suitable only for a constant alphabet, relies on compressed text indices and comes in two variants: the first variant uses $O(n\log^ε n)$ bits of space and answers queries in time $O(m+occ)$, while the second variant uses $O(n\log\log n)$ bits of space and answers queries in time $O((m+occ)\log\log n)$. This second result improves on the previously best results for constant alphabets achieved in Lam et al. (Algorithmica 2008) and Chan et al. (Algorithmica 2010).

preprint, 2014, arXiv

Queries on LZ-Bounded Encodings

We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable manner and is both small and fast in practice compared to other data structures supporting such queries.

preprint, 2014, arXiv

Rank, select and access in grammar-compressed strings

Given a string $S$ of length $N$ on a fixed alphabet of $σ$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mbox{rank}_c(S,i)$ (return the number of occurrences of symbol $c$ before position $i$ in $S$); $\mbox{select}_c(S,i)$ (return the position of the $i$th occurrence of $c$ in $S$); and $\mbox{access}(S,i,j)$ (return substring $S[i,j]$). For rank and select we describe data structures of size $O(nσ\log N)$ bits that support the two operations in $O(\log N)$ time. We propose another structure that uses $O(nσ\log (N/n)(\log N)^{1+ε})$ bits and that supports the two queries in $O(\log N/\log\log N)$ time, where $ε>0$ is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires $O(n\log N)$ bits of space and $O(\log N+m/\log_σN)$ time to extract $m=j-i+1$ consecutive symbols from $S$. Alternatively, we can achieve $O(\log N/\log\log N+m/\log_σN)$ query time using $O(n\log (N/n)(\log N)^{1+ε})$ bits of space. This matches a lower bound stated by Verbin and Yu for strings where $N$ is polynomially related to $n$.
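The three operations can be pinned down on an uncompressed string. The Python sketch below works directly on $S$ and takes linear time per query, so it only fixes the semantics and says nothing about the grammar-compressed structures. It assumes 1-based positions and reads "before position $i$" as strictly before; both conventions are assumptions, as are the function names.

```python
def rank(S, c, i):
    """Number of occurrences of c strictly before 1-based position i."""
    return S[:i - 1].count(c)


def select(S, c, i):
    """1-based position of the i-th occurrence of c (i >= 1), or None."""
    pos = -1
    for _ in range(i):
        pos = S.find(c, pos + 1)
        if pos == -1:
            return None
    return pos + 1


def access(S, i, j):
    """Substring S[i..j] with 1-based inclusive endpoints (m = j - i + 1 symbols)."""
    return S[i - 1:j]


S = "abracadabra"
print(rank(S, "a", 5))    # 2
print(select(S, "a", 3))  # 6
print(access(S, 2, 4))    # bra
```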

preprint, 2014, arXiv

Reusing an FM-index

Intuitively, if two strings $S_1$ and $S_2$ are sufficiently similar and we already have an FM-index for $S_1$ then, by storing a little extra information, we should be able to reuse parts of that index in an FM-index for $S_2$. We formalize this intuition and show that it can lead to significant space savings in practice, as well as to some interesting theoretical problems.

preprint, 2013, arXiv

Cache-Oblivious Peeling of Random Hypergraphs

The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random $r$-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires $O(\mathrm{sort}(n))$ I/Os and $O(n \log n)$ time to peel a random hypergraph with $n$ edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of $7.6$ billion keys in less than $21$ hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
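The straightforward in-memory peeling procedure mentioned above can be sketched as follows: repeatedly pick a vertex that lies in exactly one remaining edge, peel that edge, and update degrees. This Python version is the textbook baseline, not the cache-oblivious algorithm of the paper. It assumes the vertices within each edge are distinct, and the name `peel` and the `(edge, vertex)` output format are illustrative.

```python
from collections import defaultdict


def peel(edges):
    """Return a peeling order as (edge_index, free_vertex) pairs,
    or None if the hypergraph cannot be fully peeled."""
    incident = defaultdict(set)  # vertex -> indices of remaining edges
    for e, verts in enumerate(edges):
        for v in verts:
            incident[v].add(e)
    stack = [v for v, es in incident.items() if len(es) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # degree changed since v was pushed
        (e,) = incident[v]
        order.append((e, v))
        for u in edges[e]:
            incident[u].discard(e)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if len(order) == len(edges) else None


print(peel([(0, 1, 2), (1, 2, 3), (2, 3, 4)]))
print(peel([(0, 1), (1, 2), (2, 0)]))  # None: every vertex has degree 2
```

Each returned pair satisfies the defining property of a peeling order: the free vertex belongs to its edge and to no edge peeled later.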

preprint, 2013, arXiv

Optimal Lower and Upper Bounds for Representing Sequences

Sequence representations supporting queries $access$, $select$ and $rank$ are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong lower bound for $rank$, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations $access$ and $select$ can be solved in constant or almost-constant time, which is optimal for large alphabets. Our new upper bounds dominate all of the previous work in the time/space map.

preprint, 2013, arXiv

Single and multiple consecutive permutation motif search

Let $t$ be a permutation (that shall play the role of the \emph{text}) on $[n]$ and a pattern $p$ be a sequence of $m$ distinct integers from $[n]$, $m\leq n$. The pattern $p$ occurs in $t$ in position $i$ if and only if $p_1 \ldots p_m$ is order-isomorphic to $t_i \ldots t_{i+m-1}$, that is, for all $1 \leq k< \ell \leq m$, $p_k>p_\ell$ if and only if $t_{i+k-1}>t_{i+\ell-1}$. Searching for a pattern $p$ in a text $t$ consists in identifying all occurrences of $p$ in $t$. We first present a forward automaton which allows us to search for $p$ in $t$ in $O(m^2\log \log m +n)$ time. We then introduce a Morris-Pratt automaton representation of the forward automaton which allows us to reduce this complexity to $O(m\log \log m +n)$ at the price of an additional amortized constant term per integer of the text. Both automata occupy $O(m)$ space. We then extend the problem to search for a set of patterns and exhibit a specific Aho-Corasick-like algorithm. Next we present a sub-linear average case search algorithm running in $O(\frac{m\log m}{\log\log m}+\frac{n\log m}{m\log\log m})$ time, that we eventually prove to be optimal on average.
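The occurrence condition can be checked directly from its definition. The Python sketch below is a brute-force search that compares all pairs in every window, so it takes $O(nm^2)$ time; it illustrates the problem statement only and is not one of the automata from the paper. The function names are hypothetical, and reported positions are 1-based to match the notation above.

```python
def is_order_isomorphic(p, window):
    """True if p[k] > p[l] exactly when window[k] > window[l], for all k < l."""
    m = len(p)
    return all(
        (p[k] > p[l]) == (window[k] > window[l])
        for k in range(m)
        for l in range(k + 1, m)
    )


def naive_search(p, t):
    """All 1-based positions i where p is order-isomorphic to t_i ... t_{i+m-1}."""
    m, n = len(p), len(t)
    return [i + 1 for i in range(n - m + 1) if is_order_isomorphic(p, t[i:i + m])]


t = [3, 1, 4, 2, 6, 5]
p = [2, 1, 3]
print(naive_search(p, t))  # [1, 3]
```

The pattern `2 1 3` (middle smallest, last largest) matches the windows `3 1 4` and `4 2 6`.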

preprint, 2013, arXiv

Various improvements to text fingerprinting

Let $s = s_1 \ldots s_n$ be a text (or sequence) on a finite alphabet $Σ$ of size $σ$. A fingerprint in $s$ is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set ${\cal F}$ of all fingerprints of all substrings of $s$ in order to answer efficiently certain questions on this set. A substring $s_i \ldots s_j$ is a maximal location for a fingerprint $f \in {\cal F}$ (denoted by $\langle i,j \rangle$) if the alphabet of $s_i \ldots s_j$ is $f$ and $s_{i-1}$, $s_{j+1}$, if defined, are not in $f$. The set of maximal locations in $s$ is ${\cal L}$ (it is easy to see that $|{\cal L}| \leq nσ$). Two maximal locations $\langle i,j \rangle$ and $\langle k,l \rangle$ such that $s_i \ldots s_j = s_k \ldots s_l$ are named \emph{copies}, and the quotient set of ${\cal L}$ according to the copy relation is denoted by ${\cal L}_C$. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute ${\cal F}$; (2) given $f$ as a set of distinct characters in $Σ$, to answer if $f$ represents a fingerprint in ${\cal F}$; (3) given $f$, to find all maximal locations of $f$ in $s$.
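The definitions of fingerprint and maximal location translate into a short brute-force enumeration. The Python sketch below examines all $O(n^2)$ substrings, so it is a baseline for checking the definitions rather than one of the paper's algorithms; the function name and the returned dictionary layout are assumptions, and locations are reported 1-based.

```python
def fingerprints_and_maximal_locations(s):
    """Return (set of fingerprints, dict fingerprint -> list of maximal <i, j>)."""
    n = len(s)
    fingerprints = set()
    maximal = {}
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            f = frozenset(seen)  # alphabet of s_i ... s_j
            fingerprints.add(f)
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                maximal.setdefault(f, []).append((i + 1, j + 1))
    return fingerprints, maximal


F, L = fingerprints_and_maximal_locations("abac")
print(len(F))               # 6
print(L[frozenset("ab")])   # [(1, 3)]
print(L[frozenset("a")])    # [(1, 1), (3, 3)]
```

For `abac` there are 7 maximal locations in total, within the bound $nσ = 12$.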

preprint, 2012, arXiv

Predecessor search with distance-sensitive query time

A predecessor (successor) search finds the largest element $x^-$ smaller than the input string $x$ (the smallest element $x^+$ larger than or equal to $x$, respectively) out of a given set $S$; in this paper, we consider the static case (i.e., $S$ is fixed and does not change over time) and assume that the $n$ elements of $S$ are available for inspection. We present a number of algorithms that, with a small additional index (usually of $O(n \log w)$ bits, where $w$ is the string length), can answer predecessor/successor queries quickly and with time bounds that depend on different kinds of distance, improving significantly several results that appeared in the recent literature. Intuitively, our first result has a running time that depends on the distance between $x$ and $x^\pm$: it is especially efficient when the input $x$ is either very close to or very far from $x^-$ or $x^+$; our second result depends on some global notion of distance in the set $S$, and is fast when the elements of $S$ are more or less equally spaced in the universe; finally, for our third result we rely on a finger (i.e., an element of $S$) to improve upon the first one; its running time depends on the distance between the input and the finger.
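The query semantics can be illustrated with plain binary search on a sorted list. The Python sketch below uses the standard `bisect` module and takes $O(\log n)$ comparisons regardless of distance, so it is a baseline rather than any of the distance-sensitive structures; it assumes integer keys for simplicity, and the function names are illustrative.

```python
from bisect import bisect_left


def predecessor(S, x):
    """Largest element of sorted list S strictly smaller than x, or None."""
    i = bisect_left(S, x)
    return S[i - 1] if i > 0 else None


def successor(S, x):
    """Smallest element of sorted list S larger than or equal to x, or None."""
    i = bisect_left(S, x)
    return S[i] if i < len(S) else None


S = [3, 8, 15, 23, 42]
print(predecessor(S, 15), successor(S, 15))  # 8 15
print(predecessor(S, 16), successor(S, 16))  # 15 23
```

Note the asymmetry inherited from the definition: when $x$ itself is in $S$, the successor is $x$ while the predecessor is the next smaller element.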

preprint, 2011, arXiv

Worst case efficient single and multiple string matching in the Word-RAM model

In this paper, we explore worst-case solutions for the problems of single and multiple matching on strings in the word RAM model with word length w. In the first problem, we have to build a data structure based on a pattern p of length m over an alphabet of size sigma such that we can answer the following query: given a text T of length n, where each character is encoded using log(sigma) bits, return the positions of all the occurrences of p in T (in the following, occ denotes the number of reported occurrences). For the multi-pattern matching problem we have a set S of d patterns of total length m, and a query on a text T consists in finding all positions of all occurrences in T of the patterns in S. As each character of the text is encoded using log sigma bits and we can read w bits in constant time in the RAM model, we assume that we can read up to (w/log sigma) consecutive characters of the text in one time step. This implies that the fastest possible query time for both problems is O(n(log sigma/w)+occ). In this paper we present several different results for both problems which come close to that best possible query time. We first present two different linear space data structures for the first and second problem: the first one answers single pattern matching queries in time O(n(1/m+log sigma/w)+occ), while the second one answers multiple pattern matching queries in time O(n((log d+log y+log log d)/y+log sigma/w)+occ), where y is the length of the shortest pattern in the case of multiple pattern matching. We then show how a simple application of the Four Russians technique yields data structures with query times independent of the length of the shortest pattern (the length of the only pattern in the case of single string matching) at the expense of using more space.

preprint, 2010, arXiv

Succinct Dictionary Matching With No Slowdown

The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size sigma, build a data structure so that we can match in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T|log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)) where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged.
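For readers unfamiliar with the classical solution, here is a compact Python sketch of a pointer-based Aho-Corasick automaton (trie, failure links, merged output lists). It is the textbook structure, not the succinct representation of the paper, and it copies output lists along failure links for simplicity, which costs extra space. It assumes non-empty patterns; the names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists (state 0 is the root)."""
    goto, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
            queue.append(nxt)
    return goto, fail, out


def find_all(patterns, text):
    """Return (0-based start, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for pos, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits


print(sorted(find_all(["he", "she", "his", "hers"], "ushers")))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

For this dictionary the automaton has 10 states, consistent with the bound m <= n + 1 = 13.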
