Researcher profile

Gonzalo Navarro

Gonzalo Navarro contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
30 works
0 followers
8 topics
4 close collaborators

Published work

30 published item(s)

preprint · 2022 · arXiv

Balancing Run-Length Straight-Line Programs

It was recently proved that any SLP generating a given string $w$ can be transformed in linear time into an equivalent balanced SLP of the same asymptotic size. We show that this result also holds for RLSLPs, which are SLPs extended with run-length rules of the form $A \rightarrow B^t$ for $t>2$, deriving $\texttt{exp}(A) = \texttt{exp}(B)^t$. An immediate consequence is the simplification of the algorithm for extracting substrings of an RLSLP-compressed string. We also show that several problems like answering RMQs and computing Karp-Rabin fingerprints on substrings can be solved in $\mathcal{O}(g_{rl})$ space and $\mathcal{O}(\log n)$ time, $g_{rl}$ being the size of the smallest RLSLP generating the string, of length $n$. We extend the result to solving more general operations on string ranges, in $\mathcal{O}(g_{rl})$ space and $\mathcal{O}(\log n)$ applications of the operation. In general, the smallest RLSLP can be asymptotically smaller than the smallest SLP by up to an $\mathcal{O}(\log n)$ factor, so our results can make a difference in terms of the space needed for computing these operations efficiently for some string families.
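To make the run-length rule $A \rightarrow B^t$ concrete, here is a minimal Python sketch of a toy RLSLP with random access to one character of the expansion. The grammar, the rule encoding and the function names are illustrative assumptions, not the paper's construction; in particular the grammar is not balanced, so `access` takes time proportional to the grammar height rather than the $\mathcal{O}(\log n)$ bound of the abstract.

```python
from functools import lru_cache

# Hypothetical toy grammar. Symbols missing from RULES are terminals.
# ("pair", B, C) encodes A -> B C; ("run", B, t) encodes A -> B^t.
RULES = {
    "X": ("pair", "a", "b"),  # exp(X) = "ab"
    "Y": ("run", "X", 3),     # exp(Y) = exp(X)^3 = "ababab"
    "S": ("pair", "Y", "a"),  # exp(S) = "abababa"
}


@lru_cache(maxsize=None)
def length(sym):
    """Length of exp(sym), computed without expanding the string."""
    if sym not in RULES:
        return 1
    kind, left, right = RULES[sym]
    if kind == "pair":
        return length(left) + length(right)
    return length(left) * right  # run rule: right holds the exponent t


def access(sym, i):
    """Return exp(sym)[i] (0-based) by descending the grammar."""
    while sym in RULES:
        kind, left, right = RULES[sym]
        if kind == "pair":
            left_len = length(left)
            if i < left_len:
                sym = left
            else:
                sym, i = right, i - left_len
        else:
            # All t copies are identical, so reduce i modulo |exp(B)|.
            i %= length(left)
            sym = left
    return sym


def expand(sym):
    """Full expansion, used only to check the functions above."""
    if sym not in RULES:
        return sym
    kind, left, right = RULES[sym]
    if kind == "pair":
        return expand(left) + expand(right)
    return expand(left) * right
```

The run rule is handled with a single modulo operation, which is why a run of $t$ copies costs one rule instead of the roughly $\log t$ rules a plain SLP would need.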

preprint · 2022 · arXiv

Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach showing that ours runs always at least twice faster (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.

preprint · 2022 · arXiv

L-systems for Measuring Repetitiveness

An L-system (for lossless compression) is a CPD0L-system extended with two parameters $d$ and $n$, which determines unambiguously a string $w = τ(φ^d(s))[1:n]$, where $φ$ is the morphism of the system, $s$ is its axiom, and $τ$ is its coding. The length of the shortest description of an L-system generating $w$ is known as $\ell$, and is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper we deepen the study of the measure $\ell$ and its relation with $δ$, a better established lower bound that builds on substring complexity. Our results show that $\ell$ and $δ$ are largely orthogonal, in the sense that one can be much larger than the other depending on the case. This suggests that both sources of repetitiveness are mostly unrelated. We also show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro-schemes, can be asymptotically strictly smaller than both mechanisms, which makes the size $ν$ of the smallest NU-system the unique smallest reachable repetitiveness measure to date.

preprint · 2021 · arXiv

PHONI: Streamed Matching Statistics with Multi-Genome References

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.

preprint · 2021 · arXiv

Towards a Definitive Compressibility Measure for Repetitive Sequences

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel-Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size $γ$ of the smallest string attractor, was introduced. The measure $γ\le b$ lower bounds all the previous relevant ones, yet length-$n$ strings can be represented and efficiently indexed within space $O(γ\log\frac{n}γ)$, which also upper bounds most measures. While $γ$ is certainly a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in $o(γ\log n)$ space. In this paper, we study an even smaller measure, $δ\le γ$, which can be computed in linear time, is monotonic, and allows encoding every string in $O(δ\log\frac{n}δ)$ space because $z = O(δ\log\frac{n}δ)$. We show that $δ$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $δ$ can be strictly smaller than $γ$, by up to a logarithmic factor; (2) there are string families needing $Ω(δ\log\frac{n}δ)$ space to be encoded, so this space is optimal for every $n$ and $δ$; (3) one can build run-length context-free grammars of size $O(δ\log\frac{n}δ)$, whereas the smallest (non-run-length) grammar can be up to $Θ(\log n/\log\log n)$ times larger; and (4) within $O(δ\log\frac{n}δ)$ space we can not only...
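The abstract does not spell out how $δ$ is defined. In the substring-complexity literature it is usually taken as $δ = \max_k d_k(w)/k$, where $d_k(w)$ is the number of distinct length-$k$ substrings of $w$. The following Python sketch assumes that definition and evaluates it by brute force; it is a quadratic-or-worse illustration, not the linear-time algorithm the abstract refers to, and the function name is made up for this example.

```python
def delta(w):
    """Brute-force delta = max over k of d_k(w) / k.

    d_k(w) is the number of distinct substrings of length k.
    Assumes the usual substring-complexity definition of delta.
    """
    n = len(w)
    best = 0.0
    for k in range(1, n + 1):
        d_k = len({w[i:i + k] for i in range(n - k + 1)})
        best = max(best, d_k / k)
    return best
```

For instance, `delta("aaaa")` is 1.0 because there is exactly one distinct substring of each length, while `delta("abab")` is 2.0, reached at $k = 1$ by the two distinct symbols.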

preprint · 2020 · arXiv

Approximating Optimal Bidirectional Macro Schemes

Lempel-Ziv is an easy-to-compute member of a wide family of so-called macro schemes; it restricts pointers to go in one direction only. Optimal bidirectional macro schemes are NP-complete to find, but they may provide much better compression on highly repetitive sequences. We consider the problem of approximating optimal bidirectional macro schemes. We describe a simulated annealing algorithm that usually converges quickly. Moreover, in some cases, we obtain bidirectional macro schemes that are provably a 2-approximation of the optimal. We test our algorithm on a number of artificial repetitive texts and verify that it is efficient in practice and outperforms Lempel-Ziv, sometimes by a wide margin.
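As a point of reference for the unidirectional case, the sketch below computes a greedy Lempel-Ziv parse in Python: each phrase is the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), or a single fresh character. This is one common variant of the parse, written naively for clarity; it is not the simulated annealing algorithm of the paper, and the function name is an assumption.

```python
def lz_parse(w):
    """Greedy left-to-right LZ parse; returns the list of phrases.

    A phrase is the longest prefix of w[i:] with an occurrence that
    starts before position i (the source may overlap the phrase),
    or one literal character when no such occurrence exists.
    Naive implementation, far slower than practical LZ parsers.
    """
    phrases = []
    i, n = 0, len(w)
    while i < n:
        length = 0
        # str.find returns the leftmost occurrence, so "< i" means
        # the candidate phrase has a source starting earlier.
        while i + length < n and w.find(w[i:i + length + 1]) < i:
            length += 1
        length = max(length, 1)  # literal phrase for a new character
        phrases.append(w[i:i + length])
        i += length
    return phrases
```

On `"abababab"` this yields the three phrases `a`, `b`, `ababab`: the last phrase copies from position 0 and overlaps itself, which is allowed because its pointer still goes strictly backwards.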

preprint · 2020 · arXiv

Cell cycle and protein complex dynamics in discovering signaling pathways

Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationships among proteins, in candidate pathways using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information for deciding on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.

preprint · 2020 · arXiv

Grammar-Compressed Indexes with Logarithmic Search Time

Let a text $T[1..n]$ be the only string generated by a context-free grammar with $g$ (terminal and nonterminal) symbols, and of size $G$ (measured as the sum of the lengths of the right-hand sides of the rules). Such a grammar, called a grammar-compressed representation of $T$, can be encoded using essentially $G\lg g$ bits. We introduce the first grammar-compressed index that uses $O(G\lg n)$ bits and can find the $occ$ occurrences of patterns $P[1..m]$ in time $O((m^2+occ)\lg G)$. We implement the index and demonstrate its practicality in comparison with the state of the art, on highly repetitive text collections.

preprint · 2020 · arXiv

Lempel-Ziv-like Parsing in Small Space

Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the $k$th order empirical entropy compression $n H_k + o(n\logσ)$ with $k = o(\log_σn)$, where $n$ is the input length and $σ$ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be $Ω(\log n)$ times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below $2.0$ in all tested scenarios, and sometimes below $1.05$, to the size of LZ.

preprint · 2020 · arXiv

Optimal Joins using Compact Data Structures

Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now have several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible. We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.

preprint · 2020 · arXiv

PFP Data Structures

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a data structure and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.

preprint · 2020 · arXiv

Practical Random Access to SLP-Compressed Texts

Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.

preprint · 2020 · arXiv

Semantrix: A Compressed Semantic Matrix

We present a compact data structure to represent both the duration and length of homogeneous segments of trajectories from moving objects in a way that, as a data warehouse, it allows us to efficiently answer cumulative queries. The division of trajectories into relevant segments has been studied in the literature under the topic of Trajectory Segmentation. In this paper, we design a data structure to compactly represent them and the algorithms to answer the more relevant queries. We experimentally evaluate our proposal in the real context of an enterprise with mobile workers (truck drivers) where we aim at analyzing the time they spend in different activities. To test our proposal under higher stress conditions we generated a huge amount of synthetic realistic trajectories and evaluated our system with those data to have a good idea about its space needs and its efficiency when answering different types of queries.

preprint · 2013 · arXiv

Compressed Vertical Partitioning for Full-In-Memory RDF Management

The Web of Data has been gaining momentum, and as a result more and more semi-structured datasets are published following the RDF model, based on atomic triple units of subject, predicate, and object. Although it is a simple model, compression methods become necessary because datasets are increasingly larger and various scalability issues arise around their organization and storage. This requirement is more restrictive in RDF stores because efficient SPARQL resolution on the compressed RDF datasets is also required. This article introduces a novel RDF indexing technique (called k2-triples) supporting efficient SPARQL resolution in compressed space. k2-triples uses the predicate to vertically partition the dataset into disjoint subsets of pairs (subject, object), one per predicate. These subsets are represented as binary matrices in which 1-bits mean that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using k2-trees. We enhance this model with two compact indexes listing the predicates related to each different subject and object, in order to address the specific weaknesses of vertically partitioned representations. The resulting technique not only achieves by far the most compressed representations, but also the best overall performance for RDF retrieval in our experiments. Our approach uses up to 10 times less space than a state-of-the-art baseline, and outperforms it by several orders of magnitude on the most basic query patterns. In addition, we optimize traditional join algorithms on k2-triples and define a novel one leveraging its specific features. Our experimental results show that our technique overcomes traditional vertical partitioning for join resolution, reporting the best numbers for joins in which the non-joined nodes are provided, and being competitive in the majority of the cases.
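The vertical partitioning step described above can be sketched in a few lines of Python. The example assumes subjects and objects are already mapped to small integer IDs, uses plain sets and lists of lists instead of k2-trees, and all names (`vertical_partition`, `to_bit_matrix`, `objects_of`, the sample triples) are hypothetical; it only illustrates the data model, not the compressed index.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples by predicate.

    Returns a dict: predicate -> set of (subject, object) pairs.
    """
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return dict(parts)


def to_bit_matrix(pairs, n_subjects, n_objects):
    """Binary matrix with a 1 at [s][o] for every pair (s, o)."""
    matrix = [[0] * n_objects for _ in range(n_subjects)]
    for s, o in pairs:
        matrix[s][o] = 1
    return matrix


def objects_of(parts, s, p):
    """Resolve the pattern (s, p, ?o): objects linked to s by p."""
    return sorted(o for (s2, o) in parts.get(p, ()) if s2 == s)


# Hypothetical data: integer IDs for subjects and objects.
TRIPLES = [(0, "knows", 1), (0, "knows", 2), (1, "knows", 2), (2, "likes", 0)]
```

With these triples, `objects_of(vertical_partition(TRIPLES), 0, "knows")` returns `[1, 2]`, which corresponds to reading row 0 of the "knows" matrix. A pattern with an unbound predicate would have to visit every matrix, which is the weakness the two extra predicate indexes in the abstract are meant to address.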

preprint · 2013 · arXiv

Encoding Range Minimum Queries

We consider the problem of encoding range minimum queries (RMQs): given an array A[1..n] of distinct totally ordered values, to pre-process A and create a data structure that can answer the query RMQ(i,j), which returns the index containing the smallest element in A[i..j], without access to the array A at query time. We give a data structure whose space usage is 2n + o(n) bits, which is asymptotically optimal for worst-case data, and answers RMQs in O(1) worst-case time. This matches the previous result of Fischer and Heun [SICOMP, 2011], but is obtained in a more natural way. Furthermore, our result can encode the RMQs of a random array A in 1.919n + o(n) bits in expectation, which is not known to hold for Fischer and Heun's result. We then generalize our result to the encoding range top-2 query (RT2Q) problem, which is like the encoding RMQ problem except that the query RT2Q(i,j) returns the indices of both the smallest and second-smallest elements of A[i..j]. We introduce a data structure using 3.272n+o(n) bits that answers RT2Qs in constant time, and also give lower bounds on the effective entropy of RT2Q.
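One way to see why about 2n bits can suffice is that RMQ answers depend only on the shape of the Cartesian tree of A, not on the values themselves. The Python sketch below records the push/pop sequence of the standard stack-based Cartesian tree construction as a 2n-bit string, so arrays with the same tree shape get the same bits. It is an illustration under that assumption only: it does not answer queries from the bits, and it is not the data structure of the paper. The function names are made up for the example.

```python
def cartesian_bits(a):
    """Encode the Cartesian tree shape of a (distinct values) in 2n bits.

    '1' is written for every push and '0' for every pop of the
    usual left-to-right stack construction.
    """
    bits, stack = [], []
    for x in a:
        while stack and stack[-1] > x:
            stack.pop()
            bits.append("0")
        stack.append(x)
        bits.append("1")
    bits.extend("0" * len(stack))  # pop whatever is left at the end
    return "".join(bits)


def rmq_brute(a, i, j):
    """Index (0-based) of the minimum of a[i..j], by direct scan."""
    return min(range(i, j + 1), key=a.__getitem__)
```

For example, `[3, 1, 2]` and `[30, 10, 20]` both encode to `101100` and agree on every RMQ, whereas `[1, 2, 3]` encodes to `111000`.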

preprint · 2013 · arXiv

Optimal Dynamic Sequence Representations

We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on a string $S[1,n]$ over alphabet $[1..σ]$ in time $O(\lg n/\lg\lg n)$, which is optimal even on binary sequences and in the amortized sense. Our time is worst-case for the queries and amortized for the updates. This complexity is better than the best previous ones by a $Θ(1+\lgσ/\lg\lg n)$ factor. We also design a variant where times are worst-case, yet rank and updates take $O(\lg n)$ time. Our structure uses $nH_0(S)+o(n\lgσ) + O(σ\lg n)$ bits, where $H_0(S)$ is the zero-order entropy of $S$. Finally, we pursue various extensions and applications of the result.
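For readers unfamiliar with the three queries, the following Python sketch gives their plain definitions on an uncompressed string, together with the zero-order entropy $H_0(S)$ that appears in the space bound. Positions are 1-based as in $S[1,n]$. These are linear-time reference implementations written for this listing, useful as a specification to test against, and have nothing to do with the $O(\lg n/\lg\lg n)$ structure of the paper.

```python
import math
from collections import Counter


def access(S, i):
    """Symbol at position i (1-based)."""
    return S[i - 1]


def rank(S, c, i):
    """Number of occurrences of c in S[1..i]."""
    return S[:i].count(c)


def select(S, c, j):
    """Position (1-based) of the j-th occurrence of c, or None."""
    seen = 0
    for pos, x in enumerate(S, 1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None


def zero_order_entropy(S):
    """H_0(S) = sum over symbols c of (n_c / n) * log2(n / n_c)."""
    n = len(S)
    return sum(cnt / n * math.log2(n / cnt) for cnt in Counter(S).values())
```

The two counting queries are inverse to each other in the sense that `rank(S, c, select(S, c, j)) == j` whenever the j-th occurrence exists.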

preprint · 2013 · arXiv

Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences

Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to "natural language" text collections, where inverted indices are the preferred solution. As successful as this paradigm has been, it fails to properly handle some East Asian languages and other scenarios where the "natural language" assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many others. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and others.

preprint · 2012 · arXiv

Compact Binary Relation Representations with Rich Functionality

Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that not only are space-efficient but also efficiently support a large subset of the desired queries.

preprint · 2012 · arXiv

Efficient Fully-Compressed Sequence Representations

We present a data structure that stores a sequence $s[1..n]$ over alphabet $[1..σ]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the zero-order entropy of $s$. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time $O(\lg\lg σ)$ and average time $O(\lg H_0(s))$. The worst-case complexity matches the best previous results, yet these had been achieved with data structures using $nH_0(s)+o(n\lgσ)$ bits. On highly compressible sequences the $o(n\lgσ)$ bits of the redundancy may be significant compared to the $nH_0(s)$ bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve $(i)$ compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; $(ii)$ compressed permutations $π$ with times for $π()$ and $π^{-1}()$ improved to loglogarithmic; and $(iii)$ the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. ...

preprint · 2012 · arXiv

Faster Compact Top-k Document Retrieval

An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.

preprint · 2012 · arXiv

Sorted Range Reporting

In this paper we consider a variant of the orthogonal range reporting problem when all points should be reported in the sorted order of their $x$-coordinates. We show that reporting two-dimensional points with this additional condition can be organized (almost) as efficiently as the standard range reporting. Moreover, our results generalize and improve the previously known results for the orthogonal range successor problem and can be used to obtain better solutions for some stringology problems.

preprint · 2012 · arXiv

Space-Efficient Data-Analysis Queries on Grids

We consider various data-analysis queries on two-dimensional points. We give new space/time tradeoffs over previous work on geometric queries such as dominance and rectangle visibility, and on semigroup and group queries such as sum, average, variance, minimum and maximum. We also introduce new solutions to queries less frequently considered in the literature such as two-dimensional quantiles, majorities, successor/predecessor, mode, and various top-$k$ queries, considering static and dynamic scenarios.

preprint · 2011 · arXiv

Compressed String Dictionaries

The problem of storing a set of strings (a string dictionary) in compact form appears naturally in many cases. While classically it has represented a small part of the whole data to be processed (e.g., for Natural Language processing or for indexing text collections), more recent applications in Web engines, Web mining, RDF graphs, Internet routing, Bioinformatics, and many others, make use of very large string dictionaries, whose size is a significant fraction of the whole data. Thus novel approaches to compress them efficiently are necessary. In this paper we experimentally compare time and space performance of some existing alternatives, as well as new ones we propose. We show that space reductions of up to 20% of the original size of the strings are possible while supporting fast dictionary searches.

preprint · 2011 · arXiv

Improved Grammar-Based Compressed Indexes

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text $T[1..u]$ that is represented by a (context-free) grammar of $n$ (terminal and nonterminal) symbols and size $N$ (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of $T$ takes $N\lg n$ bits of space. Our representation requires $2N\lg n + N\lg u + ε\, n\lg n + o(N\lg n)$ bits of space, for any $0<ε\le 1$. It can find the positions of the $occ$ occurrences of a pattern of length $m$ in $T$ in $O((m^2/ε)\lg (\frac{\lg u}{\lg n}) +occ\lg n)$ time, and extract any substring of length $\ell$ of $T$ in time $O(\ell+h\lg(N/h))$, where $h$ is the height of the grammar tree.

preprint · 2011 · arXiv

On Compressing Permutations and Adaptive Sorting

Previous compact representations of permutations have focused on adding a small index on top of the plain data $\langle π(1), π(2), \ldots, π(n)\rangle$, in order to efficiently support the application of the inverse or the iterated permutation. In this paper we initiate the study of techniques that exploit the compressibility of the data itself, while retaining efficient computation of $π(i)$ and its inverse. In particular, we focus on exploiting runs, which are subsets (contiguous or not) of the domain where the permutation is monotonic. Several variants of those types of runs arise in real applications such as inverted indexes and suffix arrays. Furthermore, our improved results on compressed data structures for permutations also yield better adaptive sorting algorithms.

preprint · 2011 · arXiv

Practical Top-K Document Retrieval in Reduced Space

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel algorithms and data structures dominate almost all the space/time tradeoff.

preprint · 2011 · arXiv

Self-Index Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.6 times), extracts 1--2 million characters of the text per second, and finds patterns at a rate of 10--50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.

preprint · 2011 · arXiv

Self-Index based on LZ77 (thesis)

Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection. This fact makes the compressibility of the collection extremely high. For example, a collection with all different versions of a Wikipedia article can be compressed to as little as 0.1% of its original space, using the Lempel-Ziv 1977 (LZ77) compression scheme. Many of these repetitive collections handle huge amounts of text data. For that reason, we require a method to store them efficiently, while providing the ability to operate on them. The most common operations are the extraction of random portions of the collection and the search for all the occurrences of a given pattern inside the whole collection. A self-index is a data structure that stores a text in compressed form and allows finding the occurrences of a pattern efficiently. In addition, self-indexes can extract any substring of the collection, hence they are able to replace the original text. One of the main goals when using these indexes is to store them within main memory. In this thesis we present a scheme for random text extraction from text compressed with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called LZ-End, that efficiently extracts text using space close to that of LZ77. The main contribution of this thesis is the first self-index based on LZ77/LZ-End and oriented to repetitive texts, which outperforms the state of the art (the RLCSA self-index) in many aspects. Finally, we present a corpus of repetitive texts, coming from several application domains. We aim at providing a standard set of texts for research and experimentation, hence this corpus is publicly available.

preprint · 2010 · arXiv

Fully-Functional Static and Dynamic Succinct Trees

We propose new succinct representations of ordinal trees, which have been studied extensively. It is known that any $n$-node static tree can be represented in $2n + o(n)$ bits and a number of operations on the tree can be supported in constant time under the word-RAM model. However the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on sufficiently small trees. The result is extended to trees of arbitrary size, achieving $2n + O(n /\mathrm{polylog}(n))$ bits of space. The redundancy is significantly lower than any previous proposal. Our data structure builds on the range min-max tree to achieve $2n+O(n/\log n)$ bits of space and $O(\log n)$ time for all the operations. We also propose an improved data structure using $2n+O(n\log\log n/\log n)$ bits and improving the time to the optimal $O(\log n/\log \log n)$ for most operations. Furthermore, we support sophisticated operations that allow attaching and detaching whole subtrees, in time $O(\log^{1+ε} n / \log\log n)$. Our techniques are of independent interest. One allows representing dynamic bitmaps and sequences supporting rank/select and indels, within zero-order entropy bounds and optimal time $O(\log n / \log\log n)$ for all operations on bitmaps and polylog-sized alphabets, and $O(\log n \log σ/ (\log\log n)^2)$ on larger alphabet sizes $σ$. This improves upon the best existing bounds for entropy-bounded storage of dynamic sequences, compressed full-text self-indexes, and compressed-space construction of the Burrows-Wheeler transform.

preprint · 2010 · arXiv

New Algorithms on Wavelet Trees and Applications to Information Retrieval

Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
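As an illustration of a range quantile query, here is a minimal pointer-based wavelet tree over integers in Python. It is a sketch under simplifying assumptions: prefix counts are stored as plain lists instead of succinct bitmaps with rank support, the input sequence is non-empty, and the class and method names are invented for this example rather than taken from the paper.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a non-empty integer sequence."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or an empty node that is never queried
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i symbols go to the left child
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, l, r, k):
        """k-th smallest (k >= 1) value in seq[l:r], half-open range."""
        if self.lo == self.hi:
            return self.lo
        lc, rc = self.prefix[l], self.prefix[r]
        in_left = rc - lc  # elements of the range that went left
        if k <= in_left:
            return self.left.quantile(lc, rc, k)
        return self.right.quantile(l - lc, r - rc, k - in_left)
```

For `WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])`, the call `quantile(2, 6, 2)` looks at the values 4, 1, 5, 9 and returns 4. Each query follows one root-to-leaf path, so its cost depends on the tree height (logarithmic in the value range here) and not on the length of the queried range.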