Source author record

Giulio Ermanno Pibiri

Giulio Ermanno Pibiri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Retrieval Data Structures and Algorithms Databases

Catalog footprint

What is connected

9works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

PTHash: Revisiting FCH Minimal Perfect Hashing

Given a set $S$ of $n$ distinct keys, a function $f$ that bijectively maps the keys of $S$ into the range $\{0,\ldots,n-1\}$ is called a minimal perfect hash function for $S$. Algorithms that find such functions when $n$ is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find $f$ (construction time), time to evaluate $f$ on a key of $S$ (lookup time), and space of representation for $f$. Several algorithms have been proposed to trade-off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the art techniques and provide $2-4\times$ better lookup time.

preprint2021arXiv

Rank/Select Queries over Mutable Bitmaps

The problem of answering rank/select queries over a bitmap is of utmost importance for many succinct data structures. When the bitmap does not change, many solutions exist in the theoretical and practical side. In this work we consider the case where one is allowed to modify the bitmap via a flip(i) operation that toggles its i-th bit. By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets. Compared to the state-of-the-art, our solution improves runtime with no space degradation. Moreover, it does not incur in a significant runtime penalty when compared to the fastest immutable indexes, while providing even lower space overhead.

preprint2020arXiv

Compressed Indexes for Fast Search of Semantic Data

The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30-60% less space and speeding up query execution by a factor of 2-81x.

preprint2020arXiv

Efficient and Effective Query Auto-Completion

Query Auto-Completion (QAC) is an ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial to make the system have a real-time responsiveness when operating in the million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power in that only completions that are prefixed by the query are returned. This may impact negatively the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web Search Engines and eCommerce. In this work we describe the implementation that empowers a new QAC system at eBay, and discuss its efficiency/effectiveness in relation to other approaches at the state-of-the-art. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR that was not always able to meet the required service-level-agreement.

preprint2020arXiv

Handling Massive N-Gram Datasets Efficiently

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.

preprint2020arXiv

On Optimally Partitioning Variable-Byte Codes

The ubiquitous Variable-Byte encoding is one of the fastest compressed representation for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small that is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.

preprint2020arXiv

Practical Trade-Offs for the Prefix-Sum Problem

Given an integer array A, the prefix-sum problem is to answer sum(i) queries that return the sum of the elements in A[0..i], knowing that the integers in A can be changed. It is a classic problem in data structure design with a wide range of applications in computing from coding to databases. In this work, we propose and compare several and practical solutions to this problem, showing that new trade-offs between the performance of queries and updates can be achieved on modern hardware.

preprint2020arXiv

Succinct Dynamic Ordered Sets with Random Access

The representation of a dynamic ordered set of $n$ integer keys drawn from a universe of size $m$ is a fundamental data structuring problem. Many solutions to this problem achieve optimal time but take polynomial space, therefore preserving time optimality in the \emph{compressed} space regime is the problem we address in this work. For a polynomial universe $m = n^{Θ(1)}$, we give a solution that takes $\textsf{EF}(n,m) + o(n)$ bits, where $\textsf{EF}(n,m) \leq n\lceil \log_2(m/n)\rceil + 2n$ is the cost in bits of the \emph{Elias-Fano} representation of the set, and supports random access to the $i$-th smallest element in $O(\log n/ \log\log n)$ time, updates and predecessor search in $O(\log\log n)$ time. These time bounds are optimal.

preprint2020arXiv

Techniques for Inverted Index Compression

The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.

Giulio Ermanno Pibiri

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

PTHash: Revisiting FCH Minimal Perfect Hashing

Rank/Select Queries over Mutable Bitmaps

Compressed Indexes for Fast Search of Semantic Data

Efficient and Effective Query Auto-Completion

Handling Massive N-Gram Datasets Efficiently

On Optimally Partitioning Variable-Byte Codes

Practical Trade-Offs for the Prefix-Sum Problem

Succinct Dynamic Ordered Sets with Random Access

Techniques for Inverted Index Compression