Source author record

Roberto Grossi

Roberto Grossi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms, Information Theory, math.IT, Computational Complexity, Databases

Catalog footprint

What is connected

13 works

5 topics

4 close collaborators

preprint, 2020, arXiv

On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must match a sequence of labels or types. We describe a simple conditional lower bound that, for any constant $ε>0$, an $O(|E|^{1 - ε} \, m)$-time or an $O(|E| \, m^{1 - ε})$-time algorithm for exact pattern matching on graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. The result holds even if restricted to undirected graphs of maximum degree three or directed acyclic graphs of maximum sum of indegree and outdegree three. Although a conditional lower bound of this kind can be derived from previous results (Backurs and Indyk, FOCS'16), we give a direct reduction from SETH for dissemination purposes, as the result might interest researchers from several areas, such as computational biology, graph databases, and graph mining, as mentioned before. Indeed, as approximate pattern matching on graphs can be solved in $O(|E|\,m)$ time, exact and approximate matching are thus equally hard (quadratic time) on graphs under the SETH assumption. In comparison, the same problems restricted to strings have linear time vs quadratic time solutions, respectively, where the latter ones have a matching SETH lower bound on computing the edit distance of two strings (Backurs and Indyk, STOC'15).
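To make the quadratic $O(|E|\,m)$ regime concrete, here is a minimal sketch of the straightforward dynamic program for exact matching on a node-labeled directed graph. It is not taken from the paper; the function name, the dictionary-based graph encoding, and the choice to let a match revisit nodes (which cannot happen in the DAG used below) are assumptions made for illustration.

```python
def matches_pattern(labels, edges, pattern):
    """Return True if some walk in the directed graph spells `pattern`.

    labels: dict mapping node -> single-character label (hypothetical encoding)
    edges:  list of (u, v) directed edges
    """
    if not pattern:
        return True
    # Nodes at which a walk spelling pattern[:1] can end.
    current = {v for v, c in labels.items() if c == pattern[0]}
    for ch in pattern[1:]:
        # Extend every partial match along one edge: O(|E|) work per symbol,
        # hence O(|E| * m) overall.
        current = {v for u, v in edges if u in current and labels[v] == ch}
        if not current:
            return False
    return True


# Small binary-labeled DAG used only as an example.
labels = {0: "0", 1: "1", 2: "0", 3: "1"}
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(matches_pattern(labels, edges, "0101"))
print(matches_pattern(labels, edges, "11"))
```

Each pattern symbol triggers one scan of the edge list, which is exactly the $|E| \cdot m$ product that the conditional lower bound says cannot be improved by a polynomial factor under SETH.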

preprint, 2020, arXiv

Zuckerli: A New Compressed Representation for Graphs

Zuckerli is a scalable compression system meant for large real-world graphs. Graphs are notoriously challenging structures to store efficiently due to their linked nature, which makes it hard to separate them into smaller, compact components. Therefore, effective compression is crucial when dealing with large graphs, which can have billions of nodes and edges. Furthermore, a good compression system should give the user fast and reasonably flexible access to parts of the compressed data without requiring full decompression, which may be unfeasible on their system. Zuckerli improves multiple aspects of WebGraph, the current state-of-the-art in compressing real-world graphs, by using advanced compression techniques and novel heuristic graph algorithms. It can produce both a compressed representation for storage and one which allows fast direct access to the adjacency lists of the compressed graph without decompressing the entire graph. We validate the effectiveness of Zuckerli on real-world graphs with up to a billion nodes and 90 billion edges, conducting an extensive experimental evaluation of both compression density and decompression performance. We show that Zuckerli-compressed graphs are 10% to 29% smaller, and more than 20% in most cases, with a resource usage for decompression comparable to that of WebGraph.

preprint, 2019, arXiv

Combinatorial Algorithms for String Sanitization

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge. In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by "reversing" the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.

preprint, 2015, arXiv

Enumerating Cyclic Orientations of a Graph

Acyclic and cyclic orientations of an undirected graph have been widely studied for their importance: an orientation is acyclic if it assigns a direction to each edge so as to obtain a directed acyclic graph (DAG) with the same vertex set; it is cyclic otherwise. As far as we know, only the enumeration of acyclic orientations has been addressed in the literature. In this paper, we pose the problem of efficiently enumerating all the cyclic orientations of an undirected connected graph with $n$ vertices and $m$ edges, observing that it cannot be solved using algorithmic techniques previously employed for enumerating acyclic orientations. We show that the problem is of independent interest from both combinatorial and algorithmic points of view, and that each cyclic orientation can be listed with $\tilde{O}(m)$ delay time. Space usage is $O(m)$ with an additional setup cost of $O(n^2)$ time before the enumeration begins, or $O(mn)$ with a setup cost of $\tilde{O}(m)$ time.
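The definition of a cyclic orientation can be illustrated with a brute-force sketch that tries all $2^m$ orientations and keeps those containing a directed cycle. This is only a baseline for tiny graphs, not the $\tilde{O}(m)$-delay algorithm of the paper; the function names and the edge-list encoding are assumptions.

```python
from itertools import product


def has_directed_cycle(n, arcs):
    """Kahn's algorithm: a digraph has a cycle iff not every vertex can be peeled off."""
    indegree = [0] * n
    out = [[] for _ in range(n)]
    for u, v in arcs:
        out[u].append(v)
        indegree[v] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in out[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return removed < n


def cyclic_orientations(n, edges):
    """Yield every orientation of `edges` (as a list of arcs) that is cyclic."""
    for flips in product((False, True), repeat=len(edges)):
        arcs = [(v, u) if flip else (u, v) for (u, v), flip in zip(edges, flips)]
        if has_directed_cycle(n, arcs):
            yield arcs


triangle = [(0, 1), (1, 2), (2, 0)]
print(len(list(cyclic_orientations(3, triangle))))  # the two rotations of the triangle
```

The exhaustive loop spends time on acyclic orientations that are then discarded, which is the kind of waste a bounded-delay enumeration is designed to avoid.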

preprint, 2014, arXiv

Amortized $\tilde{O}(|V|)$-Delay Algorithm for Listing Chordless Cycles in Undirected Graphs

Chordless cycles are very natural structures in undirected graphs, with an important history and distinguished role in graph theory. Motivated also by previous work on the classical problem of listing cycles, we study how to list chordless cycles. The best known solution to list all the $C$ chordless cycles contained in an undirected graph $G = (V,E)$ takes $O(|E|^2 + |E| \cdot C)$ time. In this paper we provide an algorithm taking $\tilde{O}(|E| + |V| \cdot C)$ time. We also show how to obtain the same complexity for listing all the $P$ chordless $st$-paths in $G$ (where $C$ is replaced by $P$).

preprint, 2013, arXiv

Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g. records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$ time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves $O(\log n)$ worst case time per input symbol. Searching for a pattern of length $m$ in the resulting suffix tree takes $O(\min(m\log |Σ|, m + \log n) + tocc)$ time, where $tocc$ is the number of occurrences of the pattern. The paper also describes more applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance.

preprint, 2012, arXiv

Optimal Listing of Cycles and st-Paths in Undirected Graphs

We present the first optimal algorithm for the classical problem of listing all the cycles in an undirected graph. We exploit their properties so that the total cost is the time taken to read the input graph plus the time to list the output, namely, the edges in each of the cycles. The algorithm uses a reduction to the problem of listing all the paths from a vertex s to a vertex t which we also solve optimally.
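For orientation, the st-path listing problem can be stated in code with a plain backtracking search. This sketch is a naive baseline and is not the optimal algorithm the abstract refers to: it can spend a long time in dead ends that never reach t. The function name and the adjacency-dictionary encoding are assumptions.

```python
def list_st_paths(adj, s, t):
    """Yield every simple path from s to t as a list of vertices.

    adj: dict mapping each vertex to a list of its neighbours (undirected graph).
    """
    path = [s]
    on_path = {s}

    def extend(u):
        if u == t:
            yield list(path)
            return
        for v in adj[u]:
            if v not in on_path:
                # Tentatively add v, explore, then undo the choice.
                path.append(v)
                on_path.add(v)
                yield from extend(v)
                path.pop()
                on_path.remove(v)

    yield from extend(s)


# A 4-cycle 0-1-2-3-0: two simple paths connect vertex 0 to vertex 2.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for p in list_st_paths(square, 0, 2):
    print(p)
```

An output-sensitive algorithm in the sense of the abstract would bound the total work by the input size plus the total length of the listed paths, a guarantee this baseline does not give.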

preprint, 2012, arXiv

The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.

preprint, 2011, arXiv

Consecutive Ones Property and PQ-Trees for Multisets: Hardness of Counting Their Orderings

A binary matrix satisfies the consecutive ones property (COP) if its columns can be permuted such that the ones in each row of the resulting matrix are consecutive. Equivalently, a family of sets F = {Q_1,...,Q_m}, where Q_i is a subset of R for some universe R, satisfies the COP if the symbols in R can be permuted such that the elements of each set Q_i occur consecutively, as a contiguous segment of the permutation of R's symbols. We consider the COP version on multisets and prove that counting its solutions is difficult (#P-complete). We prove completeness results also for counting the frontiers of PQ-trees, which are typically used for testing the COP on sets, thus showing that a polynomial algorithm is unlikely to exist when dealing with multisets. We use a combinatorial approach based on parsimonious reductions from the Hamiltonian path problem, showing that the decisional version of our problems is therefore NP-complete.
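The definition of the consecutive ones property for ordinary binary matrices can be checked directly, if inefficiently, by trying every column permutation. The sketch below is such an exhaustive check for small matrices only; it is not a PQ-tree algorithm and does not cover the multiset variant studied in the paper. The function names are assumptions.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if the 1-entries of `row` form one contiguous block (or there are none)."""
    ones = [i for i, x in enumerate(row) if x == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def has_consecutive_ones(matrix):
    """Exhaustively test whether some column permutation makes every row's ones consecutive."""
    n_cols = len(matrix[0]) if matrix else 0
    for perm in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            return True
    return False


print(has_consecutive_ones([[1, 0, 1], [0, 1, 1]]))             # column order (0, 2, 1) works
print(has_consecutive_ones([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))  # every pair of columns would need to be adjacent
```

The second matrix fails because three columns cannot be arranged in a line so that all three pairs are adjacent.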

preprint, 2011, arXiv

Fast Compressed Tries through Path Decompositions

Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. Over fifty years of usage have produced many variants and implementations to overcome some of their limitations. We explore new succinct representations of path-decomposed tries and experimentally evaluate the corresponding reduction in space usage and memory latency, comparing with the state of the art. We study two cases of applications: (1) a compressed dictionary for (compressed) strings, and (2) a monotone minimal perfect hash for strings that preserves their lexicographic order. For (1), we obtain data structures that outperform other state-of-the-art compressed dictionaries in space efficiency, while obtaining predictable query times that are competitive with data structures preferred by the practitioners. In (2), our tries perform several times faster than other trie-based monotone perfect hash functions, while occupying nearly the same space.

preprint, 2011, arXiv

Partial Data Compression and Text Indexing via Optimal Suffix Multi-Selection

Consider an input text string T[1,N] drawn from an unbounded alphabet. We study partial computation in suffix-based problems for Data Compression and Text Indexing such as (I) retrieve any segment of K<=N consecutive symbols from the Burrows-Wheeler transform of T, and (II) retrieve any chunk of K<=N consecutive entries of the Suffix Array or the Suffix Tree. Prior literature would take O(N log N) comparisons (and time) to solve these problems by solving the total problem of building the entire Burrows-Wheeler transform or Text Index for T, and performing a post-processing to single out the wanted portion. We introduce a novel adaptive approach to the partial-computation problems above, and solve both the partial problems in O(K log K + N) comparisons and time, improving the best known running times of O(N log N) for K=o(N). These partial-computation problems are intimately related since they share a common bottleneck: the suffix multi-selection problem, which is to output the suffixes of rank r_1,r_2,...,r_K under the lexicographic order, where r_1<r_2<...<r_K, r_i in [1,N]. Special cases of this problem are well known: K=N is the suffix sorting problem that is the workhorse in Stringology with hundreds of applications, and K=1 is the recently studied suffix selection. We show that suffix multi-selection can be solved in Theta(N log N - sum_{j=0}^K Delta_j log Delta_j+N) time and comparisons, where r_0=0, r_{K+1}=N+1, and Delta_j=r_{j+1}-r_j for 0<=j<=K. This is asymptotically optimal, and also matches the bound in [Dobkin, Munro, JACM 28(3)] for multi-selection on atomic elements (not suffixes). Matching the bound known for atomic elements for strings is a long running theme and challenge from the 1970s, which we achieve for the suffix multi-selection problem. The partial suffix problems as well as the suffix multi-selection problem have many applications.
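The suffix multi-selection problem can be pinned down with the "total" baseline the abstract contrasts against: sort every suffix, then pick out the requested ranks. The sketch below does exactly that and therefore does not achieve the adaptive bound of the paper; the function name and the 1-based position convention are assumptions.

```python
def suffix_multiselect_naive(text, ranks):
    """Return the 1-based starting positions of the suffixes with the given 1-based ranks.

    Baseline only: it sorts all suffixes (comparing whole suffix strings)
    and then reads off the wanted ranks.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [order[r - 1] + 1 for r in ranks]


# Sorted suffixes of "banana": a, ana, anana, banana, na, nana
print(suffix_multiselect_naive("banana", [1, 4]))  # → [6, 1]
```

With K = N ranks this reduces to suffix sorting, and with K = 1 to suffix selection, the two special cases named in the abstract.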

preprint, 2010, arXiv

MADMX: A Novel Strategy for Maximal Dense Motif Extraction

We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by MADMX.
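The density measure defined above is simple enough to compute directly. In this small sketch the don't care symbol is written as "." and the function name is hypothetical; neither comes from the MADMX tool itself.

```python
def density(motif, dont_care="."):
    """Ratio of solid (non-don't-care) characters to the total motif length."""
    if not motif:
        raise ValueError("density is undefined for an empty motif")
    solid = sum(1 for ch in motif if ch != dont_care)
    return solid / len(motif)


print(density("A.CG"))  # 3 solid characters out of 4 → 0.75
```

A motif with no don't cares has density 1.0, and lowering a density threshold admits motifs with proportionally more don't care positions.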

preprint, 2010, arXiv

Optimal Trade-Off for Succinct String Indexes

Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs, require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the first p positions of s. We give matching upper and lower bounds for this problem, improving the lower bounds given by Golynski [Theor. Comput. Sci. 387 (2007)] [PhD thesis] and the upper bounds of Barbay et al. [SODA 2007]. We also present new results in another model, improving on Barbay et al. [SODA 2007] and matching a lower bound of Golynski [SODA 2009]. The main contribution of this paper is to introduce a general technique for proving lower bounds on succinct data structures, that is based on the access patterns of the supported operations, abstracting from the particular operations at hand. For this, it may find application to other interesting problems on succinct data structures.
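The three operations named in the abstract can be written as linear-scan reference implementations, useful mainly for fixing the semantics. These are not succinct data structures and say nothing about the space and time trade-offs the paper studies; the 1-based indexing follows the abstract's wording, and the choice to return `None` when an occurrence does not exist is an assumption.

```python
def access(s, i):
    """Symbol at 1-based position i of s."""
    return s[i - 1]


def rank(s, c, p):
    """Number of occurrences of c among the first p positions of s."""
    return s[:p].count(c)


def select(s, c, j):
    """1-based position of the j-th occurrence of c in s, or None if there is none."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == c:
            seen += 1
            if seen == j:
                return pos
    return None


s = "abracadabra"
print(rank(s, "a", 4))    # "abra" contains two a's → 2
print(select(s, "a", 3))  # a's occur at positions 1, 4, 6, 8, 11 → 6
```

The two queries are inverse in the sense that `rank(s, c, select(s, c, j)) == j` whenever the j-th occurrence exists.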
