Source author record

Martin Farach-Colton

Martin Farach-Colton appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher; unclaimed source record

Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Computational Complexity; Databases; Networking and Internet Architecture

Catalog footprint

What is connected

9 works

5 topics

4 close collaborators


preprint, 2022, arXiv

GraphZeppelin: Storage-Friendly Sketching for Connected Components on Dynamic Graph Streams

Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task of computing connected components becomes more difficult when graphs are very large, or when they are dynamic, meaning the edge set changes over time subject to a stream of edge insertions and deletions. A natural approach to computing the connected components on a large, dynamic graph stream is to buy enough RAM to store the entire graph. However, the requirement that the graph fit in RAM is prohibitive for very large graphs. Thus, there is an unmet need for systems that can process dense dynamic graphs, especially when those graphs are larger than available RAM. We present a new high-performance streaming graph-processing system for computing the connected components of a graph. This system, which we call GraphZeppelin, uses new linear sketching data structures (CubeSketches) to solve the streaming connected components problem and as a result requires space asymptotically smaller than the space required for a lossless representation of the graph. GraphZeppelin is optimized for massive dense graphs: GraphZeppelin can process millions of edge updates (both insertions and deletions) per second, even when the underlying graph is far too large to fit in available RAM. As a result GraphZeppelin vastly increases the scale of graphs that can be processed.
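The role of linear sketching in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the CubeSketch data structure: it assumes a hypothetical `ToySketch` that can only recover an edge when exactly one edge survives in the sketched set, and the names `encode`, `decode`, and `_checksum` are illustrative. What it does show is linearity: an insertion followed by a deletion cancels, and XOR-ing the sketches of two nodes cancels the edge between them, leaving only edges that leave the merged component.

```python
import hashlib

N = 3  # number of nodes in this toy example (assumption)

def _checksum(idx: int) -> int:
    """64-bit hash of an edge index, used to validate a recovered index."""
    digest = hashlib.blake2b(idx.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def encode(u: int, v: int) -> int:
    """Map an undirected edge to a nonzero integer index."""
    u, v = min(u, v), max(u, v)
    return u * N + v + 1

def decode(idx: int) -> tuple:
    return divmod(idx - 1, N)

class ToySketch:
    """XOR of edge indices plus XOR of their checksums (a linear sketch over GF(2))."""
    def __init__(self, alpha: int = 0, gamma: int = 0):
        self.alpha, self.gamma = alpha, gamma

    def toggle(self, idx: int) -> None:
        # An insertion and a deletion are the same operation: toggling the edge.
        self.alpha ^= idx
        self.gamma ^= _checksum(idx)

    def merge(self, other: "ToySketch") -> "ToySketch":
        # Linearity: the sketch of a merged node set is the XOR of the sketches.
        return ToySketch(self.alpha ^ other.alpha, self.gamma ^ other.gamma)

    def recover(self):
        # Succeeds only when exactly one edge remains (up to hash collisions).
        if self.alpha != 0 and _checksum(self.alpha) == self.gamma:
            return decode(self.alpha)
        return None

def apply_stream(updates):
    """Process a stream of edge toggles into one sketch per node."""
    sketches = [ToySketch() for _ in range(N)]
    for u, v in updates:
        idx = encode(u, v)
        sketches[u].toggle(idx)
        sketches[v].toggle(idx)
    return sketches

# Insert (0,1), (1,2), (0,2), then delete (0,2).
sketches = apply_stream([(0, 1), (1, 2), (0, 2), (0, 2)])
print(sketches[0].recover())                     # edge incident to node 0
print(sketches[0].merge(sketches[1]).recover())  # edge leaving the set {0, 1}
```

Each node keeps a constant number of words here regardless of its degree, which is the property that lets sketch-based systems use less space than a lossless representation; a practical sketch needs many such samplers at different sampling rates to succeed when more than one edge survives.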

preprint, 2022, arXiv

Tight Bounds for Monotone Minimal Perfect Hashing

The monotone minimal perfect hash function (MMPHF) problem is the following indexing problem. Given a set $S= \{s_1,\ldots,s_n\}$ of $n$ distinct keys from a universe $U$ of size $u$, create a data structure $DS$ that answers the following query: \[ RankOp(q) = \text{rank of } q \text{ in } S \text{ for all } q\in S ~\text{ and arbitrary answer otherwise.} \] Solutions to the MMPHF problem are in widespread use in both theory and practice. The best upper bound known for the problem encodes $DS$ in $O(n\log\log\log u)$ bits and performs queries in $O(\log u)$ time. It has been an open problem to either improve the space upper bound or to show that this somewhat odd-looking bound is tight. In this paper, we show the latter: specifically that any data structure (deterministic or randomized) for monotone minimal perfect hashing of any collection of $n$ elements from a universe of size $u$ requires $\Omega(n \cdot \log\log\log{u})$ expected bits to answer every query correctly. We achieve our lower bound by defining a graph $\mathbf{G}$ where the nodes are the possible ${u \choose n}$ inputs and where two nodes are adjacent if they cannot share the same $DS$. The size of $DS$ is then lower bounded by the log of the chromatic number of $\mathbf{G}$. Finally, we show that the fractional chromatic number (and hence the chromatic number) of $\mathbf{G}$ is lower bounded by $2^{\Omega(n \log\log\log u)}$.
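The query semantics of RankOp can be made concrete with a naive baseline. This is a minimal sketch under stated assumptions: it stores every key explicitly, so it uses far more than $O(n\log\log\log u)$ bits and says nothing about the lower-bound argument. The names `build_rank_table` and `rank_op`, the 1-based ranks, and the choice to return 0 for keys outside $S$ are all illustrative.

```python
from typing import Dict, Iterable

def build_rank_table(keys: Iterable[int]) -> Dict[int, int]:
    """Map each key of S to its 1-based rank in sorted order."""
    return {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}

def rank_op(table: Dict[int, int], q: int) -> int:
    # For q not in S the problem permits an arbitrary answer;
    # this sketch arbitrarily returns 0.
    return table.get(q, 0)

table = build_rank_table([40, 7, 19])
print([rank_op(table, q) for q in (7, 19, 40)])
```

A space-efficient MMPHF exploits exactly the freedom this baseline ignores: because answers for $q \notin S$ may be arbitrary, the structure does not need to be able to tell members from non-members.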

preprint, 2020, arXiv

Streaming Complexity of Spanning Tree Computation

The semi-streaming model is a variant of the streaming model frequently used for the computation of graph problems. It allows the edges of an $n$-node input graph to be read sequentially in $p$ passes using $\tilde{O}(n)$ space. In this model, some graph problems, such as spanning trees and $k$-connectivity, can be exactly solved in a single pass; while other graph problems, such as triangle detection and unweighted all-pairs shortest paths, are known to require $\tilde{\Omega}(n)$ passes to compute. For many fundamental graph problems, the tractability in these models is open. In this paper, we study the tractability of computing some standard spanning trees. Our results are: (1) Maximum-Leaf Spanning Trees. This problem is known to be APX-complete with inapproximability constant $\rho\in[245/244,2)$. By constructing an $\varepsilon$-MLST sparsifier, we show that for every constant $\varepsilon > 0$, MLST can be approximated in a single pass to within a factor of $1+\varepsilon$ w.h.p. (albeit in super-polynomial time for $\varepsilon \le \rho-1$ assuming $\mathrm{P} \ne \mathrm{NP}$). (2) BFS Trees. It is known that BFS trees require $\omega(1)$ passes to compute, but the naïve approach needs $O(n)$ passes. We devise a new randomized algorithm that reduces the pass complexity to $O(\sqrt{n})$, and it offers a smooth tradeoff between pass complexity and space usage. (3) DFS Trees. The current best algorithm by Khan and Mehta [STACS 2019] takes $\tilde{O}(h)$ passes, where $h$ is the height of computed DFS trees. Our contribution is twofold. First, we provide a simple alternative proof of this result, via a new connection to sparse certificates for $k$-node-connectivity. Second, we present a randomized algorithm that reduces the pass complexity to $O(\sqrt{n})$, and it also offers a smooth tradeoff between pass complexity and space usage.
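The remark that an arbitrary spanning tree can be computed exactly in a single pass can be sketched as follows. This is a standard union-find pass over an insert-only edge stream, offered as an illustration of the semi-streaming model rather than as any of the paper's MLST, BFS, or DFS algorithms; the function name `spanning_forest` is an assumption. The working state is one parent pointer per node plus at most $n-1$ kept edges.

```python
from typing import Iterable, List, Tuple

def spanning_forest(n: int, edge_stream: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """One pass over the edges; keep an edge only if it joins two components."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

print(spanning_forest(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

The tree produced depends on the arrival order of the edges, which is why structured trees such as BFS or DFS trees are harder: a single greedy pass gives no control over depth or leaf count.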

preprint, 2015, arXiv

On the complexity of computing prime tables

Many large arithmetic computations rely on tables of all primes less than $n$. For example, the fastest algorithm for computing $n!$ takes time $O(M(n\log n) + P(n))$, where $M(n)$ is the time to multiply two $n$-bit numbers, and $P(n)$ is the time to compute a prime table up to $n$. The fastest algorithm to compute $\binom{n}{n/2}$ also uses a prime table. We show that it takes time $O(M(n) + P(n))$. In various models, the best bound on $P(n)$ is greater than $M(n\log n)$, given advances in the complexity of multiplication [Furer07, De08]. In this paper, we give two algorithms for computing prime tables and analyze their complexity on a multitape Turing machine, one of the standard models for analyzing such algorithms. These two algorithms run in time $O(M(n\log n))$ and $O(n\log^2 n/\log \log n)$, respectively. We achieve our results by speeding up Atkin's sieve. Given that the current best bound on $M(n)$ is $n\log n 2^{O(\log^*n)}$, the second algorithm is faster and improves on the previous best algorithm by a factor of $\log^2\log n$. Our fast prime-table algorithms speed up both the computation of $n!$ and $\binom{n}{n/2}$. Finally, we show that computing the factorial takes $\Omega(M(n \log^{4/7 - \varepsilon} n))$ for any constant $\varepsilon > 0$ assuming only multiplication is allowed.
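The way a prime table feeds into computing $n!$ can be illustrated in a few lines. This is a minimal sketch with swapped-in techniques: the table is built with the sieve of Eratosthenes rather than the sped-up Atkin's sieve of the paper, it includes $n$ itself, and the prime powers are multiplied naively rather than with fast multiplication. The exponent of each prime $p$ in $n!$ comes from Legendre's formula, $\sum_{k \ge 1} \lfloor n/p^k \rfloor$. The function names are illustrative.

```python
from typing import List

def prime_table(n: int) -> List[int]:
    """Sieve of Eratosthenes: all primes p with p <= n."""
    if n < 2:
        return []
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Cross out p*p, p*p + p, ... in one slice assignment.
            is_prime[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]

def factorial_from_primes(n: int) -> int:
    """Assemble n! from its prime factorization via Legendre's formula."""
    result = 1
    for p in prime_table(n):
        exponent, power = 0, p
        while power <= n:
            exponent += n // power
            power *= p
        result *= p ** exponent
    return result

print(prime_table(30))
print(factorial_from_primes(10))
```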

preprint, 2014, arXiv

Dynamic Windows Scheduling with Reallocation

We consider the Windows Scheduling problem. The problem is a restricted version of Unit-Fractions Bin Packing, and it is also called Inventory Replenishment in the context of Supply Chain. In brief, the problem is to schedule the use of communication channels to clients. Each client $c_i$ is characterized by an active cycle and a window $w_i$. During the period of time that any given client $c_i$ is active, there must be at least one transmission from $c_i$ scheduled in any $w_i$ consecutive time slots, but at most one transmission can be carried out in each channel per time slot. The goal is to minimize the number of channels used. We extend previous online models, where decisions are permanent, assuming that clients may be reallocated at some cost. We assume that such cost is a constant amount paid per reallocation. That is, we also aim to minimize the number of reallocations. We present three online reallocation algorithms for Windows Scheduling. We experimentally evaluate these protocols, showing that, in practice, all three achieve constant amortized reallocations with close to optimal channel usage. Our simulations also expose interesting trade-offs between reallocations and channel usage. We introduce a new objective function for WS with reallocations, which can also be applied to models where reallocations are not possible. We analyze this metric for one of the algorithms which, to the best of our knowledge, is the first online WS protocol with theoretical guarantees that applies to scenarios where clients may leave and the analysis is against current load rather than peak load. Using previous results, we also observe bounds on channel usage for one of the algorithms.
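The window constraint described above can be checked mechanically. This is a minimal sketch under simplifying assumptions that are not in the abstract: all clients are permanently active, and the schedule is cyclic with the same period on every channel. The function name `satisfies_windows` and the data layout (one list of client names per channel, `None` for an idle slot) are illustrative. A client meets its window $w_i$ exactly when the largest cyclic gap between its consecutive transmissions is at most $w_i$.

```python
from typing import Dict, List, Optional

def satisfies_windows(schedule: List[List[Optional[str]]], windows: Dict[str, int]) -> bool:
    """Check that every client transmits at least once in any w consecutive slots."""
    period = len(schedule[0])
    for client, w in windows.items():
        # Slots (mod period) in which the client transmits on any channel.
        slots = sorted({t for channel in schedule
                        for t, c in enumerate(channel) if c == client})
        if not slots:
            return False
        gaps = [b - a for a, b in zip(slots, slots[1:])]
        gaps.append(slots[0] + period - slots[-1])  # wrap-around gap
        if max(gaps) > w:
            return False
    return True

one_channel = [["a", "b", "a", "c"]]
print(satisfies_windows(one_channel, {"a": 2, "b": 4, "c": 4}))
print(satisfies_windows(one_channel, {"a": 2, "b": 3, "c": 4}))
```

In the example, client `a` with window 2 consumes half of the channel and `b` and `c` with window 4 consume a quarter each, so the single channel is fully used; tightening `b` to window 3 makes the same schedule infeasible.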

preprint, 2013, arXiv

Reallocation Problems in Scheduling

In traditional on-line problems, such as scheduling, requests arrive over time, demanding available resources. As each request arrives, some resources may have to be irrevocably committed to servicing that request. In many situations, however, it may be possible or even necessary to reallocate previously allocated resources in order to satisfy a new request. This reallocation has a cost. This paper shows how to service the requests while minimizing the reallocation cost. We focus on the classic problem of scheduling jobs on a multiprocessor system. Each unit-size job has a time window in which it can be executed. Jobs are dynamically added and removed from the system. We provide an algorithm that maintains a valid schedule, as long as a sufficiently feasible schedule exists. The algorithm reschedules only a total number of $O(\min\{\log^* n, \log^* \Delta\})$ jobs for each job that is inserted or deleted from the system, where $n$ is the number of active jobs and $\Delta$ is the size of the largest window.
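To give a feel for how slowly the reallocation bound grows, the iterated logarithm can be evaluated directly. This is a small worked example, assuming base-2 logarithms; the function names are illustrative, and the value returned by `reallocation_bound` ignores the constant hidden in the $O(\cdot)$ notation.

```python
import math

def log_star(n: float) -> int:
    """Iterated logarithm: how many times log2 must be applied until the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

def reallocation_bound(n: int, delta: int) -> int:
    """min(log* n, log* Delta), without the hidden constant factor."""
    return min(log_star(n), log_star(delta))

print(log_star(65536))                  # 65536 -> 16 -> 4 -> 2 -> 1
print(reallocation_bound(10**6, 16))
```

Even for $n = 2^{65536}$ the iterated logarithm is only 5, so for any realistic number of jobs or window size the bound behaves like a small constant.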

preprint, 2012, arXiv

Don't Thrash: How to Cache Your Hash on Flash

This paper presents new alternatives to the well-known Bloom filter data structure. The Bloom filter, a compact data structure supporting set insertion and membership queries, has found wide application in databases, storage systems, and networks. Because the Bloom filter performs frequent random reads and writes, it is used almost exclusively in RAM, limiting the size of the sets it can represent. This paper first describes the quotient filter, which supports the basic operations of the Bloom filter, achieving roughly comparable performance in terms of space and time, but with better data locality. Operations on the quotient filter require only a small number of contiguous accesses. The quotient filter has other advantages over the Bloom filter: it supports deletions, it can be dynamically resized, and two quotient filters can be efficiently merged. The paper then gives two data structures, the buffered quotient filter and the cascade filter, which exploit the quotient filter advantages and thus serve as SSD-optimized alternatives to the Bloom filter. The cascade filter has better asymptotic I/O performance than the buffered quotient filter, but the buffered quotient filter outperforms the cascade filter on small to medium data sets. Both data structures significantly outperform recently-proposed SSD-optimized Bloom filter variants, such as the elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom filter. In experiments, the cascade filter and buffered quotient filter performed insertions 8.6-11 times faster than the fastest Bloom filter variant and performed lookups 0.94-2.56 times faster.
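The locality problem that motivates the quotient filter can be seen in a plain Bloom filter. This is a minimal sketch of the classic Bloom filter only, not of the quotient filter, buffered quotient filter, or cascade filter; the class name, the parameters, and the use of salted BLAKE2b as the hash family are assumptions made for illustration. Each insertion or lookup touches `num_hashes` bit positions scattered across the whole array, which is cheap in RAM but translates into random I/O on flash.

```python
import hashlib
from typing import Iterator

class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # One salted hash per probe; the positions are effectively random,
        # so consecutive probes land far apart in the bit array.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add(b"flash")
print(b"flash" in bf)
```

This structure also shows why the extra features listed in the abstract are notable: clearing bits to delete an item could erase bits shared with other items, and the filter cannot be resized without rehashing the original keys.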

preprint, 2011, arXiv

Fault-Tolerant Aggregation: Flow-Updating Meets Mass-Distribution

Flow-Updating (FU) is a fault-tolerant technique that has proved to be efficient in practice for the distributed computation of aggregate functions in communication networks where individual processors do not have access to global information. Previous distributed aggregation protocols, based on repeated sharing of input values (or mass) among processors, sometimes called Mass-Distribution (MD) protocols, are not resilient to communication failures (or message loss) because such failures yield a loss of mass. In this paper, we present a protocol which we call Mass-Distribution with Flow-Updating (MDFU). We obtain MDFU by applying FU techniques to classic MD. We analyze the convergence time of MDFU, showing that stochastic message loss produces low overhead. This is the first convergence proof of an FU-based algorithm. We evaluate MDFU experimentally, comparing it with previous MD and FU protocols, and verifying the behavior predicted by the analysis. Finally, given that MDFU incurs a fixed deviation proportional to the message-loss rate, we adjust the accuracy of MDFU heuristically in a new protocol called MDFU with Linear Prediction (MDFU-LP). The evaluation shows that both MDFU and MDFU-LP behave very well in practice, even under high rates of message loss and even when the input values change dynamically.
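The claim that message loss destroys mass in classic Mass-Distribution protocols can be shown with a toy round. This is a minimal sketch of a simplified MD scheme, not of MDFU or Flow-Updating: it assumes a ring topology in which every node keeps half of its value and sends the other half to its clockwise neighbor, and the function name `md_round` and the `dropped` parameter are illustrative. Without loss the total is conserved and the values converge to the average; a single dropped message permanently removes mass, which biases the computed average.

```python
from typing import FrozenSet, List

def md_round(values: List[float], dropped: FrozenSet[int] = frozenset()) -> List[float]:
    """One synchronous round on a ring; `dropped` lists senders whose message is lost."""
    n = len(values)
    new_values = [v / 2 for v in values]  # each node keeps half of its mass
    for sender, v in enumerate(values):
        if sender in dropped:
            continue  # the message is lost, and the mass it carried with it
        new_values[(sender + 1) % n] += v / 2
    return new_values

start = [8.0, 0.0, 0.0, 0.0]
print(md_round(start), sum(md_round(start)))
print(md_round(start, frozenset({0})), sum(md_round(start, frozenset({0}))))
```

Flow-based techniques avoid this failure mode by having nodes exchange idempotent per-link state instead of handing over mass, so a lost message delays progress rather than corrupting the total.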

preprint, 2011, arXiv

Opportunistic Information Dissemination in Mobile Ad-hoc Networks: adaptiveness vs. obliviousness and randomization vs. determinism

In this paper the problem of information dissemination in Mobile Ad-hoc Networks (MANET) is studied. The problem is to disseminate a piece of information, initially held by a distinguished source node, to all nodes in a set defined by some predicate. We use a model of MANETs that is well suited for dynamic networks and opportunistic communication. In this model nodes are placed in a plane, in which they can move with bounded speed, and communication between nodes occurs over a collision-prone single channel. In this setup informed and uninformed nodes can be disconnected for some time (bounded by a parameter $\alpha$), but eventually some uninformed node must become a neighbor of an informed node and remain so for some time (bounded by a parameter $\beta$). In addition, nodes can start at different times, and they can crash and recover. Under the above framework, we show negative and positive results for different types of randomized protocols, and we put those results in perspective with respect to previous deterministic results.
