Source author record

Michael Mitzenmacher

Michael Mitzenmacher appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

61 works

25 topics

4 close collaborators

preprint · 2026 · arXiv

Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time $O(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.
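To make the construction concrete, here is a minimal sketch of a randomized Hadamard transform and its composition. It assumes the common form of an RHT (independent random sign flips followed by a normalized Walsh-Hadamard transform) and a power-of-two dimension; the function names `fwht`, `rht`, and `compose_rhts` are illustrative and do not come from the paper. The moment-based check for choosing the number of rounds is not sketched, since the abstract does not specify it.

```python
import math
import random


def fwht(x):
    """Normalized fast Walsh-Hadamard transform (orthogonal), O(d log d)."""
    x = list(x)
    d = len(x)
    assert d > 0 and d & (d - 1) == 0, "length must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(d)
    return [v * scale for v in x]


def rht(x, rng):
    """One RHT: random +/-1 sign flips, then the normalized Hadamard matrix."""
    signs = [rng.choice((-1.0, 1.0)) for _ in x]
    return fwht([s * v for s, v in zip(signs, x)])


def compose_rhts(x, rounds, rng):
    """Apply `rounds` independent RHTs in sequence."""
    for _ in range(rounds):
        x = rht(x, rng)
    return x


def norm(x):
    return math.sqrt(sum(v * v for v in x))


if __name__ == "__main__":
    rng = random.Random(0)
    basis = [1.0] + [0.0] * 15           # a hard input for a single RHT
    once = rht(basis, rng)
    print(sorted(set(abs(v) for v in once)))   # every coordinate has the same magnitude
    twice = compose_rhts(basis, 2, rng)
    print(norm(twice))                          # orthogonality preserves the norm
```

A standard basis vector illustrates why one RHT can fail in the worst case: every output coordinate is exactly $\pm d^{-1/2}$, which is far from Gaussian, whereas a second round mixes these signs.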

preprint · 2026 · arXiv

The Mixed Birth-death/death-Birth Moran Process

We study evolutionary dynamics on graphs in which each step consists of one birth and one death, a setting known as the Moran process. There are two types of individuals: residents with fitness $1$ and mutants with fitness $r$. Two standard update rules are used in the literature. In Birth-death (Bd), a vertex is chosen to reproduce proportional to fitness, and one of its neighbors is selected uniformly at random to be replaced by the offspring. In death-Birth (dB), a vertex is chosen uniformly to die, and then one of its neighbors is chosen, proportional to fitness, to place an offspring into the vacancy. We formalize and study a unified model, the $λ$-mixed Moran process, in which each step is independently a Bd step with probability $λ\in [0,1]$ and a dB step otherwise. We analyze this mixed process for undirected, connected graphs. As an interesting special case, we show that at $λ=1/2$, for any graph, the fixation probability when $r=1$ with a single mutant initially on the graph is exactly $1/n$, and that at $λ=1/2$ the absorption time for any $r$ is $O_r(n^4)$. We also show results for graphs that are "almost regular," in a manner defined in the paper. We use this to show that for suitable random graphs $G \sim G(n,p)$ and fixed $r>1$, with high probability over the choice of graph, the absorption time is $O_r(n^4)$, the fixation probability is $Ω_r(n^{-2})$, and we can approximate the fixation probability in polynomial time. Another special case is when the graph has only two distinct degree values $\{d_1, d_2\}$ with $d_1 \leq d_2$. For those graphs, we give exact formulas for fixation probabilities when $r = 1$ and any $λ$, and establish an absorption time of $O_r(n^4 α^4)$ for all $λ$, where $α= d_2 / d_1$. We also provide explicit formulas for the star and cycle under any $r$ or $λ$.
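The update rules described above can be simulated directly. The following is a rough Monte Carlo sketch, not the paper's analysis; the adjacency-dictionary representation and the names `mixed_moran_fixates` and `estimate_fixation_probability` are assumptions made for illustration.

```python
import random


def mixed_moran_fixates(adj, r, lam, start, rng):
    """Run one lambda-mixed Moran process; return True if the mutants fixate.

    adj maps each vertex to a list of its neighbors (undirected, connected).
    """
    vertices = list(adj)
    n = len(vertices)
    mutant = {v: False for v in vertices}
    mutant[start] = True
    count = 1
    while 0 < count < n:
        if rng.random() < lam:
            # Birth-death: parent chosen proportional to fitness,
            # then a uniformly random neighbor is replaced.
            weights = [r if mutant[v] else 1.0 for v in vertices]
            parent = rng.choices(vertices, weights=weights)[0]
            child = rng.choice(adj[parent])
        else:
            # death-Birth: a uniformly random vertex dies, then a neighbor
            # chosen proportional to fitness fills the vacancy.
            child = rng.choice(vertices)
            neighbors = adj[child]
            weights = [r if mutant[u] else 1.0 for u in neighbors]
            parent = rng.choices(neighbors, weights=weights)[0]
        if mutant[child] != mutant[parent]:
            count += 1 if mutant[parent] else -1
            mutant[child] = mutant[parent]
    return count == n


def estimate_fixation_probability(adj, r, lam, start, trials, rng):
    wins = sum(mixed_moran_fixates(adj, r, lam, start, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    rng = random.Random(1)
    # With r = 1 and lambda = 1/2 the abstract states the value is exactly 1/n.
    print(estimate_fixation_probability(star, 1.0, 0.5, 1, 4000, rng))
```

On the four-vertex star with a neutral mutant and $λ=1/2$, the estimate should land near $1/4$, in line with the $1/n$ result stated above.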

preprint · 2023 · arXiv

Analyzing Generalized Pólya Urn Models using Martingales, with an Application to Viral Evolution

The randomized play-the-winner (RPW) model is a generalized Pólya Urn process with broad applications ranging from clinical trials to molecular evolution. We derive an exact expression for the variance of the RPW model by transforming the Pólya Urn process into a martingale, correcting an earlier result of Matthews and Rosenberger (1997). We then use this result to approximate the full probability mass function of the RPW model for certain parameter values relevant to genetic applications. Finally, we fit our model to genomic sequencing data of SARS-CoV-2, demonstrating a novel method of estimating the viral mutation rate that delivers comparable results to existing scientific literature.

preprint · 2022 · arXiv

EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning

Distributed Mean Estimation (DME) is a central building block in federated learning, where clients send local gradients to a parameter server for averaging and updating the model. Due to communication constraints, clients often use lossy compression techniques to compress the gradients, resulting in estimation inaccuracies. DME is more challenging when clients have diverse network conditions, such as constrained communication budgets and packet losses. In such settings, DME techniques often incur a significant increase in the estimation error leading to degraded learning performance. In this work, we propose a robust DME technique named EDEN that naturally handles heterogeneous communication budgets and packet losses. We derive appealing theoretical guarantees for EDEN and evaluate it empirically. Our results demonstrate that EDEN consistently improves over state-of-the-art DME techniques.

preprint · 2022 · arXiv

Incentive Compatible Queues Without Money

For job scheduling systems, where jobs require some amount of processing and then leave the system, it is natural for each user to provide an estimate of their job's time requirement in order to aid the scheduler. However, if there is no incentive mechanism for truthfulness, each user will be motivated to provide estimates that give their job precedence in the schedule, so that the job completes as early as possible. We examine how to make such scheduling systems incentive compatible, without using monetary charges, under a natural queueing theory framework. In our setup, each user has an estimate of their job's running time, but it is possible for this estimate to be incorrect. We examine scheduling policies where if a job exceeds its estimate, it is with some probability "punished" and re-scheduled after other jobs, to disincentivize underestimates of job times. However, because user estimates may be incorrect (without any malicious intent), excessive punishment may incentivize users to overestimate their job times, which leads to less efficient scheduling. We describe two natural scheduling policies, BlindTrust and MeasuredTrust. We show that, for both of these policies, given the parameters of the system, we can efficiently determine the set of punishment probabilities that are incentive compatible, in that users are incentivized to provide their actual estimate of the job time. Moreover, we prove for MeasuredTrust that in the limit as estimates converge to perfect accuracy, the range of punishment probabilities that are incentive compatible converges to $[0,1]$. Our formalism establishes a framework for studying further queue-based scheduling problems where job time estimates from users are utilized, and the system needs to incentivize truthful reporting of estimates.

preprint · 2022 · arXiv

Proteus: A Self-Designing Range Filter

We introduce Proteus, a novel self-designing approximate range filter, which configures itself based on sampled data in order to optimize its false positive rate (FPR) for a given space requirement. Proteus unifies the probabilistic and deterministic design spaces of state-of-the-art range filters to achieve robust performance across a larger variety of use cases. At the core of Proteus lies our Contextual Prefix FPR (CPFPR) model - a formal framework for the FPR of prefix-based filters across their design spaces. We empirically demonstrate the accuracy of our model and Proteus' ability to optimize over both synthetic workloads and real-world datasets. We further evaluate Proteus in RocksDB and show that it is able to improve end-to-end performance by as much as 5.3x over more brittle state-of-the-art methods such as SuRF and Rosetta. Our experiments also indicate that the cost of modeling is not significant compared to the end-to-end performance gains and that Proteus is robust to workload shifts.

preprint · 2022 · arXiv

The Supermarket Model with Known and Predicted Service Times

The supermarket model refers to a system with a large number of queues, where new customers choose d queues at random and join the one with the fewest customers. This model demonstrates the power of even small amounts of choice, as compared to simply joining a queue chosen uniformly at random, for load balancing systems. In this work we perform simulation-based studies to consider variations where service times for a customer are predicted, as might be done in modern settings using machine learning techniques or related mechanisms. Our primary takeaway is that using even seemingly weak predictions of service times can yield significant benefits over blind First In First Out queueing in this context. However, some care must be taken when using predicted service time information to both choose a queue and order elements for service within a queue; while in many cases using the information for both choosing and ordering is beneficial, in many of our simulation settings we find that simply using the number of jobs to choose a queue is better when using predicted service times to order jobs in a queue. In our simulations, we evaluate both synthetic and real-world workloads--in the latter, service times are predicted by machine learning. Our results provide practical guidance for the design of real-world systems; moreover, we leave many natural theoretical open questions for future work, validating their relevance to real-world situations.
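As a point of reference for the baseline model only, here is a small simulation sketch of the supermarket model. It assumes Poisson arrivals and exponential service times with rate 1 and does not model predicted service times or within-queue ordering, which are the paper's actual subject. The function name and parameters are illustrative.

```python
import random


def simulate_supermarket(n, d, lam, num_events, rng):
    """Time-averaged number of customers per queue.

    n queues, arrival rate lam * n, exponential service with rate 1 per queue.
    Each arrival samples d queues uniformly (with replacement) and joins the
    shortest. Uses uniformization: every queue "rings" at rate 1, and a ring at
    an empty queue is a null event.
    """
    lengths = [0] * n
    total_rate = n * (lam + 1.0)
    in_system = 0
    area = 0.0
    elapsed = 0.0
    for _ in range(num_events):
        dt = rng.expovariate(total_rate)
        area += in_system * dt
        elapsed += dt
        if rng.random() < lam / (lam + 1.0):
            choices = [rng.randrange(n) for _ in range(d)]
            target = min(choices, key=lengths.__getitem__)
            lengths[target] += 1
            in_system += 1
        else:
            q = rng.randrange(n)
            if lengths[q] > 0:
                lengths[q] -= 1
                in_system -= 1
    return area / (elapsed * n)


if __name__ == "__main__":
    for d in (1, 2):
        avg = simulate_supermarket(20, d, 0.9, 100000, random.Random(0))
        print(d, round(avg, 2))
```

Comparing `d = 1` with `d = 2` at load 0.9 shows the effect the abstract calls the power of even small amounts of choice: the average queue length drops substantially with two choices.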

preprint · 2022 · arXiv

Uniform Bounds for Scheduling with Job Size Estimates

We consider the problem of scheduling to minimize mean response time in M/G/1 queues where only estimated job sizes (processing times) are known to the scheduler, where a job of true size $s$ has estimated size in the interval $[βs, αs]$ for some $α\geq β> 0$. We evaluate each scheduling policy by its approximation ratio, which we define to be the ratio between its mean response time and that of Shortest Remaining Processing Time (SRPT), the optimal policy when true sizes are known. Our question: is there a scheduling policy that (a) has approximation ratio near 1 when $α$ and $β$ are near 1, (b) has approximation ratio bounded by some function of $α$ and $β$ even when they are far from 1, and (c) can be implemented without knowledge of $α$ and $β$? We first show that naively running SRPT using estimated sizes in place of true sizes is not such a policy: its approximation ratio can be arbitrarily large for any fixed $β< 1$. We then provide a simple variant of SRPT for estimated sizes that satisfies criteria (a), (b), and (c). In particular, we prove its approximation ratio approaches 1 uniformly as $α$ and $β$ approach 1. This is the first result showing this type of convergence for M/G/1 scheduling. We also study the Preemptive Shortest Job First (PSJF) policy, a cousin of SRPT. We show that, unlike SRPT, naively running PSJF using estimated sizes in place of true sizes satisfies criteria (b) and (c), as well as a weaker version of (a).

preprint · 2021 · arXiv

Dynamic Longest Increasing Subsequence and the Erdös-Szekeres Partitioning Problem

In this paper, we provide new approximation algorithms for dynamic variations of the longest increasing subsequence (LIS) problem and the complementary distance to monotonicity (DTM) problem. In this setting, operations of the following form arrive sequentially: (i) add an element, (ii) remove an element, or (iii) substitute an element for another. At every point in time, the algorithm has an approximation to the longest increasing subsequence (or distance to monotonicity). We present a $(1+ε)$-approximation algorithm for DTM with polylogarithmic worst-case update time and a constant-factor approximation algorithm for LIS with worst-case update time $\tilde O(n^ε)$ for any constant $ε> 0$, where $n$ in the runtime denotes the size of the array at the time the operation arrives. Our dynamic algorithm for LIS leads to an almost optimal algorithm for the Erdös-Szekeres partitioning problem. The Erdös-Szekeres partitioning problem was introduced by Erdös and Szekeres in 1935 and was known to be solvable in time $O(n^{1.5}\log n)$. Subsequent work improved the runtime to $O(n^{1.5})$ only in 1998. Our dynamic LIS algorithm leads to a solution for the Erdös-Szekeres partitioning problem with runtime $\tilde O_ε(n^{1+ε})$ for any constant $ε> 0$.

preprint · 2021 · arXiv

SALSA: Self-Adjusting Lean Streaming Analytics

Counters are the fundamental building block of many data sketching schemes, which hash items to a small number of counters and account for collisions to provide good approximations for frequencies and other measures. Most existing methods rely on fixed-size counters, which may be wasteful in terms of space, as counters must be large enough to eliminate any risk of overflow. Instead, some solutions use small, fixed-size counters that may overflow into secondary structures. This paper takes a different approach. We propose a simple and general method called SALSA for dynamic re-sizing of counters and show its effectiveness. SALSA starts with small counters, and overflowing counters simply merge with their neighbors. SALSA can thereby allow more counters for a given space, expanding them as necessary to represent large numbers. Our evaluation demonstrates that, at the cost of a small overhead for its merging logic, SALSA significantly improves the accuracy of popular schemes (such as Count-Min Sketch and Count Sketch) over a variety of tasks. Our code is released as open-source [1].

preprint · 2020 · arXiv

Algorithms with Predictions

We introduce algorithms that use predictions from machine learning applied to the input to circumvent worst-case analysis. We aim for algorithms that have near optimal performance when these predictions are good, but recover the prediction-less worst case behavior when the predictions have large errors.

preprint · 2020 · arXiv

Faster and More Accurate Measurement through Additive-Error Counters

Counters are a fundamental building block for networking applications such as load balancing, traffic engineering, and intrusion detection, which require estimating flow sizes and identifying heavy hitter flows. Existing works suggest replacing counters with shorter multiplicative error estimators that improve the accuracy by fitting more of them within a given space. However, such estimators impose a computational overhead that degrades the measurement throughput. Instead, we propose additive error estimators, which are simpler, faster, and more accurate when used for network measurement. Our solution is rigorously analyzed and empirically evaluated against several other measurement algorithms on real Internet traces. For a given error target, we improve the speed of the uncompressed solutions by $5\times$-$30\times$, and the space by up to $4\times$. Compared with existing state-of-the-art estimators, our solution is $9\times$-$35\times$ faster while being considerably more accurate.

preprint · 2020 · arXiv

PINT: Probabilistic In-band Network Telemetry

Commodity network devices support adding in-band telemetry measurements into data packets, enabling a wide range of applications, including network troubleshooting, congestion control, and path tracing. However, including such information on packets adds significant overhead that impacts both flow completion times and application-level performance. We introduce PINT, an in-band telemetry framework that bounds the amount of information added to each packet. PINT encodes the requested data on multiple packets, allowing per-packet overhead limits that can be as low as one bit. We analyze PINT and prove performance bounds, including cases when multiple queries are running simultaneously. PINT is implemented in P4 and can be deployed on network devices. Using real topologies and traffic characteristics, we show that PINT concurrently enables applications such as congestion control, path tracing, and computing tail latencies, using only sixteen bits per packet, with performance comparable to the state of the art.

preprint · 2020 · arXiv

Queues with Small Advice

Motivated by recent work on scheduling with predicted job sizes, we consider the performance of scheduling algorithms with minimal advice, namely a single bit. Besides demonstrating the power of very limited advice, such schemes are quite natural. In the prediction setting, one bit of advice can be used to model a simple prediction as to whether a job is "large" or "small"; that is, whether a job is above or below a given threshold. Further, one-bit advice schemes can correspond to mechanisms that tell whether to put a job at the front or the back of the queue, a limitation which may be useful in many implementation settings. Finally, queues with a single bit of advice have a simple enough state that they can be analyzed in the limiting mean-field analysis framework for the power of two choices. Our work follows in the path of recent work by showing that even small amounts of even possibly inaccurate information can greatly improve scheduling performance.

preprint · 2016 · arXiv

2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search

The method of random projections has become a standard tool for machine learning, data mining, and search with massive data at Web scale. The effective use of random projections requires efficient coding schemes for quantizing (real-valued) projected data into integers. In this paper, we focus on a simple 2-bit coding scheme. In particular, we develop accurate nonlinear estimators of data similarity based on the 2-bit strategy. This work will have important practical applications. For example, in the task of near neighbor search, a crucial step (often called re-ranking) is to compute or estimate data similarities once a set of candidate data points have been identified by hash table techniques. This re-ranking step can take advantage of the proposed coding scheme and estimator. As a related task, in this paper, we also study a simple uniform quantization scheme for the purpose of building hash tables with projected data. Our analysis shows that typically only a small number of bits are needed. For example, when the target similarity level is high, 2 or 3 bits might be sufficient. When the target similarity level is not so high, it is preferable to use only 1 or 2 bits. Therefore, a 2-bit scheme appears to be overall a good choice for the task of sublinear time approximate near neighbor search via hash tables. Combining these results, we conclude that 2-bit random projections should be recommended for approximate near neighbor search and similarity estimation. Extensive experimental results are provided.
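A small sketch may help picture what a 2-bit code of a projected value looks like. It assumes Gaussian random projections and a four-region quantizer with thresholds at $-w$, $0$, and $w$; the threshold value, the function names, and the use of a raw code-collision rate are illustrative assumptions. In particular, the collision rate below is only a crude proxy and is not the nonlinear estimator developed in the paper.

```python
import random


def gaussian_projections(dim, k, rng):
    """k random projection directions with i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def two_bit_code(value, w):
    """Map a real projected value to one of four regions (2 bits)."""
    if value < -w:
        return 0
    if value < 0.0:
        return 1
    if value < w:
        return 2
    return 3


def encode(x, projections, w):
    """2-bit code for each projection of the vector x."""
    return [two_bit_code(sum(p * v for p, v in zip(proj, x)), w)
            for proj in projections]


def collision_rate(codes_a, codes_b):
    """Fraction of projections on which two code sequences agree."""
    return sum(1 for a, b in zip(codes_a, codes_b) if a == b) / len(codes_a)


if __name__ == "__main__":
    rng = random.Random(0)
    projections = gaussian_projections(dim=8, k=256, rng=rng)
    x = [1.0, 0.0, 0.5, -0.5, 0.0, 0.25, 0.0, 0.0]
    y = [0.9, 0.1, 0.5, -0.4, 0.0, 0.25, 0.1, 0.0]
    print(collision_rate(encode(x, projections, 0.75),
                         encode(y, projections, 0.75)))
```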

preprint · 2016 · arXiv

Analyzing Distributed Join-Idle-Queue: A Fluid Limit Approach

In the context of load balancing, Lu et al. introduced the distributed Join-Idle-Queue algorithm, where a group of dispatchers distribute jobs to a cluster of parallel servers. Each dispatcher maintains a queue of idle servers; when a job arrives to a dispatcher, it sends it to a server on its queue, or to a random server if the queue is empty. In turn, when a server has no jobs, it requests to be placed on the idle queue of a randomly chosen dispatcher. Although this algorithm was shown to be quite effective, the original asymptotic analysis makes simplifying assumptions that become increasingly inaccurate as the system load increases. Further, the analysis does not naturally generalize to interesting variations, such as having a server request to be placed on the idle queue of a dispatcher before it has completed all jobs, which can be beneficial under high loads. We provide a new asymptotic analysis of Join-Idle-Queue systems based on mean field fluid limit methods, deriving families of differential equations that describe these systems. Our analysis avoids previous simplifying assumptions, is empirically more accurate, and generalizes naturally to the variation described above, as well as other simple variations. Our theoretical and empirical analyses shed further light on the performance of Join-Idle-Queue, including potential performance pitfalls under high load.

preprint · 2016 · arXiv

Better bounds for coalescing-branching random walks

Coalescing-branching random walks, or cobra walks for short, are a natural variant of random walks on graphs that can model the spread of disease through contacts or the spread of information in networks. In a $k$-cobra walk, at each time step a subset of the vertices are active; each active vertex chooses $k$ random neighbors (sampled independently and uniformly with replacement) that become active at the next step, and these are the only active vertices at the next step. A natural quantity to study for cobra walks is the cover time, which corresponds to the expected time when all nodes have become infected or received the disseminated information. In this work, we extend previous results for cobra walks in multiple ways. We show that the cover time for the 2-cobra walk on $[0,n]^d$ is $O(n)$ (where the order notation hides constant factors that depend on $d$); previous work had shown the cover time was $O(n \cdot \operatorname{polylog}(n))$. We show that the cover time for a 2-cobra walk on an $n$-vertex $d$-regular graph with conductance $ϕ_G$ is $O(ϕ_G^{-2} \log^2 n)$, significantly generalizing a previous result that held only for expander graphs with sufficiently high expansion. And finally we show that the cover time for a 2-cobra walk on a graph with $n$ vertices is always $O(n^{11/4} \log n)$; this is the first result showing that the bound of $Θ(n^3)$ for the worst-case cover time for random walks can be beaten using 2-cobra walks.
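The process definition translates almost line for line into a simulation. The sketch below samples one run of a $k$-cobra walk and reports the step at which every vertex has been active at least once; the cover time in the abstract is the expectation of this quantity. The adjacency-dictionary representation and function name are assumptions for illustration.

```python
import random


def cobra_cover_steps(adj, k, start, rng):
    """Steps until every vertex has been active at least once (one sample).

    adj maps each vertex to a list of its neighbors in a connected graph.
    """
    active = {start}
    covered = {start}
    steps = 0
    while len(covered) < len(adj):
        next_active = set()
        for v in active:
            # Each active vertex picks k neighbors independently, with replacement.
            for _ in range(k):
                next_active.add(rng.choice(adj[v]))
        # Coalescing: duplicates merge; only the chosen vertices stay active.
        active = next_active
        covered |= active
        steps += 1
    return steps


if __name__ == "__main__":
    n = 32
    cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
    rng = random.Random(0)
    samples = [cobra_cover_steps(cycle, 2, 0, rng) for _ in range(200)]
    print(sum(samples) / len(samples))  # empirical estimate of the cover time
```

On a cycle the active set can advance by at most one vertex per step, so any single run needs at least $\lfloor n/2 \rfloor$ steps.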

preprint · 2016 · arXiv

Hardness of Peeling with Stashes

The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than k and their adjacent edges are removed until no vertices of degree less than k are left. Often the question is whether the remaining hypergraph, the k-core, is empty or not. In some settings, it may be possible to remove either vertices or edges from the hypergraph before peeling, at some cost. For example, in hashing applications where keys correspond to edges and buckets to vertices, one might use an additional side data structure, commonly referred to as a stash, to separately handle some keys in order to avoid collisions. The natural question in such cases is to find the minimum number of edges (or vertices) that need to be stashed in order to realize an empty k-core. We show that both these problems are NP-complete for all $k \geq 2$ on graphs and regular hypergraphs, with the sole exception being that the edge variant of stashing is solvable in polynomial time for $k = 2$ on standard (2-uniform) graphs.
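The peeling process described above can be written as a short worklist algorithm. The sketch below represents a hypergraph as a list of vertex tuples and returns the indices of the edges that survive, i.e. the edges of the $k$-core; the function name and representation are illustrative choices. It does not attempt the stashing optimization problem, which the paper shows to be NP-complete in general.

```python
from collections import defaultdict, deque


def k_core_edges(edges, k):
    """Peel vertices of degree < k; return sorted indices of surviving edges."""
    incident = defaultdict(set)
    for idx, edge in enumerate(edges):
        for v in set(edge):
            incident[v].add(idx)

    alive = set(range(len(edges)))
    removed = set()
    queue = deque(v for v in incident if len(incident[v]) < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        # Removing v also removes every edge still incident to it.
        for idx in list(incident[v]):
            if idx not in alive:
                continue
            alive.discard(idx)
            for u in set(edges[idx]):
                incident[u].discard(idx)
                if u not in removed and len(incident[u]) < k:
                    queue.append(u)
    return sorted(alive)


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(k_core_edges(triangle, 2))       # the triangle is its own 2-core
    # "Stashing" one edge before peeling leaves a path, which peels away fully.
    print(k_core_edges(triangle[:2], 2))
```

The second call illustrates the stash idea from the abstract: setting aside a single edge of the triangle is enough to make the 2-core empty.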

preprint · 2016 · arXiv

Models and Algorithms for Graph Watermarking

We introduce models and algorithmic foundations for graph watermarking. Our frameworks include security definitions and proofs, as well as characterizations when graph watermarking is algorithmically feasible, in spite of the fact that the general problem is NP-complete by simple reductions from the subgraph isomorphism or graph edit distance problems. In the digital watermarking of many types of files, an implicit step in the recovery of a watermark is the mapping of individual pieces of data, such as image pixels or movie frames, from one object to another. In graphs, this step corresponds to approximately matching vertices of one graph to another based on graph invariants such as vertex degree. Our approach is based on characterizing the feasibility of graph watermarking in terms of keygen, marking, and identification functions defined over graph families with known distributions. We demonstrate the strength of this approach with exemplary watermarking schemes for two random graph models, the classic Erdős-Rényi model and a random power-law graph model, both of which are used to model real-world networks.

preprint · 2016 · arXiv

Space Lower Bounds for Itemset Frequency Sketches

Given a database, computing the fraction of rows that contain a query itemset or determining whether this fraction is above some threshold are fundamental operations in data mining. A uniform sample of rows is a good sketch of the database in the sense that all sufficiently frequent itemsets and their approximate frequencies are recoverable from the sample, and the sketch size is independent of the number of rows in the original database. For many seemingly similar problems there are better sketching algorithms than uniform sampling. In this paper we show that for itemset frequency sketching this is not the case. That is, we prove that there exist classes of databases for which uniform sampling is a space optimal sketch for approximate itemset frequency analysis, up to constant or iterated-logarithmic factors.
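For readers unfamiliar with the baseline being discussed, uniform row sampling as an itemset frequency sketch can be illustrated in a few lines. This is only the baseline sketch, assuming sampling with replacement and rows stored as Python sets; the function names and the toy database are made up for illustration, and nothing here reflects the lower-bound argument itself.

```python
import random


def itemset_frequency(rows, itemset):
    """Fraction of rows that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for row in rows if items <= set(row)) / len(rows)


def uniform_sample_sketch(rows, sample_size, rng):
    """A uniform sample of rows, drawn with replacement."""
    return [rng.choice(rows) for _ in range(sample_size)]


if __name__ == "__main__":
    # Toy database: 30% of the rows contain both "a" and "b".
    database = [{"a", "b"}] * 300 + [{"a"}] * 700
    rng = random.Random(0)
    sketch = uniform_sample_sketch(database, 2000, rng)
    print(itemset_frequency(database, {"a", "b"}))  # exact frequency
    print(itemset_frequency(sketch, {"a", "b"}))    # estimate from the sketch
```

The sample size needed for a given accuracy depends on the error tolerance and not on the number of rows, which is the sense in which the abstract calls the sketch size independent of the database size.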

preprint · 2016 · arXiv

Voronoi Choice Games

We study novel variations of Voronoi games and associated random processes that we call Voronoi choice games. These games provide a rich framework for studying questions regarding the power of small numbers of choices in multi-player, competitive scenarios, and they further lead to many interesting, non-trivial random processes that appear worthy of study. As an example of the type of problem we study, suppose a group of $n$ miners are staking land claims through the following process: each miner has $m$ associated points independently and uniformly distributed on an underlying space, so the $k$th miner will have associated points $p_{k1},p_{k2},\ldots,p_{km}$. Each miner chooses one of these points as the base point for their claim. Each miner obtains mining rights for the area of the square that is closest to their chosen base, that is, they obtain the Voronoi cell corresponding to their chosen point in the Voronoi diagram of the $n$ chosen points. Each player's goal is simply to maximize the amount of land under their control. What can we say about the players' strategy and the equilibria of such games? In our main result, we derive bounds on the expected number of pure Nash equilibria for a variation of the 1-dimensional game on the circle where a player owns the arc starting from their point and moving clockwise to the next point. This result uses interesting properties of random arc lengths on circles, and demonstrates the challenges in analyzing these kinds of problems. We also provide several other related results. In particular, for the 1-dimensional game on the circle, we show that a pure Nash equilibrium always exists when each player owns the part of the circle nearest to their point, but it is NP-hard to determine whether a pure Nash equilibrium exists in the variant when each player owns the arc starting from their point clockwise to the next point.

preprint · 2015 · arXiv

Invertible Bloom Lookup Tables

We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of its contents with high probability, as long as the number of key-value pairs is below a designed threshold. Our structure allows the number of key-value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If later entries are deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight.
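A simplified sketch can show how such a structure supports listing. It assumes non-negative integer keys and values, cells holding a count plus XOR sums of keys and values, and one cell per key in each of several disjoint subtables; the class name, hash construction, and sizes are illustrative assumptions. This basic variant omits the checksums that the robust variations mentioned above would need, so it is only correct when every deletion matches an earlier insertion.

```python
import hashlib


class IBLT:
    """Minimal invertible Bloom lookup table for non-negative int keys/values."""

    def __init__(self, num_hashes=3, cells_per_hash=100):
        self.k = num_hashes
        self.m = cells_per_hash
        size = self.k * self.m
        self.count = [0] * size
        self.key_sum = [0] * size
        self.value_sum = [0] * size

    def _cells(self, key):
        # One cell in each of k disjoint subtables, so a key never hits
        # the same cell twice.
        cells = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            cells.append(i * self.m + int.from_bytes(digest[:8], "big") % self.m)
        return cells

    def _update(self, key, value, delta):
        for c in self._cells(key):
            self.count[c] += delta
            self.key_sum[c] ^= key
            self.value_sum[c] ^= value

    def insert(self, key, value):
        self._update(key, value, 1)

    def delete(self, key, value):
        self._update(key, value, -1)

    def get(self, key):
        """Return the value if it can be determined, else None."""
        for c in self._cells(key):
            if self.count[c] == 0:
                return None  # key is definitely absent
            if self.count[c] == 1:
                return self.value_sum[c] if self.key_sum[c] == key else None
        return None  # every cell is shared; lookup is inconclusive

    def list_entries(self):
        """Peel cells holding exactly one pair; works on a copy of the table."""
        count = list(self.count)
        keys = list(self.key_sum)
        values = list(self.value_sum)
        entries = {}
        pending = [c for c in range(len(count)) if count[c] == 1]
        while pending:
            c = pending.pop()
            if count[c] != 1:
                continue
            key, value = keys[c], values[c]
            entries[key] = value
            for j in self._cells(key):
                count[j] -= 1
                keys[j] ^= key
                values[j] ^= value
                if count[j] == 1:
                    pending.append(j)
        complete = all(n == 0 for n in count)
        return entries, complete
```

A typical use is to insert some pairs, delete a few, and then call `list_entries()`; the second return value reports whether peeling emptied the table, which corresponds to the "with high probability" listing guarantee when the load is below the threshold.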

preprint · 2015 · arXiv

More Analysis of Double Hashing for Balanced Allocations

With double hashing, for a key $x$, one generates two hash values $f(x)$ and $g(x)$, and then uses combinations $(f(x) +i g(x)) \bmod n$ for $i=0,1,2,...$ to generate multiple hash values in the range $[0,n-1]$ from the initial two. For balanced allocations, keys are hashed into a hash table where each bucket can hold multiple keys, and each key is placed in the least loaded of $d$ choices. It has been shown previously, using fluid limit methods, that asymptotically the performance of double hashing and fully random hashing is the same in the balanced allocation paradigm. Here we extend a coupling argument, used by Lueker and Molodowitch to show that double hashing and ideal uniform hashing are asymptotically equivalent in the setting of open address hash tables, to the balanced allocation setting, providing further insight into this phenomenon. We also discuss the potential for, and bottlenecks limiting, the use of this approach for other multiple choice hashing schemes.
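The formula $(f(x) + i g(x)) \bmod n$ can be sketched as follows. Two details are assumptions added here for the illustration and are not stated in the abstract: $n$ is taken to be prime and $g(x)$ is restricted to $[1, n-1]$, which together guarantee that the $d$ choices are distinct; and the two hash values are derived from SHA-256 with different salts. The function names are hypothetical.

```python
import hashlib


def _hash(key, salt, modulus):
    """Deterministic hash of `key` into [0, modulus - 1] (illustrative)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def double_hash_choices(key, d, n):
    """The d bucket choices (f + i*g) mod n for i = 0, ..., d-1."""
    f = _hash(key, "f", n)
    g = 1 + _hash(key, "g", n - 1)  # assumption: g in [1, n-1], never 0
    return [(f + i * g) % n for i in range(d)]


def place(key, loads, d):
    """Balanced allocation: put the key in the least loaded of its d choices."""
    choices = double_hash_choices(key, d, len(loads))
    best = min(choices, key=loads.__getitem__)
    loads[best] += 1
    return best


if __name__ == "__main__":
    n = 101  # prime, so the d choices are distinct whenever d <= n
    loads = [0] * n
    for key in range(1000):
        place(key, loads, d=4)
    print(max(loads), min(loads))
```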

preprint · 2015 · arXiv

Simple Multi-Party Set Reconciliation

As users migrate information to cloud storage, many distributed cloud-based services use multiple loosely consistent replicas of user information to avoid the high overhead of more tightly coupled synchronization. Periodically, the information must be synchronized, or reconciled. One can place this problem in the theoretical framework of set reconciliation: two parties $A_1$ and $A_2$ each hold a set of keys, named $S_1$ and $S_2$ respectively, and the goal is for both parties to obtain $S_1 \cup S_2$. Typically, set reconciliation is interesting algorithmically when sets are large but the set difference $|S_1-S_2|+|S_2-S_1|$ is small. In this setting the focus is on accomplishing reconciliation efficiently in terms of communication; ideally, the communication should depend on the size of the set difference, and not on the size of the sets. In this paper, we extend recent approaches using Invertible Bloom Lookup Tables (IBLTs) for set reconciliation to the multi-party setting. In this setting there are three or more parties $A_1,A_2,\ldots,A_n$ holding sets of keys $S_1,S_2,\ldots,S_n$ respectively, and the goal is for all parties to obtain $\cup_i S_i$. This could of course be done by pairwise reconciliations, but we seek more effective methods. Our methodology uses network coding techniques in conjunction with IBLTs, allowing efficiency in network utilization along with efficiency obtained by passing messages of size $O(|\cup_i S_i - \cap_i S_i|)$. Further, our approach can function even if the number of parties is not exactly known in advance, and in many cases can be used to determine which parties contain keys not in the joint union. By connecting reconciliation with network coding, we can allow for substantially more efficient reconciliation methods that apply to a number of natural distributed computing problems.

preprint2015arXiv

Theoretical Foundations of Equitability and the Maximal Information Coefficient

The maximal information coefficient (MIC) is a tool for finding the strongest pairwise relationships in a data set with many variables (Reshef et al., 2011). MIC is useful because it gives similar scores to equally noisy relationships of different types. This property, called {\em equitability}, is important for analyzing high-dimensional data sets. Here we formalize the theory behind both equitability and MIC in the language of estimation theory. This formalization has a number of advantages. First, it allows us to show that equitability is a generalization of power against statistical independence. Second, it allows us to compute and discuss the population value of MIC, which we call MIC_*. In doing so we generalize and strengthen the mathematical results proven in Reshef et al. (2011) and clarify the relationship between MIC and mutual information. Introducing MIC_* also enables us to reason about the properties of MIC more abstractly: for instance, we show that MIC_* is continuous and that there is a sense in which it is a canonical "smoothing" of mutual information. We also prove an alternate, equivalent characterization of MIC_* that we use to state new estimators of it as well as an algorithm for explicitly computing it when the joint probability density function of a pair of random variables is known. Our hope is that this paper provides a richer theoretical foundation for MIC and equitability going forward. This paper will be accompanied by a forthcoming companion paper that performs extensive empirical analysis and comparison to other methods and discusses the practical aspects of both equitability and the use of MIC and its related statistics.

preprint2014arXiv

A New Approach to Analyzing Robin Hood Hashing

Robin Hood hashing is a variation on open addressing hashing designed to reduce the maximum search time as well as the variance in the search time for elements in the hash table. While the case of insertions only using Robin Hood hashing is well understood, the behavior with deletions has remained open. Here we show that Robin Hood hashing can be analyzed under the framework of finite-level finite-dimensional jump Markov chains. This framework allows us to re-derive some past results for the insertion-only case with some new insight, as well as provide a new analysis for a standard deletion model, where we alternate between deleting a random old key and inserting a new one. In particular, we show that a simple but apparently unstudied approach for handling deletions with Robin Hood hashing offers good performance even under high loads.
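
As a rough illustration of the insertion rule that gives Robin Hood hashing its name, the sketch below uses linear probing and lets an incoming key displace any resident key that sits closer to its own home slot. This is the textbook insertion-only rule, not the deletion scheme analyzed in the paper; the function name `robin_hood_insert` and the `hash_fn` parameter are assumptions made for the example.

```python
def robin_hood_insert(table, key, hash_fn=hash):
    """Insert key into an open-addressing table (a list using None for empty slots).

    Hypothetical helper for illustration: linear probing with the Robin Hood rule.
    Returns False if the table has no free slot, True otherwise.
    """
    if None not in table:
        return False  # no free slot, so nothing is displaced
    n = len(table)
    pos = hash_fn(key) % n
    dist = 0  # distance of the key in hand from its home slot
    while True:
        resident = table[pos]
        if resident is None:
            table[pos] = key
            return True
        resident_dist = (pos - hash_fn(resident) % n) % n
        if resident_dist < dist:
            # The resident is closer to home than the key in hand: swap them
            # and keep probing on behalf of the displaced resident.
            table[pos], key = key, resident
            dist = resident_dist
        pos = (pos + 1) % n
        dist += 1
```

Swapping in this way evens out probe distances across keys, which is the mechanism behind the reduced variance in search time mentioned above.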

preprint2014arXiv

Balanced Allocations and Double Hashing

Double hashing has recently found more common usage in schemes that use multiple hash functions. In double hashing, for an item $x$, one generates two hash values $f(x)$ and $g(x)$, and then uses combinations $(f(x) +k g(x)) \bmod n$ for $k=0,1,2,...$ to generate multiple hash values from the initial two. We first perform an empirical study showing that, surprisingly, the performance difference between double hashing and fully random hashing appears negligible in the standard balanced allocation paradigm, where each item is placed in the least loaded of $d$ choices, as well as several related variants. We then provide theoretical results that explain the behavior of double hashing in this context.
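
The placement rule described above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the authors' experimental code: random draws stand in for the hash values $f(x)$ and $g(x)$, the table size `n` is assumed prime so that a nonzero stride yields distinct choices, and the names `double_hash_choices` and `place_items` are hypothetical.

```python
import random

def double_hash_choices(f, g, d, n):
    """Return the d bucket indices (f + k*g) mod n for k = 0, ..., d-1."""
    return [(f + k * g) % n for k in range(d)]

def place_items(num_items, n, d, rng, use_double_hashing=True):
    """Place items one at a time into the least loaded of d candidate buckets.

    Assumption: rng draws stand in for hash values of distinct keys.
    """
    loads = [0] * n
    for _ in range(num_items):
        if use_double_hashing:
            f = rng.randrange(n)
            g = rng.randrange(1, n)  # nonzero stride
            choices = double_hash_choices(f, g, d, n)
        else:
            # Fully random hashing: d independent uniform choices.
            choices = [rng.randrange(n) for _ in range(d)]
        target = min(choices, key=lambda b: loads[b])
        loads[target] += 1
    return loads
```

Running `place_items` with both settings of `use_double_hashing` and comparing `max(loads)` gives a quick, informal way to observe the kind of comparison the abstract describes.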

preprint2014arXiv

Coding for Random Projections and Approximate Near Neighbor Search

This technical note compares two coding (quantization) schemes for random projections in the context of sub-linear time approximate near neighbor search. The first scheme is based on uniform quantization while the second scheme utilizes a uniform quantization plus a uniformly random offset (which has been popular in practice). Prior work compared the two schemes in the context of similarity estimation and training linear classifiers, with the conclusion that the step of random offset is not necessary and may hurt the performance (depending on the similarity level). The task of near neighbor search is related to similarity estimation, but with important distinctions, and requires its own study. In this paper, we demonstrate that in the context of near neighbor search, the step of random offset is not needed either and may hurt the performance (sometimes significantly so, depending on the similarity and other parameters).
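
The two schemes being compared can be sketched as follows. The floor-based forms are the usual way such quantizers are written and are an assumption here, as are the names `quantize_uniform`, `quantize_with_offset`, and `random_projection`, the bin width `w`, and the offset `q`.

```python
import math
import random

def quantize_uniform(value, w):
    """Uniform quantization with bin width w: floor(value / w)."""
    return math.floor(value / w)

def quantize_with_offset(value, w, q):
    """Uniform quantization after shifting by an offset q drawn from [0, w)."""
    return math.floor((value + q) / w)

def random_projection(vector, num_projections, rng):
    """Project vector onto num_projections random Gaussian directions."""
    return [
        sum(rng.gauss(0.0, 1.0) * x for x in vector)
        for _ in range(num_projections)
    ]

# Usage: code a projected vector with and without the random offset.
rng = random.Random(1)
projected = random_projection([0.6, 0.8], 4, rng)
w = 0.75
plain_codes = [quantize_uniform(v, w) for v in projected]
offset_codes = [quantize_with_offset(v, w, rng.uniform(0.0, w)) for v in projected]
```

The only difference between the two coders is the shift by `q`, which is the step the note argues is unnecessary for near neighbor search.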

preprint2014arXiv

Multi-Party Set Reconciliation Using Characteristic Polynomials

In the standard set reconciliation problem, there are two parties $A_1$ and $A_2$, each respectively holding a set of elements $S_1$ and $S_2$. The goal is for both parties to obtain the union $S_1 \cup S_2$. In many distributed computing settings the sets may be large but the set difference $|S_1-S_2|+|S_2-S_1|$ is small. In these cases one aims to achieve reconciliation efficiently in terms of communication; ideally, the communication should depend on the size of the set difference, and not on the size of the sets. Recent work has considered generalizations of the reconciliation problem to multi-party settings, using a framework based on a specific type of linear sketch called an Invertible Bloom Lookup Table. Here, we consider multi-party set reconciliation using the alternative framework of characteristic polynomials, which have previously been used for efficient pairwise set reconciliation protocols, and compare their performance with Invertible Bloom Lookup Tables for these problems.
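
For readers unfamiliar with the characteristic-polynomial framework, the core identity it relies on is that common elements cancel in the ratio of two characteristic polynomials, so the ratio depends only on the set difference. The sketch below checks this numerically at one evaluation point; the prime modulus `P` and the function name `char_poly_eval` are assumptions chosen for the example, and the full protocol (interpolating and factoring the rational function) is not shown.

```python
P = 2_147_483_647  # the Mersenne prime 2^31 - 1, an illustrative field size

def char_poly_eval(keys, z, p=P):
    """Evaluate the characteristic polynomial prod_{x in keys} (z - x) mod p."""
    result = 1
    for x in keys:
        result = result * (z - x) % p
    return result

s1 = {1, 2, 3, 5}
s2 = {2, 3, 7}
z = 10

# chi_{S1}(z) / chi_{S2}(z) == chi_{S1 - S2}(z) / chi_{S2 - S1}(z),
# checked here in cross-multiplied form to avoid modular inverses.
lhs = char_poly_eval(s1, z) * char_poly_eval(s2 - s1, z) % P
rhs = char_poly_eval(s2, z) * char_poly_eval(s1 - s2, z) % P
```

Because only the difference survives in the ratio, the number of evaluation points a party must send scales with the size of the set difference rather than with the size of the sets.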

preprint2014arXiv

Parallel Peeling Algorithms

The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than k are removed until there are no vertices of degree less than k left. The remaining hypergraph is known as the k-core. In this paper, we analyze parallel peeling processes, where in each round, all vertices of degree less than k are removed. It is known that, below a specific edge density threshold, the k-core is empty with high probability. We show that, with high probability, below this threshold, only (log log n)/log((k-1)(r-1)) + O(1) rounds of peeling are needed to obtain the empty k-core for r-uniform hypergraphs. Interestingly, we show that above this threshold, Omega(log n) rounds of peeling are required to find the non-empty k-core. Since most algorithms and data structures aim to peel to an empty k-core, this asymmetry appears fortunate. We verify the theoretical results both with simulation and with a parallel implementation using graphics processing units (GPUs). Our implementation provides insights into how to structure parallel peeling algorithms for efficiency in practice.
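
The round structure of parallel peeling can be sketched sequentially as below. This is a plain simulation for counting rounds, not the GPU implementation the abstract mentions; the function name `parallel_peel` and the edge representation (a list of vertex tuples) are assumptions.

```python
def parallel_peel(num_vertices, edges, k):
    """Remove, round by round, every vertex of degree less than k.

    edges is a list of tuples of vertex ids (hyperedges). An edge is deleted
    as soon as any of its vertices is removed.
    Returns (rounds, remaining_vertices, remaining_edge_indices).
    """
    alive_vertices = set(range(num_vertices))
    alive_edges = set(range(len(edges)))
    rounds = 0
    while True:
        # Recompute degrees with respect to the surviving edges.
        degree = {v: 0 for v in alive_vertices}
        for e in alive_edges:
            for v in edges[e]:
                degree[v] += 1
        to_remove = {v for v in alive_vertices if degree[v] < k}
        if not to_remove:
            break  # what remains is the k-core (possibly empty)
        alive_vertices -= to_remove
        alive_edges = {
            e for e in alive_edges
            if not any(v in to_remove for v in edges[e])
        }
        rounds += 1
    return rounds, alive_vertices, alive_edges
```

For example, a path on four vertices peels to an empty 2-core in two rounds, while a triangle is its own 2-core and needs no rounds.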

preprint2014arXiv

Wear Minimization for Cuckoo Hashing: How Not to Throw a Lot of Eggs into One Basket

We study wear-leveling techniques for cuckoo hashing, showing that it is possible to achieve a memory wear bound of $\log\log n+O(1)$ after the insertion of $n$ items into a table of size $Cn$ for a suitable constant $C$ using cuckoo hashing. Moreover, we study our cuckoo hashing method empirically, showing that it significantly improves on the memory wear performance for classic cuckoo hashing and linear probing in practice.

preprint2013arXiv

Coding for Random Projections

The method of random projections has become very popular for large-scale applications in statistical learning, information retrieval, bio-informatics and other applications. Using a well-designed coding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed. In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that uniform quantization outperforms the standard existing influential method (Datar et al., 2004). Indeed, we argue that in many cases coding with just a small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM).

preprint2013arXiv

Equitability Analysis of the Maximal Information Coefficient, with Comparisons

A measure of dependence is said to be equitable if it gives similar scores to equally noisy relationships of different types. Equitability is important in data exploration when the goal is to identify a relatively small set of strongest associations within a dataset as opposed to finding as many non-zero associations as possible, which often are too many to sift through. Thus an equitable statistic, such as the maximal information coefficient (MIC), can be useful for analyzing high-dimensional data sets. Here, we explore both equitability and the properties of MIC, and discuss several aspects of the theory and practice of MIC. We begin by presenting an intuition behind the equitability of MIC through the exploration of the maximization and normalization steps in its definition. We then examine the speed and optimality of the approximation algorithm used to compute MIC, and suggest some directions for improving both. Finally, we demonstrate in a range of noise models and sample sizes that MIC is more equitable than natural alternatives, such as mutual information estimation and distance correlation.

preprint2012arXiv

An Economic Analysis of User-Privacy Options in Ad-Supported Services

We analyze the value to e-commerce website operators of offering privacy options to users, e.g., of allowing users to opt out of ad targeting. In particular, we assume that site operators have some control over the cost that a privacy option imposes on users and ask when it is to their advantage to make such costs low. We consider both the case of a single site and the case of multiple sites that compete both for users who value privacy highly and for users who value it less. One of our main results in the case of a single site is that, under normally distributed utilities, if a privacy-sensitive user is worth at least $\sqrt{2} - 1$ times as much to advertisers as a privacy-insensitive user, the site operator should strive to make the cost of a privacy option as low as possible. In the case of multiple sites, we show how a Prisoner's-Dilemma situation can arise: In the equilibrium in which both sites are obliged to offer a privacy option at minimal cost, both sites obtain lower revenue than they would if they colluded and neither offered a privacy option.

preprint2012arXiv

Anonymous Card Shuffling and its Applications to Parallel Mixnets

We study the question of how to shuffle $n$ cards when faced with an opponent who knows the initial position of all the cards {\em and} can track every card when permuted, {\em except} when one takes $K< n$ cards at a time and shuffles them in a private buffer "behind your back," which we call {\em buffer shuffling}. The problem arises naturally in the context of parallel mixnet servers as well as other security applications. Our analysis is based on related analyses of load-balancing processes. We include extensions to variations that involve corrupted servers and adversarially injected messages, which correspond to an opponent who can peek at some shuffles in the buffer and who can mark some number of the cards. In addition, our analysis makes novel use of a sum-of-squares metric for anonymity, which leads to improved performance bounds for parallel mixnets and can also be used to bound well-known existing anonymity measures.

preprint2012arXiv

Biff (Bloom Filter) Codes : Fast Error Correction for Large Data Sets

Large data sets are increasingly common in cloud and virtualized environments. For example, transfers of multiple gigabytes are commonplace, as are replicated blocks of such sizes. There is a need for fast error-correction or data reconciliation in such settings even when the expected number of errors is small. Motivated by such cloud reconciliation problems, we consider error-correction schemes designed for large data, after explaining why previous approaches appear unsuitable. We introduce Biff codes, which are based on Bloom filters and are designed for large data. For Biff codes with a message of length $L$ and $E$ errors, the encoding time is $O(L)$, decoding time is $O(L + E)$ and the space overhead is $O(E)$. Biff codes are low-density parity-check codes; they are similar to Tornado codes, but are designed for errors instead of erasures. Further, Biff codes are designed to be very simple, removing any explicit graph structures and based entirely on hash tables. We derive Biff codes by a simple reduction from a set reconciliation algorithm for a recently developed data structure, invertible Bloom lookup tables. While the underlying theory is extremely simple, what makes this code especially attractive is the ease with which it can be implemented and the speed of decoding. We present results from a prototype implementation that decodes messages of 1 million words with thousands of errors in well under a second.

preprint2012arXiv

Chernoff-Hoeffding Bounds for Markov Chains: Generalized and Simplified

We prove the first Chernoff-Hoeffding bounds for general nonreversible finite-state Markov chains based on the standard L_1 (variation distance) mixing-time of the chain. Specifically, consider an ergodic Markov chain M and a weight function f: [n] -> [0,1] on the state space [n] of M with mean mu = E_{v <- pi}[f(v)], where pi is the stationary distribution of M. A t-step random walk (v_1,...,v_t) on M starting from the stationary distribution pi has expected total weight E[X] = mu t, where X = sum_{i=1}^t f(v_i). Let T be the L_1 mixing-time of M. We show that the probability of X deviating from its mean by a multiplicative factor of delta, i.e., Pr [ |X - mu t| >= delta mu t ], is at most exp(-Omega(delta^2 mu t / T)) for 0 <= delta <= 1, and exp(-Omega(delta mu t / T)) for delta > 1. In fact, the bounds hold even if the weight functions f_i's for i in [t] are distinct, provided that all of them have the same mean mu. We also obtain a simplified proof for the Chernoff-Hoeffding bounds based on the spectral expansion lambda of M, which is the square root of the second largest eigenvalue (in absolute value) of M tilde{M}, where tilde{M} is the time-reversal Markov chain of M. We show that the probability Pr [ |X - mu t| >= delta mu t ] is at most exp(-Omega(delta^2 (1-lambda) mu t)) for 0 <= delta <= 1, and exp(-Omega(delta (1-lambda) mu t)) for delta > 1. Both of our results extend to continuous-time Markov chains, and to the case where the walk starts from an arbitrary distribution x, at a price of a multiplicative factor depending on the distribution x in the concentration bounds.
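
For reference, the two bounds stated in plain text above can be typeset as follows, using the same symbols as the abstract ($X$ the total weight of the walk, $\mu$ the mean weight, $t$ the number of steps, $T$ the $L_1$ mixing time, $\lambda$ the spectral expansion). This only restates the abstract's claims; constants inside the $\Omega(\cdot)$ are not specified there.

```latex
% Mixing-time version
\Pr\bigl[\,|X - \mu t| \ge \delta \mu t\,\bigr] \le
\begin{cases}
  \exp\bigl(-\Omega(\delta^{2} \mu t / T)\bigr), & 0 \le \delta \le 1, \\
  \exp\bigl(-\Omega(\delta \mu t / T)\bigr),     & \delta > 1.
\end{cases}

% Spectral-expansion version
\Pr\bigl[\,|X - \mu t| \ge \delta \mu t\,\bigr] \le
\begin{cases}
  \exp\bigl(-\Omega(\delta^{2} (1-\lambda) \mu t)\bigr), & 0 \le \delta \le 1, \\
  \exp\bigl(-\Omega(\delta (1-\lambda) \mu t)\bigr),     & \delta > 1.
\end{cases}
```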

preprint2012arXiv

Continuous Time Channels with Interference

Khanna and Sudan \cite{KS11} studied a natural model of continuous time channels where signals are corrupted by the effects of both noise and delay, and showed that, surprisingly, in some cases both are not enough to prevent such channels from achieving unbounded capacity. Inspired by their work, we consider channels that model continuous time communication with adversarial delay errors. The sender is allowed to subdivide time into an arbitrarily large number $M$ of micro-units in which binary symbols may be sent, but the symbols are subject to unpredictable delays and may interfere with each other. We model interference by having symbols that land in the same micro-unit of time be summed, and we study $k$-interference channels, which allow receivers to distinguish sums up to the value $k$. We consider both a channel adversary that has a limit on the maximum number of steps it can delay each symbol, and a more powerful adversary that only has a bound on the average delay. We give precise characterizations of the threshold between finite and infinite capacity depending on the interference behavior and on the type of channel adversary: for max-bounded delay, the threshold is at $D_{\text{max}} = \Theta(M \log(\min\{k, M\}))$, and for average bounded delay the threshold is at $D_{\text{avg}} = \Theta(\sqrt{M \cdot \min\{k, M\}})$.

preprint2012arXiv

Practical Verified Computation with Streaming Interactive Proofs

When delegating computation to a service provider, as in cloud computing, we seek some reassurance that the output is correct and complete. Yet recomputing the output as a check is inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore interested in proof systems which allow a service provider to prove the correctness of its output to a streaming (sublinear space) user, who cannot store the full input or perform the full computation herself. Our approach is two-fold. First, we describe a carefully chosen instantiation of one of the most efficient general-purpose constructions for arbitrary computations (streaming or otherwise), due to Goldwasser, Kalai, and Rothblum. This requires several new insights to make the methodology more practical. Our main contribution is in achieving a prover who runs in time O(S(n) log S(n)), where S(n) is the size of an arithmetic circuit computing the function of interest. Our experimental results demonstrate that a practical general-purpose protocol for verifiable computation may be significantly closer to reality than previously realized. Second, we describe techniques that achieve genuine scalability for protocols fine-tuned for specific important problems in streaming and database processing. Focusing in particular on non-interactive protocols for problems ranging from matrix-vector multiplication to bipartite perfect matching, we build on prior work to achieve a prover who runs in nearly linear-time, while obtaining optimal tradeoffs between communication cost and the user's working memory. Existing techniques required (substantially) superlinear time for the prover. We argue that even if general-purpose methods improve, fine-tuned protocols will remain valuable in real-world settings for key problems, and hence special attention to specific problems is warranted.

preprint2012arXiv

The Groupon Effect on Yelp Ratings: A Root Cause Analysis

Daily deals sites such as Groupon offer deeply discounted goods and services to tens of millions of customers through geographically targeted daily e-mail marketing campaigns. In our prior work we observed that a negative side effect for merchants using Groupons is that, on average, their Yelp ratings decline significantly. However, this previous work was essentially observational, rather than explanatory. In this work, we rigorously consider and evaluate various hypotheses about underlying consumer and merchant behavior in order to understand this phenomenon, which we dub the Groupon effect. We use statistical analysis and mathematical modeling, leveraging a dataset we collected spanning tens of thousands of daily deals and over 7 million Yelp reviews. In particular, we investigate hypotheses such as whether Groupon subscribers are more critical than their peers, or whether some fraction of Groupon merchants provide significantly worse service to customers using Groupons. We suggest an additional novel hypothesis: reviews from Groupon subscribers are lower on average because such reviews correspond to real, unbiased customers, while the body of reviews on Yelp contains some fraction of reviews from biased or even potentially fake sources. Although we focus on a specific question, our work provides broad insights into both consumer and merchant behavior within the daily deals marketplace.

preprint2012arXiv

Verifiable Computation with Massively Parallel Interactive Proofs

As the cloud computing paradigm has gained prominence, the need for verifiable computation has grown increasingly urgent. The concept of verifiable computation enables a weak client to outsource difficult computations to a powerful, but untrusted, server. Protocols for verifiable computation aim to provide the client with a guarantee that the server performed the requested computations correctly, without requiring the client to perform the computations herself. By design, these protocols impose a minimal computational burden on the client. However, existing protocols require the server to perform a large amount of extra bookkeeping in order to enable a client to easily verify the results. Verifiable computation has thus remained a theoretical curiosity, and protocols for it have not been implemented in real cloud computing systems. Our goal is to leverage GPUs to reduce the server-side slowdown for verifiable computation. To this end, we identify abundant data parallelism in a state-of-the-art general-purpose protocol for verifiable computation, originally due to Goldwasser, Kalai, and Rothblum, and recently extended by Cormode, Mitzenmacher, and Thaler. We implement this protocol on the GPU, obtaining 40-120x server-side speedups relative to a state-of-the-art sequential implementation. For benchmark problems, our implementation reduces the slowdown of the server to factors of 100-500x relative to the original computations requested by the client. Furthermore, we reduce the already small runtime of the client by 100x. Similarly, we obtain 20-50x server-side and client-side speedups for related protocols targeted at specific streaming problems. We believe our results demonstrate the immediate practicality of using GPUs for verifiable computation, and more generally that protocols for verifiable computation have become sufficiently mature to deploy in real cloud computing systems.

preprint2011arXiv

A Month in the Life of Groupon

Groupon has become the latest Internet sensation, providing daily deals to customers in the form of discount offers for restaurants, ticketed events, appliances, services, and other items. We undertake a study of the economics of daily deals on the web, based on a dataset we compiled by monitoring Groupon over several weeks. We use our dataset to characterize Groupon deal purchases, and to glean insights about Groupon's operational strategy. Our focus is on purchase incentives. For the primary purchase incentive, price, our regression model indicates that demand for coupons is relatively inelastic, allowing room for price-based revenue optimization. More interestingly, mining our dataset, we find evidence that Groupon customers are sensitive to other, "soft", incentives, e.g., deal scheduling and duration, deal featuring, and limited inventory. Our analysis points to the importance of considering incentives other than price in optimizing deal sites and similar systems.

preprint2011arXiv

Cuckoo Hashing with Pages

Although cuckoo hashing has significant applications in both theoretical and practical settings, a relevant downside is that it requires lookups to multiple locations. In many settings, where lookups are expensive, cuckoo hashing becomes a less compelling alternative. One such standard setting is when memory is arranged in large pages, and a major cost is the number of page accesses. We propose the study of cuckoo hashing with pages, advocating approaches where each key has several possible locations, or cells, on a single page, and additional choices on a second backup page. We show experimentally that with k cell choices on one page and a single backup cell choice, one can achieve nearly the same loads as when each key has k+1 random cells to choose from, with most lookups requiring just one page access, even when keys are placed online using a simple algorithm. While our results are currently experimental, they suggest several interesting new open theoretical questions for cuckoo hashing with pages.

preprint2011arXiv

Daily Deals: Prediction, Social Diffusion, and Reputational Ramifications

Daily deal sites have become the latest Internet sensation, providing discounted offers to customers for restaurants, ticketed events, services, and other items. We begin by undertaking a study of the economics of daily deals on the web, based on a dataset we compiled by monitoring Groupon and LivingSocial sales in 20 large cities over several months. We use this dataset to characterize deal purchases; glean insights about operational strategies of these firms; and evaluate customers' sensitivity to factors such as price, deal scheduling, and limited inventory. We then marry our daily deals dataset with additional datasets we compiled from Facebook and Yelp users to study the interplay between social networks and daily deal sites. First, by studying user activity on Facebook while a deal is running, we provide evidence that daily deal sites benefit from significant word-of-mouth effects during sales events, consistent with results predicted by cascade models. Second, we consider the effects of daily deals on the longer-term reputation of merchants, based on their Yelp reviews before and after they run a daily deal. Our analysis shows that while the number of reviews increases significantly due to daily deals, average rating scores from reviewers who mention daily deals are 10% lower than scores of their peers on average.

preprint2011arXiv

External-Memory Multimaps

Many data structures support dictionaries, also known as maps or associative arrays, which store and manage a set of key-value pairs. A \emph{multimap} is a generalization that allows multiple values to be associated with the same key. For example, the inverted file data structure that is used prevalently in the infrastructure supporting search engines is a type of multimap, where words are used as keys and document pointers are used as values. We study the multimap abstract data type and how it can be implemented efficiently online in external memory frameworks, with constant expected I/O performance. The key technique used to achieve our results is a combination of cuckoo hashing using buckets that hold multiple items with a multiqueue implementation to cope with varying numbers of values per key. Our external-memory results are for the standard two-level memory model.

preprint2011arXiv

Fully De-Amortized Cuckoo Hashing for Cache-Oblivious Dictionaries and Multimaps

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap is a key-value store that allows for multiple values to be associated with the same key. We design hashing-based indexing schemes for dictionaries and multimaps that achieve worst-case optimal performance for lookups and updates, with a small or negligible probability the data structure will require a rehash operation, depending on whether we are working in the external-memory (I/O) model or one of the well-known versions of the Random Access Machine (RAM) model. One of the main features of our constructions is that they are \emph{fully de-amortized}, meaning that their performance bounds hold without one having to tune their constructions with certain performance parameters, such as the constant factors in the exponents of failure probabilities or, in the case of the external-memory model, the size of blocks or cache lines and the size of internal memory (i.e., our external-memory algorithms are cache oblivious). Our solutions are based on a fully de-amortized implementation of cuckoo hashing, which may be of independent interest. This hashing scheme uses two cuckoo hash tables, one "nested" inside the other, with one serving as a primary structure and the other serving as an auxiliary supporting queue/stash structure that is super-sized with respect to traditional auxiliary structures but nevertheless adds negligible storage to our scheme. This auxiliary structure allows the success probability for cuckoo hashing to be very high, which is useful in cryptographic or data-intensive applications.

preprint2011arXiv

Hierarchical Heavy Hitters with the Space Saving Algorithm

The Hierarchical Heavy Hitters problem extends the notion of frequent items to data arranged in a hierarchy. This problem has applications to network traffic monitoring, anomaly detection, and DDoS detection. We present a new streaming approximation algorithm for computing Hierarchical Heavy Hitters that has several advantages over previous algorithms. It improves on the worst-case time and space bounds of earlier algorithms, is conceptually simple and substantially easier to implement, offers improved accuracy guarantees, is easily adapted to a distributed or parallel setting, and can be efficiently implemented in commodity hardware such as ternary content addressable memory (TCAMs). We present experimental results showing that for parameters of primary practical interest, our two-dimensional algorithm is superior to existing algorithms in terms of speed and accuracy, and competitive in terms of space, while our one-dimensional algorithm is also superior in terms of speed and accuracy for a more limited range of parameters.

preprint2011arXiv

Information Dissemination via Random Walks in d-Dimensional Space

We study a natural information dissemination problem for multiple mobile agents in a bounded Euclidean space. Agents are placed uniformly at random in the $d$-dimensional space $\{-n, ..., n\}^d$ at time zero, and one of the agents holds a piece of information to be disseminated. All the agents then perform independent random walks over the space, and the information is transmitted from one agent to another if the two agents are sufficiently close. We wish to bound the total time before all agents receive the information (with high probability). Our work extends Pettarin et al.'s work (Infectious random walks, arXiv:1007.1604v2, 2011), which solved the problem for $d \leq 2$. We present tight bounds up to polylogarithmic factors for the case $d = 3$. (While our results extend to higher dimensions, for space and readability considerations we provide only the case $d=3$ here.) Our results show the behavior when $d \geq 3$ is qualitatively different from the case $d \leq 2$. In particular, as the ratio between the volume of the space and the number of agents varies, we show an interesting phase transition for three dimensions that does not occur in one or two dimensions.

preprint2011arXiv

Local cluster aggregation models of explosive percolation

We introduce perhaps the simplest models of graph evolution with choice that demonstrate discontinuous percolation transitions and can be analyzed via mathematical evolution equations. These models are local, in the sense that at each step of the process one edge is selected from a small set of potential edges sharing common vertices and added to the graph. We show that the evolution can be accurately described by a system of differential equations and that such models exhibit the discontinuous emergence of the giant component. Yet, they also obey scaling behaviors characteristic of continuous transitions, with scaling exponents that differ from the classic Erdos-Renyi model.

preprint2011arXiv

Oblivious RAM Simulation with Efficient Worst-Case Access Overhead

Oblivious RAM simulation is a method for achieving confidentiality and privacy in cloud computing environments. It involves obscuring the access patterns to a remote storage so that the manager of that storage cannot infer information about its contents. Existing solutions typically involve small amortized overheads for achieving this goal, but nevertheless involve potentially huge variations in access times, depending on when they occur. In this paper, we show how to de-amortize oblivious RAM simulations, so that each access takes a worst-case bounded amount of time.

preprint2011arXiv

Oblivious Storage with Low I/O Overhead

We study oblivious storage (OS), a natural way to model privacy-preserving data outsourcing where a client, Alice, stores sensitive data at an honest-but-curious server, Bob. We show that Alice can hide both the content of her data and the pattern in which she accesses her data, with high probability, using a method that achieves O(1) amortized rounds of communication between her and Bob for each data access. We assume that Alice and Bob exchange small messages, of size $O(N^{1/c})$, for some constant $c\ge2$, in a single round, where $N$ is the size of the data set that Alice is storing with Bob. We also assume that Alice has a private memory of size $2N^{1/c}$. These assumptions model real-world cloud storage scenarios, where trade-offs occur between latency, bandwidth, and the size of the client's private memory.

preprint2011arXiv

On the Zero-Error Capacity Threshold for Deletion Channels

We consider the zero-error capacity of deletion channels. Specifically, we consider the setting where we choose a codebook ${\cal C}$ consisting of strings of $n$ bits, and our model of the channel corresponds to an adversary who may delete up to $pn$ of these bits for a constant $p$. Our goal is to decode correctly without error regardless of the actions of the adversary. We consider what values of $p$ allow non-zero capacity in this setting. We suggest multiple approaches, one of which makes use of the natural connection between this problem and the problem of finding the expected length of the longest common subsequence of two random sequences.
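
The connection to longest common subsequences can be made concrete with a small check. An adversary who deletes up to t bits can make two n-bit codewords indistinguishable exactly when they share a common subsequence of length at least n - t. The sketch below tests a codebook pairwise under that criterion; the function names `lcs_length` and `is_zero_error` are illustrative and do not come from the paper.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_zero_error(codebook, max_deletions: int) -> bool:
    """True if no two codewords can be confused after up to max_deletions deletions.

    Two length-n codewords are confusable iff their LCS has length >= n - max_deletions.
    """
    n = len(codebook[0])
    if any(len(word) != n for word in codebook):
        raise ValueError("all codewords must have the same length")
    threshold = n - max_deletions
    return all(lcs_length(x, y) < threshold for x, y in combinations(codebook, 2))
```

For example, the repetition code {0000, 1111} survives three deletions but not four, since deleting all four bits maps both codewords to the empty string. The pairwise check costs quadratic time per pair, so it is only suitable for small illustrative codebooks.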

preprint2011arXiv

Privacy-Preserving Access of Outsourced Data via Oblivious RAM Simulation

Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller--say, of size O(n^{1/r}), for some constant r > 1. Alice trusts Bob to maintain her data, but she would like to keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access patterns hidden from Bob. We describe schemes for the oblivious RAM simulation problem with a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability of success, while keeping the external storage to be of size O(n). To achieve this, our algorithmic contributions include a parallel MapReduce cuckoo-hashing algorithm and an external-memory data-oblivious sorting algorithm.

preprint2011arXiv

Privacy-Preserving Group Data Access via Stateless Oblivious RAM Simulation

We study the problem of providing privacy-preserving access to an outsourced honest-but-curious data repository for a group of trusted users. We show that such privacy-preserving data access is possible using a combination of probabilistic encryption, which directly hides data values, and stateless oblivious RAM simulation, which hides the pattern of data accesses. We give simulations that have only an $O(\log n)$ amortized time overhead for simulating a RAM algorithm, $\cal A$, that has a memory of size $n$, using a scheme that is data-oblivious with very high probability assuming the simulation has access to a private workspace of size $O(n^ν)$, for any given fixed constant $ν>0$. This simulation makes use of pseudorandom hash functions and is based on a novel hierarchy of cuckoo hash tables that all share a common stash. We also provide results from an experimental simulation of this scheme, showing its practicality. In addition, in a result that may be of some theoretical interest, we also show that one can eliminate the dependence on pseudorandom hash functions in our simulation while having the overhead rise to be $O(\log^2 n)$.

preprint2010arXiv

AMS Without 4-Wise Independence on Product Domains

In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that 4-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrive, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1 \pm ε)$ approximation (with probability $1-δ$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
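
A minimal sketch of the product-of-signs idea, under loud assumptions: it estimates only the second frequency moment of a stream over $[n]^k$, not the full distance-to-independence algorithm, and it draws each sign function as a fully random table (which is in particular 4-wise independent) rather than from a compact 4-wise independent family. It also averages repetitions instead of using the usual median-of-means. All names are hypothetical.

```python
import random


def make_sign_table(n: int, rng: random.Random):
    """A +/-1 value for each element of [n]. Fully random, hence 4-wise independent."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def product_sketch(stream, sign_tables) -> int:
    """Z = sum over updates (item, count) of count * prod_j h_j(item[j])."""
    z = 0
    for item, count in stream:
        sign = 1
        for table, coord in zip(sign_tables, item):
            sign *= table[coord]
        z += count * sign
    return z


def estimate_f2(stream, n: int, k: int, repetitions: int, seed: int = 0) -> float:
    """Average Z^2 over independent sketches; E[Z^2] equals the sum of squared frequencies."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repetitions):
        tables = [make_sign_table(n, rng) for _ in range(k)]
        total += product_sketch(stream, tables) ** 2
    return total / repetitions
```

Each sketch keeps a single counter, and two distinct tuples differ in at least one coordinate, so their cross terms cancel in expectation. The number of repetitions needed for a given accuracy grows with the variance of $Z^2$, which is where the $3^k$ factor in the abstract enters.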

preprint2010arXiv

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology.

preprint2010arXiv

Heapable Sequences and Subsequences

Let us call a sequence of numbers heapable if they can be sequentially inserted to form a binary tree with the heap property, where each insertion subsequent to the first occurs at a leaf of the tree, i.e. below a previously placed number. In this paper we consider a variety of problems related to heapable sequences and subsequences that do not appear to have been studied previously. Our motivation for introducing these concepts is two-fold. First, such problems correspond to natural extensions of the well-known secretary problem for hiring for an organization with a hierarchical structure. Second, from a purely combinatorial perspective, our problems are interesting variations on similar longest increasing subsequence problems, a problem paradigm that has led to many deep mathematical connections. We provide several basic results. We obtain an efficient algorithm for determining the heapability of a sequence, and also prove that the question of whether a sequence can be arranged in a complete binary heap is NP-hard. Regarding subsequences we show that, with high probability, the longest heapable subsequence of a random permutation of n numbers has length (1 - o(1)) n, and a subsequence of length (1 - o(1)) n can in fact be found online with high probability. We similarly show that for a random permutation a subsequence that yields a complete heap of size αn for a constant α can be found with high probability. Our work highlights the interesting structure underlying this class of subsequence problems, and we leave many further interesting variations open for future work.
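
One natural way to test heapability is a greedy rule: place each new number under the largest previously placed number that is no larger than it and still has a free child slot. The sketch below assumes a min-heap convention in which a child may equal its parent. It is an illustration of that greedy idea, not necessarily the algorithm as stated in the paper, and `is_heapable` is a hypothetical name.

```python
import bisect


def is_heapable(seq) -> bool:
    """Greedy heapability test for a binary min-heap (children >= parent).

    `slots` is a sorted list holding one entry per free child slot,
    each entry being the value of the node that owns the slot.
    """
    slots = []
    for i, x in enumerate(seq):
        if i > 0:
            # Number of free slots whose parent value is <= x.
            pos = bisect.bisect_right(slots, x)
            if pos == 0:
                return False  # no feasible parent for x
            # Attach x below the largest feasible parent.
            slots.pop(pos - 1)
        # The new node contributes two free child slots.
        bisect.insort(slots, x)
        bisect.insort(slots, x)
    return True
```

Choosing the largest feasible parent keeps the smaller slots free for later, smaller numbers. With a Python list the pops and insertions cost linear time each; a balanced search tree would bring the whole test to O(n log n).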

preprint2010arXiv

Information Asymmetries in Pay-Per-Bid Auctions: How Swoopo Makes Bank

Innovative auction methods can be exploited to increase profits, with Shubik's famous "dollar auction" perhaps being the most widely known example. Recently, some mainstream e-commerce web sites have apparently achieved the same end on a much broader scale, by using "pay-per-bid" auctions to sell items, from video games to bars of gold. In these auctions, bidders incur a cost for placing each bid in addition to (or sometimes in lieu of) the winner's final purchase cost. Thus even when a winner's purchase cost is a small fraction of the item's intrinsic value, the auctioneer can still profit handsomely from the bid fees. Our work provides novel analyses for these auctions, based on both modeling and datasets derived from auctions at Swoopo.com, the leading pay-per-bid auction site. While previous modeling work predicts profit-free equilibria, we analyze the impact of information asymmetry broadly, as well as Swoopo features such as bidpacks and the Swoop It Now option specifically, to quantify the effects of imperfect information in these auctions. We find that even small asymmetries across players (cheaper bids, better estimates of other players' intent, different valuations of items, committed players willing to play "chicken") can increase the auction duration well beyond that predicted by previous work and thus skew the auctioneer's profit disproportionately. Finally, we discuss our findings in the context of a dataset of thousands of live auctions we observed on Swoopo, which enables us also to examine behavioral factors, such as the power of aggressive bidding. Ultimately, our findings show that even with fully rational players, if players overlook or are unaware of any of these factors, the result is outsized profits for pay-per-bid auctioneers.

preprint2010arXiv

Streaming Graph Computations with a Helpful Advisor

Motivated by the trend to outsource work to commercial cloud computing services, we consider a variation of the streaming paradigm where a streaming algorithm can be assisted by a powerful helper that can provide annotations to the data stream. We extend previous work on such annotation models by considering a number of graph streaming problems. Without annotations, streaming algorithms for graph problems generally require significant memory; we show that for many standard problems, including all graph problems that can be expressed with totally unimodular integer programming formulations, only a constant number of hash values are needed for single-pass algorithms given linear-sized annotations. We also obtain a protocol achieving optimal tradeoffs between annotation length and memory usage for matrix-vector multiplication; this result contributes to a trend of recent research on numerical linear algebra in streaming models.

preprint2010arXiv

Tight Thresholds for Cuckoo Hashing via XORSAT

We settle the question of tight thresholds for offline cuckoo hashing. The problem can be stated as follows: we have n keys to be hashed into m buckets each capable of holding a single key. Each key has k >= 3 (distinct) associated buckets chosen uniformly at random and independently of the choices of other keys. A hash table can be constructed successfully if each key can be placed into one of its buckets. We seek thresholds alpha_k such that, as n goes to infinity, if n/m <= alpha for some alpha < alpha_k then a hash table can be constructed successfully with high probability, and if n/m >= alpha for some alpha > alpha_k a hash table cannot be constructed successfully with high probability. Here we are considering the offline version of the problem, where all keys and hash values are given, so the problem is equivalent to previous models of multiple-choice hashing. We find the thresholds for all values of k > 2 by showing that they are in fact the same as the previously known thresholds for the random k-XORSAT problem. We then extend these results to the setting where keys can have differing number of choices, and provide evidence in the form of an algorithm for a conjecture extending this result to cuckoo hash tables that store multiple keys in a bucket.
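
In the offline setting with one key per bucket, deciding whether a table can be built is a bipartite matching problem between keys and buckets. The sketch below decides it with a standard augmenting-path search. It only illustrates the problem statement, not the threshold analysis, and `can_place_all` is a hypothetical name.

```python
def can_place_all(choices, num_buckets: int):
    """Offline cuckoo placement as bipartite matching.

    choices[i] lists the candidate buckets of key i. Returns a list mapping
    each key to its bucket, or None if no valid placement exists.
    """
    owner = [None] * num_buckets  # owner[b] is the key currently stored in bucket b

    def try_place(key, visited) -> bool:
        for b in choices[key]:
            if b in visited:
                continue
            visited.add(b)
            # Take a free bucket, or evict the occupant if it can move elsewhere.
            if owner[b] is None or try_place(owner[b], visited):
                owner[b] = key
                return True
        return False

    for key in range(len(choices)):
        if not try_place(key, set()):
            return None

    placement = [None] * len(choices)
    for b, key in enumerate(owner):
        if key is not None:
            placement[key] = b
    return placement
```

The eviction step mirrors how cuckoo hashing moves keys, although here it runs with full knowledge of all keys. The recursion depth can reach the number of keys, so an iterative search would be preferable for large inputs.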

preprint2008arXiv

Network coding meets TCP

We propose a mechanism that incorporates network coding into TCP with only minor changes to the protocol stack, thereby allowing incremental deployment. In our scheme, the source transmits random linear combinations of packets currently in the congestion window. At the heart of our scheme is a new interpretation of ACKs - the sink acknowledges every degree of freedom (i.e., a linear combination that reveals one unit of new information) even if it does not reveal an original packet immediately. Such ACKs enable a TCP-like sliding-window approach to network coding. Our scheme has the nice property that packet losses are essentially masked from the congestion control algorithm. Our algorithm therefore reacts to packet drops in a smooth manner, resulting in a novel and effective approach for congestion control over networks involving lossy links such as wireless links. Our experiments show that our algorithm achieves higher throughput compared to TCP in the presence of lossy wireless links. We also establish the soundness and fairness properties of our algorithm.
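
The ACK rule can be illustrated by tracking the rank of the received coefficient vectors: a combination earns an ACK only if it is linearly independent of everything received so far. The sketch below is a simplification that assumes coefficients over GF(2), encoded as integer bitmasks with bit i set when packet i is included. The abstract does not state the field, and the class name is hypothetical.

```python
class DegreeOfFreedomTracker:
    """Tracks which received linear combinations are innovative (GF(2) assumed)."""

    def __init__(self):
        # Maps the index of a leading bit to a basis vector with that leading bit.
        self.basis = {}

    def receive(self, coeffs: int) -> bool:
        """Return True if the combination adds a degree of freedom (would be ACKed)."""
        v = coeffs
        while v:
            lead = v.bit_length() - 1
            if lead not in self.basis:
                self.basis[lead] = v
                return True
            # Cancel the leading bit and keep reducing.
            v ^= self.basis[lead]
        return False  # linearly dependent on earlier combinations

    @property
    def rank(self) -> int:
        """Number of degrees of freedom received so far."""
        return len(self.basis)
```

For example, after receiving p0+p1 and p1+p2, the combination p0+p2 is their sum and carries no new information, while p0 alone raises the rank to three, at which point all three original packets can be decoded.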

61 published item(s)

Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

The Mixed Birth-death/death-Birth Moran Process

Analyzing Generalized Pólya Urn Models using Martingales, with an Application to Viral Evolution

EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning

Incentive Compatible Queues Without Money

Proteus: A Self-Designing Range Filter

The Supermarket Model with Known and Predicted Service Times

Uniform Bounds for Scheduling with Job Size Estimates

Dynamic Longest Increasing Subsequence and the Erdös-Szekeres Partitioning Problem

SALSA: Self-Adjusting Lean Streaming Analytics

Algorithms with Predictions

Faster and More Accurate Measurement through Additive-Error Counters

PINT: Probabilistic In-band Network Telemetry

Queues with Small Advice

2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search

Analyzing Distributed Join-Idle-Queue: A Fluid Limit Approach

Better bounds for coalescing-branching random walks

Hardness of Peeling with Stashes

Models and Algorithms for Graph Watermarking

Space Lower Bounds for Itemset Frequency Sketches

Voronoi Choice Games

Invertible Bloom Lookup Tables

More Analysis of Double Hashing for Balanced Allocations

Simple Multi-Party Set Reconciliation

Theoretical Foundations of Equitability and the Maximal Information Coefficient

A New Approach to Analyzing Robin Hood Hashing

Balanced Allocations and Double Hashing

Coding for Random Projections and Approximate Near Neighbor Search

Multi-Party Set Reconciliation Using Characteristic Polynomials

Parallel Peeling Algorithms

Wear Minimization for Cuckoo Hashing: How Not to Throw a Lot of Eggs into One Basket

Coding for Random Projections

Equitability Analysis of the Maximal Information Coefficient, with Comparisons

An Economic Analysis of User-Privacy Options in Ad-Supported Services

Anonymous Card Shuffling and its Applications to Parallel Mixnets

Biff (Bloom Filter) Codes : Fast Error Correction for Large Data Sets

Chernoff-Hoeffding Bounds for Markov Chains: Generalized and Simplified

Continuous Time Channels with Interference

Practical Verified Computation with Streaming Interactive Proofs

The Groupon Effect on Yelp Ratings: A Root Cause Analysis

Verifiable Computation with Massively Parallel Interactive Proofs

A Month in the Life of Groupon

Cuckoo Hashing with Pages

Daily Deals: Prediction, Social Diffusion, and Reputational Ramifications

External-Memory Multimaps

Fully De-Amortized Cuckoo Hashing for Cache-Oblivious Dictionaries and Multimaps

Hierarchical Heavy Hitters with the Space Saving Algorithm

Information Dissemination via Random Walks in d-Dimensional Space

Local cluster aggregation models of explosive percolation

Oblivious RAM Simulation with Efficient Worst-Case Access Overhead

Oblivious Storage with Low I/O Overhead

On the Zero-Error Capacity Threshold for Deletion Channels

Privacy-Preserving Access of Outsourced Data via Oblivious RAM Simulation

Privacy-Preserving Group Data Access via Stateless Oblivious RAM Simulation

AMS Without 4-Wise Independence on Product Domains

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Heapable Sequences and Subsequences

Information Asymmetries in Pay-Per-Bid Auctions: How Swoopo Makes Bank

Streaming Graph Computations with a Helpful Advisor

Tight Thresholds for Cuckoo Hashing via XORSAT

Network coding meets TCP