Source author record

Edith Cohen

Edith Cohen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms; Databases; Machine Learning; Social and Information Networks; math.ST; Statistics Theory; Networking and Internet Architecture; Computer Science and Game Theory; Computation; Cryptography and Security; Distributed, Parallel, and Cluster Computing; Information Retrieval

Catalog footprint: 26 works, 12 topics, 4 close collaborators

preprint, 2026, arXiv

Stochastic Matching via Local Sparsification

The classic online stochastic matching problem typically requires immediate and irrevocable matching decisions. However, in many modern decentralized systems such as real-time ride-hailing and distributed cloud computing, the primary bottleneck is often local communication bandwidth rather than the timing of the match itself. We formalize this challenge by introducing a two-stage local sparsification framework. In this setting, arriving requests must prune their realized compatibility sets to a strict budget of $k$ edges before a central coordinator optimizes the global matching. This creates a "middle ground" between local information constraints and global optimization utility. We propose a local selection strategy, parametrized by a fractional solution of the expected instance. Theoretically, we quantify the approximation ratio as a function of the solution's spread. We prove that under sufficient spread, our sparsifier globally preserves the expected size of the maximum matching. Empirically, we demonstrate the robustness of our approach using the New York City ride-hailing datasets and adversarial synthetic benchmarks. Our results show that near-optimal global matching is achievable even with highly constrained local budgets, significantly outperforming standard online baselines.

preprint, 2022, arXiv

On the Robustness of CountSketch to Adaptive Inputs

CountSketch is a popular dimensionality reduction technique that maps vectors to a lower dimension using randomized linear measurements. The sketch supports recovering $\ell_2$-heavy hitters of a vector (entries with $v[i]^2 \geq \frac{1}{k}\|\boldsymbol{v}\|^2_2$). We study the robustness of the sketch in adaptive settings where input vectors may depend on the output from prior inputs. Adaptive settings arise in processes with feedback or with adversarial attacks. We show that the classic estimator is not robust, and can be attacked with a number of queries of the order of the sketch size. We propose a robust estimator (for a slightly modified sketch) that allows for a quadratic number of queries in the sketch size, which is an improvement factor of $\sqrt{k}$ (for $k$ heavy hitters) over prior work.
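
For readers unfamiliar with the data structure, the following is a minimal Python sketch of a CountSketch with the classic median estimator mentioned in the abstract. It is illustrative only: the class name, the use of `hashlib.blake2b` for hashing, and the parameter names `rows`, `width` and `seed` are assumptions of this example, and it implements neither the attack nor the robust estimator from the paper.

```python
import hashlib
import statistics


class CountSketch:
    """Toy CountSketch: `rows` independent rows of `width` signed counters."""

    def __init__(self, rows: int, width: int, seed: int = 0):
        self.rows, self.width, self.seed = rows, width, seed
        self.table = [[0.0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row: int, key: str):
        # One 64-bit hash per (row, key): lowest bit gives the sign,
        # the remaining bits give the bucket. (Hashing choice is an assumption.)
        digest = hashlib.blake2b(
            f"{self.seed}:{row}:{key}".encode(), digest_size=8
        ).digest()
        value = int.from_bytes(digest, "big")
        sign = 1 if value & 1 else -1
        return (value >> 1) % self.width, sign

    def update(self, key: str, delta: float) -> None:
        # Linear update: add the signed increment to one counter per row.
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, key)
            self.table[row][bucket] += sign * delta

    def estimate(self, key: str) -> float:
        # Classic estimator: median over rows of the signed counter of the key.
        values = []
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, key)
            values.append(sign * self.table[row][bucket])
        return statistics.median(values)


sketch = CountSketch(rows=5, width=64)
sketch.update("heavy", 100.0)
for i in range(20):
    sketch.update(f"light{i}", 1.0)
heavy_estimate = sketch.estimate("heavy")
```

Because every row's error is a signed sum of colliding light keys, the estimate for `"heavy"` here can differ from 100 by at most the total light weight (20). The adaptive setting studied in the paper concerns what happens when later inputs are chosen after seeing such estimates.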

preprint, 2022, arXiv

Tricking the Hashing Trick: A Tight Lower Bound on the Robustness of CountSketch to Adaptive Inputs

CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of $\ell_2$-heavy hitters (keys $i$ where $v_i^2 > ε\|\boldsymbol{v}\|_2^2$) and approximate inner products. When the inputs are not adaptive (do not depend on prior outputs), classic estimators applied to a sketch of size $O(\ell/ε)$ are accurate for a number of queries that is exponential in $\ell$. When inputs are adaptive, however, an adversarial input can be constructed after $O(\ell)$ queries with the classic estimator, and the best known robust estimator supports only $\tilde{O}(\ell^2)$ queries. In this work we show that this quadratic dependence is in a sense inherent: We design an attack that after $O(\ell^2)$ queries produces an adversarial input vector whose sketch is highly biased. Our attack uses "natural" non-adaptive inputs (only the final adversarial input is chosen adaptively) and universally applies with any correct estimator, including one that is unknown to the attacker. In doing so, we expose an inherent vulnerability of this fundamental method.

preprint, 2020, arXiv

Graph Learning with Loss-Guided Training

Classically, ML models trained with stochastic gradient descent (SGD) are designed to minimize the average loss per example and use a distribution of training examples that remains static in the course of training. Research in recent years demonstrated, empirically and theoretically, that significant acceleration is possible by methods that dynamically adjust the training distribution in the course of training so that training is more focused on examples with higher loss. We explore loss-guided training in a new domain of node embedding methods pioneered by DeepWalk. These methods work with an implicit and large set of positive training examples that are generated using random walks on the input graph and therefore are not amenable to typical example selection methods. We propose computationally efficient methods that allow for loss-guided training in this framework. Our empirical evaluation on a rich collection of datasets shows significant acceleration over the baseline static methods, both in terms of total training performed and overall computation.

preprint, 2020, arXiv

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.
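
As a point of reference for what WOR $\ell_p$ sampling means, the snippet below is a small Python sketch of bottom-$k$ sampling with exponentially distributed ranks over already aggregated frequencies. It is not the composable sketch from the paper (which works on unaggregated, possibly signed updates); the function name `ppswor_sample` and its parameters are assumptions made for illustration.

```python
import math
import random


def ppswor_sample(freqs, k, p=1.0, seed=0):
    """Sample up to k keys without replacement, weighted by |frequency|**p.

    Each key gets rank = Exp(1) / weight; the k smallest ranks form the sample.
    Assumes `freqs` maps keys to aggregated (possibly signed) frequencies.
    """
    rng = random.Random(seed)
    ranked = []
    for key, f in freqs.items():
        if f == 0:
            continue  # keys with zero frequency are never sampled
        weight = abs(f) ** p
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        ranked.append((-math.log(u) / weight, key))
    ranked.sort()
    return [key for _, key in ranked[:k]]


freqs = {"a": 100, "b": 1, "c": 1, "d": 50, "e": 0}
sample = ppswor_sample(freqs, k=2, p=1.0)
```

The key with the smallest rank is selected with probability proportional to its weight, and the same holds for each subsequent pick among the remaining keys, which is what makes the result a without-replacement sample.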

preprint, 2016, arXiv

Distance-Based Influence in Networks: Computation and Maximization

A premise at the heart of network analysis is that entities in a network derive utilities from their connections. The influence of a seed set $S$ of nodes is defined as the sum over nodes $u$ of the utility of $S$ to $u$. Distance-based utility, which is a decreasing function of the distance from $S$ to $u$, was explored in several successful research threads from social network analysis and economics: network formation games [Bloch and Jackson 2007], reachability-based influence [Richardson and Domingos 2002, Kempe et al. 2003], "threshold" influence [Gomez-Rodriguez et al. 2011], and closeness centrality [Bavelas 1948]. We formulate a model that unifies and extends this previous work and address the two fundamental computational problems in this domain: influence oracles and influence maximization (IM). An oracle performs some preprocessing, after which influence queries for arbitrary seed sets can be efficiently computed. With IM, we seek a set of nodes of a given size with maximum influence. Since the IM problem is computationally hard, we instead seek a greedy sequence of nodes, with each prefix having influence that is at least $1-1/e$ of that of the optimal seed set of the same size. We present the first highly scalable algorithms for both problems, providing statistical guarantees on approximation quality and near-linear worst-case bounds on the computation. We perform an experimental evaluation which demonstrates the effectiveness of our designs on networks with hundreds of millions of edges.

preprint, 2016, arXiv

Greedy Maximization Framework for Graph-based Influence Functions

The study of graph-based submodular maximization problems was initiated in a seminal work of Kempe, Kleinberg, and Tardos (2003): An influence function of subsets of nodes is defined by the graph structure and the aim is to find subsets of seed nodes with (approximately) optimal tradeoff of size and influence. Applications include viral marketing, monitoring, and active learning of node labels. This powerful formulation was studied for (generalized) coverage functions, where the influence of a seed set on a node is the maximum utility of a seed item to the node, and for pairwise utility based on reachability, distances, or reverse ranks. We define a rich class of influence functions which unifies and extends previous work beyond coverage functions and specific utility functions. We present a meta-algorithm for approximate greedy maximization with strong approximation quality guarantees and worst-case near-linear computation for all functions in our class. Our meta-algorithm generalizes a recent design by Cohen et al. (2014) that was specific to distance-based coverage functions.

preprint, 2016, arXiv

Reverse Ranking by Graph Structure: Model and Scalable Algorithms

Distances in a network capture relations between nodes and are the basis of centrality, similarity, and influence measures. Often, however, the relevance of a node $u$ to a node $v$ is more precisely measured not by the magnitude of the distance, but by the number of nodes that are closer to $v$ than $u$. That is, by the rank of $u$ in an ordering of nodes by increasing distance from $v$. We identify and address fundamental challenges in rank-based graph mining. We first consider single-source computation of reverse-ranks and design a "Dijkstra-like" algorithm which computes nodes in order of increasing approximate reverse rank while only traversing edges adjacent to returned nodes. We then define reverse-rank influence, which naturally extends reverse nearest neighbors influence [Korn and Muthukrishnan 2000] and builds on a well studied distance-based influence. We present near-linear algorithms for greedy approximate reverse-rank influence maximization. The design relies on our single-source algorithm. Our algorithms utilize near-linear preprocessing of the network to compute all-distance sketches. As a contribution of independent interest, we present a novel algorithm for computing these sketches, which have many other applications, on multi-core architectures. We complement our algorithms by establishing the hardness of computing exact reverse-ranks for a single source and exact reverse-rank influence. This implies that when using near-linear algorithms, the small relative errors we obtain are the best we can currently hope for. Finally, we conduct an experimental evaluation on graphs with tens of millions of edges, demonstrating both scalability and accuracy.

preprint, 2015, arXiv

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

Graph datasets with billions of edges, such as social and Web graphs, are prevalent, and scalable computation is critical. All-distances sketches (ADS) [Cohen 1997] are a powerful tool for scalable approximation of statistics. The sketch is a small sample of the distance relation of a node which emphasizes closer nodes. Sketches for all nodes are computed using a nearly linear computation and estimators are applied to sketches of nodes to estimate their properties. We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of statistics. For the important special cases of neighborhood cardinalities (the number of nodes within some query distance) and closeness centralities, HIP estimators have at most half the variance of previous estimators and we show that this is essentially optimal. Moreover, HIP obtains a polynomial improvement for more general statistics and the estimators are simple, flexible, unbiased, and elegant. For approximate distinct counting on data streams, HIP outperforms the original estimators for the HyperLogLog MinHash sketches (Flajolet et al. 2007), obtaining significantly improved estimation quality for this state-of-the-art practical algorithm.
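
To make the inverse-probability idea concrete for the distinct-counting case, here is a hedged Python sketch of a HIP-style counter on a bottom-$k$ MinHash sketch: whenever a new element changes the sketch, the counter is increased by the inverse of the probability that a fresh element would have changed it. The class and helper names and the choice of `hashlib.blake2b` are assumptions of this example; the paper's HyperLogLog variant is not shown.

```python
import hashlib
import heapq


def _unit_hash(key: str) -> float:
    # Map a key to a pseudo-random value in [0, 1). (Hash choice is an assumption.)
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class HipDistinctCounter:
    """Bottom-k sketch with an inverse-probability running estimate."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []        # negated hashes: max-heap over the k smallest hashes
        self._members = set()  # hashes currently stored in the sketch
        self.estimate = 0.0

    def add(self, key: str) -> None:
        h = _unit_hash(key)
        # tau: probability that a previously unseen key would modify the sketch.
        tau = 1.0 if len(self._heap) < self.k else -self._heap[0]
        if h >= tau or h in self._members:
            return  # the sketch is unchanged, so the estimate is unchanged
        self.estimate += 1.0 / tau
        self._members.add(h)
        heapq.heappush(self._heap, -h)
        if len(self._heap) > self.k:
            self._members.discard(-heapq.heappop(self._heap))


small = HipDistinctCounter(k=16)
for _ in range(2):                 # repeated keys do not change the estimate
    for i in range(10):
        small.add(f"user{i}")

large = HipDistinctCounter(k=256)
for i in range(5000):
    large.add(f"key{i}")
```

While fewer than `k` distinct keys have been seen, `tau` is 1 and the count is exact; beyond that the estimate is randomized but unbiased over the choice of hash function.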

preprint, 2015, arXiv

Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees

The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present highly practical algorithms with strong statistical guarantees for fundamental problems. We show that the average distance (and hence the centrality) for all nodes in a graph can be estimated using $O(ε^{-2})$ single-source distance computations. For a set $V$ of $n$ points in a metric space, we show that after preprocessing which uses $O(n)$ distance computations we can compute a weighted sample $S\subset V$ of size $O(ε^{-2})$ such that the average distance from any query point $v$ to $V$ can be estimated from the distances from $v$ to $S$. Finally, we show that for a set of points $V$ in a metric space, we can estimate the average pairwise distance using $O(n+ε^{-2})$ distance computations. The estimate is based on a weighted sample of $O(ε^{-2})$ pairs of points, which is computed using $O(n)$ distance computations. Our estimates are unbiased with normalized root mean square error (NRMSE) of at most $ε$. Increasing the sample size by an $O(\log n)$ factor ensures that the probability that the relative error exceeds $ε$ is polynomially small.

preprint, 2015, arXiv

Stream Sampling for Frequency Cap Statistics

Unaggregated data, in streamed or distributed form, is prevalent and comes from diverse application domains which include interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are distinct, which is the number of active keys in a specified segment, and sum, which is the sum of the frequencies of keys in the segment. Both are special cases of cap statistics, defined as the sum of frequencies capped by a parameter $T$, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys, and therefore we are interested in estimating these statistics or more generally, sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our $\ell$-capped samples provide estimates with tight statistical guarantees for cap statistics with $T=Θ(\ell)$ and nonnegative unbiased estimates of any monotone non-decreasing frequency statistics. An added benefit of our unified design is facilitating multi-objective samples, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample.
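
The definition of a cap statistic can be checked on a toy stream. The Python snippet below computes the exact value by aggregating per key, which is precisely the costly step the paper's sampling framework avoids; it is included only to show that distinct and sum are the special cases $T=1$ and $T=\infty$. The function name `cap_statistic` and the example stream are assumptions of this illustration.

```python
from collections import Counter


def cap_statistic(elements, cap):
    """Exact cap statistic: sum over keys of min(cap, frequency of the key)."""
    freq = Counter(elements)
    return sum(min(cap, f) for f in freq.values())


# Each stream element is identified by its key; frequencies: u1 -> 3, u2 -> 2, u3 -> 1.
stream = ["u1", "u2", "u1", "u3", "u1", "u2"]

distinct = cap_statistic(stream, 1)             # T = 1 counts active keys
capped = cap_statistic(stream, 2)               # each key contributes at most 2
total = cap_statistic(stream, float("inf"))     # T = infinity is the plain sum
```

On this stream the three values are 3, 5 and 6 respectively.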

preprint, 2014, arXiv

Computing Classic Closeness Centrality, at Scale

Closeness centrality, first considered by Bavelas (1948), is an importance measure of a node in a network which is based on the distances from the node to all other nodes. The classic definition, proposed by Bavelas (1950), Beauchamp (1965), and Sabidussi (1966), is (the inverse of) the average distance to all other nodes. We propose the first highly scalable (near linear-time processing and linear space overhead) algorithm for estimating, within a small relative error, the classic closeness centralities of all nodes in the graph. Our algorithm applies to undirected graphs, as well as to centrality computed with respect to round-trip distances in directed graphs. For directed graphs, we also propose an efficient algorithm that approximates generalizations of classic closeness centrality to outbound and inbound centralities. Although it does not provide worst-case theoretical approximation guarantees, it is designed to perform well on real networks. We perform extensive experiments on large networks, demonstrating high scalability and accuracy.
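
The classic definition can be illustrated with an exact computation for a single node. The Python sketch below runs one breadth-first search on an unweighted, undirected, connected graph and returns the inverse of the average distance; running it for every node is the quadratic baseline that the paper's estimator is designed to avoid. The function name and the adjacency-list input format are assumptions of this example.

```python
from collections import deque


def closeness_centrality(adj, source):
    """Inverse of the average hop distance from `source` to all other reachable nodes.

    `adj` maps each node to an iterable of neighbours (unweighted graph assumed).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Path graph a - b - c: the middle node is the most central.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
centre = closeness_centrality(path, "b")   # average distance 1
end = closeness_centrality(path, "a")      # average distance 1.5
```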

preprint, 2014, arXiv

Distance Queries from Sampled Data: Accurate and Efficient

Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have. We derive novel estimators for estimating $L_p$ distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties. We study the performance of our Manhattan and Euclidean distance ($p=1,2$) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.

preprint, 2014, arXiv

Estimation for Monotone Sampling: Competitiveness and Customization

Random samples are lossy summaries which allow queries posed over the data to be approximated by applying an appropriate estimator to the sample. The effectiveness of sampling, however, hinges on estimator selection. The choice of estimators is subject to global requirements, such as unbiasedness and range restrictions on the estimate value, and ideally, we seek estimators that are both efficient to derive and apply and admissible (not dominated, in terms of variance, by other estimators). Nevertheless, for a given data domain, sampling scheme, and query, there are many admissible estimators. We study the choice of admissible nonnegative and unbiased estimators for monotone sampling schemes. Monotone sampling schemes are implicit in many applications of massive data set analysis. Our main contribution is general derivations of admissible estimators with desirable properties. We present a construction of order-optimal estimators, which minimize variance according to any specified priorities over the data domain. Order-optimality allows us to customize the derivation to common patterns that we can learn or observe in the data. When we prioritize lower values (e.g., more similar data sets when estimating difference), we obtain the L$^*$ estimator, which is the unique monotone admissible estimator. We show that the L$^*$ estimator is 4-competitive and dominates the classic Horvitz-Thompson estimator. These properties make the L$^*$ estimator a natural default choice. We also present the U$^*$ estimator, which prioritizes large values (e.g., less similar data sets). Our estimator constructions are both easy to apply and possess desirable properties, allowing us to make the most from our summarized data.

preprint, 2014, arXiv

Probe Scheduling for Efficient Detection of Silent Failures

Most discovery systems for silent failures work in two phases: a continuous monitoring phase that detects presence of failures through probe packets and a localization phase that pinpoints the faulty element(s). This separation is important because localization requires significantly more resources than detection and should be initiated only when a fault is present. We focus on improving the efficiency of the detection phase, where the goal is to balance the overhead with the cost associated with longer failure detection times. We formulate a general model which unifies the treatment of probe scheduling mechanisms, stochastic or deterministic, and different cost objectives: minimizing average detection time (SUM) or worst-case detection time (MAX). We then focus on two classes of schedules. Memoryless schedules are a subclass of stochastic schedules that is simple and suitable for distributed deployment. We show that the optimal memoryless schedulers can be efficiently computed by convex programs (for SUM objectives) or linear programs (for MAX objectives), and surprisingly perhaps, are guaranteed to have expected detection times that are not too far off the (NP hard) stochastic optima. Deterministic schedules allow us to bound the maximum (rather than expected) cost of undetected faults, but like stochastic schedules, are NP hard to optimize. We develop novel efficient deterministic schedulers with provable approximation ratios. An extensive simulation study on real networks demonstrates significant performance gains of our memoryless and deterministic schedulers over previous approaches. Our unified treatment also facilitates a clear comparison between different objectives and scheduling mechanisms.

preprint, 2014, arXiv

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces. Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1-1/e) and actually produces a sequence of nodes, with each prefix having approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.
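
The greedy procedure described in the abstract can be sketched on a simplified, deterministic influence function. In the Python example below the influence of a seed set is taken to be the number of nodes covered by the union of per-node reachability sets, which is an assumption made for brevity; the Independent Cascade model and the sketch-based SKIM algorithm are not implemented here, and the names are hypothetical.

```python
def greedy_seed_sequence(reach, k):
    """Return a greedy sequence of up to k seeds and the set of nodes they cover.

    `reach[v]` is the set of nodes influenced by seed v (precomputed, deterministic).
    Each step adds the node with the largest marginal gain in coverage;
    ties are broken by the smallest node name.
    """
    covered, sequence = set(), []
    candidates = set(reach)
    while candidates and len(sequence) < k:
        best = max(sorted(candidates), key=lambda v: len(reach[v] - covered))
        sequence.append(best)
        covered |= reach[best]
        candidates.remove(best)
    return sequence, covered


reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
sequence, covered = greedy_seed_sequence(reach, k=2)
```

Here `c` is chosen first (gain 4) and `a` second (gain 3). Every prefix of the returned sequence is itself a greedy solution for its length, which is the property the abstract refers to.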

preprint, 2014, arXiv

Variance Competitiveness for Monotone Estimation: Tightening the Bounds

Random samples are extensively used to summarize massive data sets and facilitate scalable analytics. Coordinated sampling, where samples of different data sets "share" the randomization, is a powerful method which facilitates more accurate estimation of many aggregates and similarity measures. We recently formulated a model of Monotone Estimation Problems (MEP), which can be applied to coordinated sampling, projected on a single item. MEP estimators can then be used to estimate sum aggregates, such as distances, over coordinated samples. For MEP, we are interested in estimators that are unbiased and nonnegative. We proposed variance competitiveness as a quality measure of estimators: For each data vector, we consider the minimum variance attainable on it by an unbiased and nonnegative estimator. We then define the competitiveness of an estimator as the maximum ratio, over data, of the expectation of the square to the minimum possible. We also presented a general construction of the L$^*$ estimator, which is defined for any MEP for which a nonnegative unbiased estimator exists, and is at most 4-competitive. Our aim here is to obtain tighter bounds on the universal ratio, which we define to be the smallest competitive ratio that can be obtained for any MEP. We obtain an upper bound of 3.375, improving over the bound of $4$ of the L$^*$ estimator. We also establish a lower bound of 1.44. The lower bound is obtained by constructing the optimally competitive estimator for particular MEPs. The construction is of independent interest, as it facilitates estimation with instance-optimal competitiveness.

preprint, 2013, arXiv

A Labeling Approach to Incremental Cycle Detection

In the incremental cycle detection problem, arcs are added to a directed acyclic graph and the algorithm has to report if the new arc closes a cycle. One seeks to minimize the total time to process the entire sequence of arc insertions, or until a cycle appears. In a recent breakthrough, Bender, Fineman, Gilbert and Tarjan [BeFiGiTa11] presented two different algorithms, with time complexity $O(n^2 \log n)$ and $O(m \cdot \min \{m^{1/2}, n^{2/3} \})$, respectively. In this paper we introduce a new technique for incremental cycle detection that allows us to obtain both bounds (up to a logarithmic factor). Furthermore, our approach seems more amenable to distributed implementation.
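
To make the problem statement concrete, the following Python sketch shows a naive baseline: before inserting an arc $u \to v$ it searches from $v$ for $u$, which costs up to $O(m)$ per insertion. This is neither the labeling technique of the paper nor either algorithm of Bender et al.; the class name and the choice to reject cycle-closing arcs are assumptions of this illustration.

```python
class NaiveIncrementalDag:
    """Maintains a DAG under arc insertions using a full search per insertion."""

    def __init__(self):
        self.succ = {}  # node -> set of direct successors

    def _reaches(self, start, target):
        # Iterative depth-first search from `start`, looking for `target`.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == target:
                return True
            for nxt in self.succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def add_arc(self, u, v):
        """Try to insert u -> v. Returns True if the arc would close a cycle
        (in which case it is not inserted), otherwise inserts it and returns False."""
        if self._reaches(v, u):
            return True
        self.succ.setdefault(u, set()).add(v)
        return False


dag = NaiveIncrementalDag()
results = [
    dag.add_arc("a", "b"),  # no cycle
    dag.add_arc("b", "c"),  # no cycle
    dag.add_arc("c", "a"),  # would close a -> b -> c -> a
    dag.add_arc("a", "c"),  # shortcut arc, no cycle
    dag.add_arc("a", "a"),  # a self-loop is a cycle
]
```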

preprint, 2013, arXiv

On the Tradeoff between Stability and Fit

In computing, as in many aspects of life, changes incur cost. Many optimization problems are formulated as a one-time instance starting from scratch. However, a common case that arises is when we already have a set of prior assignments, and must decide how to respond to a new set of constraints, given that each change from the current assignment comes at a price. That is, we would like to maximize the fitness or efficiency of our system, but we need to balance it with the changeout cost from the previous state. We provide a precise formulation for this tradeoff and analyze the resulting stable extensions of some fundamental problems in measurement and analytics. Our main technical contribution is a stable extension of PPS (probability proportional to size) weighted random sampling, with applications to monitoring and anomaly detection problems. We also provide a general framework that applies to top-$k$, minimum spanning tree, and assignment. In both cases, we are able to provide exact solutions, and discuss efficient incremental algorithms that can find new solutions as the input changes.

preprint, 2013, arXiv

What you can do with Coordinated Samples

Sample coordination, where similar instances have similar samples, was proposed by statisticians four decades ago as a way to maximize overlap in repeated surveys. Coordinated sampling has since been used for summarizing massive data sets. The usefulness of a sampling scheme hinges on the scope and accuracy within which queries posed over the original data can be answered from the sample. We aim here to gain a fundamental understanding of the limits and potential of coordination. Our main result is a precise characterization, in terms of simple properties of the estimated function, of queries for which estimators with desirable properties exist. We consider unbiasedness, nonnegativity, finite variance, and bounded estimates. Since generally a single estimator cannot be optimal (minimize variance simultaneously) for all data, we propose variance competitiveness, which means that the expectation of the square on any data is not too far from the minimum one possible for the data. Surprisingly perhaps, we show how to construct, for any function for which an unbiased nonnegative estimator exists, a variance competitive estimator.

preprint, 2011, arXiv

Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and meet resource constraints on bandwidth or battery power. Estimators that are applied to the sample facilitate fast approximate processing of queries posed over the original data and the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in queries that span multiple instances, such as distinct counts and distance measures over selected records. These queries are used for applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective as the relative error decreases with the number of selected record keys. The Horvitz-Thompson estimator, known to minimize variance for sampling with "all or nothing" outcomes (which reveal either the exact value or no information on the estimated quantity), is not optimal for multi-instance operations for which an outcome may provide partial information. We present a general principled methodology for the derivation of (Pareto) optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in estimate accuracy of fundamental queries for common sampling schemes.
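
As background for the single-instance baseline named in the abstract, the Python sketch below draws a Poisson PPS sample and attaches Horvitz-Thompson adjusted weights, so that a subset sum is estimated by adding the adjusted weights of sampled keys in the subset. The function names, the threshold parameter `tau` and the example data are assumptions of this illustration; the paper's multi-instance estimators are not shown.

```python
import random


def poisson_pps_sample(weights, tau, seed=0):
    """Include each key independently with probability min(1, w / tau).

    Returns key -> Horvitz-Thompson adjusted weight w / p, which equals max(w, tau).
    Assumes nonnegative weights and tau > 0.
    """
    rng = random.Random(seed)
    sample = {}
    for key, w in weights.items():
        p = min(1.0, w / tau)
        if rng.random() < p:
            sample[key] = w / p
    return sample


def estimate_subset_sum(sample, selected):
    """Unbiased estimate of the total weight of keys in `selected`."""
    return sum(adj for key, adj in sample.items() if key in selected)


weights = {"a": 10.0, "b": 2.0, "c": 0.5}
full = poisson_pps_sample(weights, tau=0.5)     # every key has p = 1
sparse = poisson_pps_sample(weights, tau=4.0)   # "a" is certain, "b" and "c" are not
exact_estimate = estimate_subset_sum(full, {"a", "c"})
```

Each key contributes $p \cdot (w/p) = w$ in expectation, which is why the estimate is unbiased. A sampled key reveals its exact weight and an unsampled key reveals nothing, which is the "all or nothing" setting the abstract contrasts with multi-instance outcomes.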

preprint, 2011, arXiv

Structure-Aware Sampling: Flexible and Accurate Summarization

In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious to the structure in the set of keys. The particular structure, such as hierarchy, order, or product space (multi-dimensional), makes range queries much more relevant for most analysis of the data. Dedicated summarization algorithms for range-sum queries have also been extensively studied. They can outperform existing sampling schemes in terms of accuracy on range queries per summary size. Their accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They are also less flexible, being targeted at range-sum queries alone, and are often quite costly to build and use. In this paper we propose and evaluate variance optimal sampling schemes that are structure-aware. These summaries improve over the accuracy of existing structure-oblivious sampling schemes on range queries while retaining the benefits of sample-based summaries: flexible summaries, with high accuracy on both range queries and arbitrary subset queries.

preprint2011arXiv

Truth and Envy in Capacitated Allocation Games

We study auctions with additive valuations where agents have a limit on the number of goods they may receive. We refer to such valuations as {\em capacitated} and seek mechanisms that maximize social welfare and are simultaneously incentive compatible, envy-free, individually rational, and have no positive transfers. If capacities are infinite, then sequentially repeating the 2nd price Vickrey auction meets these requirements. In 1983, Leonard showed that for unit capacities, VCG with Clarke Pivot payments is also envy free. For capacities that are all unit or all infinite, the mechanism produces a Walrasian pricing (subject to capacity constraints). Here, we consider general capacities. For homogeneous capacities (all capacities equal) we show that VCG with Clarke Pivot payments is envy free (VCG with Clarke Pivot payments is always incentive compatible, individually rational, and has no positive transfers). Contrariwise, there is no incentive compatible Walrasian pricing. For heterogeneous capacities, we show that there is no mechanism with all 4 properties, but at least in some cases, one can achieve both incentive compatibility and envy freeness.
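The infinite-capacity case described above can be sketched as one independent second-price auction per good. This is a minimal illustration under assumed inputs. The function name, the data layout and the tie-breaking rule (larger agent name wins) are hypothetical and not taken from the paper.

```python
def second_price_per_good(valuations):
    """Run one second-price auction per good.

    valuations: {agent: {good: value}}; a missing entry counts as value 0.
    Returns {good: (winner, price)}, where the price is the second-highest bid.
    """
    goods = sorted({good for bids in valuations.values() for good in bids})
    outcome = {}
    for good in goods:
        # Sort bids from highest to lowest; ties are broken by agent name.
        ranked = sorted(
            ((bids.get(good, 0.0), agent) for agent, bids in valuations.items()),
            reverse=True,
        )
        winner = ranked[0][1]
        price = ranked[1][0] if len(ranked) > 1 else 0.0
        outcome[good] = (winner, price)
    return outcome


# Hypothetical additive valuations for two agents and two goods.
valuations = {
    "alice": {"x": 5.0, "y": 1.0},
    "bob": {"x": 3.0, "y": 4.0},
}
result = second_price_per_good(valuations)
```

With finite capacities an agent may not be able to keep every good it wins this way, which is where the capacitated setting studied in the paper departs from this simple scheme.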

preprint2010arXiv

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and aggregates over multiple sets of weights such as the $L_1$ difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single {\em scalar} weight is associated with each key. We develop a sampling framework based on {\em coordinated weighted samples} that is suited for multiple weight assignments and obtain estimators that are {\em orders of magnitude tighter} than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network to stock quotes data.
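One common way to coordinate weighted samples is to derive a single pseudo-random seed per key from a hash and reuse it for every weight assignment. The sketch below takes a bottom-k sample per assignment using the rank `seed / weight`. It is an illustration under assumptions, not the paper's exact construction: the function names, the salt and the example weights are hypothetical, and the estimators themselves are omitted.

```python
import hashlib
import heapq


def shared_seed(key, salt="demo"):
    """Map a key to a reproducible number in (0, 1], shared by all assignments."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def bottom_k(weights, k):
    """Return the k keys with the smallest rank seed(key) / weight(key)."""
    ranked = (
        (shared_seed(key) / weight, key)
        for key, weight in weights.items()
        if weight > 0
    )
    return {key for _, key in heapq.nsmallest(k, ranked)}


# Hypothetical weight assignments over the same keys (e.g. two time periods).
period_1 = {f"key{i}": 1.0 + (i % 5) for i in range(100)}
period_2 = {key: 2.0 * weight for key, weight in period_1.items()}

sample_1 = bottom_k(period_1, 10)
sample_2 = bottom_k(period_2, 10)
```

Because the seed of a key is the same in both periods, scaling all weights by a common factor leaves the order of the ranks unchanged and the two samples coincide. More generally, similar assignments tend to produce overlapping samples, which is what makes multi-assignment aggregates easier to estimate.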

preprint2010arXiv

On the Interplay between Incentive Compatibility and Envy Freeness

We study mechanisms for an allocation of goods among agents, where agents have no incentive to lie about their true values (incentive compatible) and for which no agent will seek to exchange outcomes with another (envy-free). Mechanisms satisfying each requirement separately have been studied extensively, but there are few results on mechanisms achieving both. We are interested in those allocations for which there exist payments such that the resulting mechanism is simultaneously incentive compatible and envy-free. Cyclic monotonicity is a characterization of incentive compatible allocations; local efficiency is a characterization of envy-free allocations. We combine the above to give a characterization for allocations which are both incentive compatible and envy-free. We show that even for allocations that allow payments leading to incentive compatible mechanisms, and other payments leading to envy-free mechanisms, there may not exist any payments for which the mechanism is simultaneously incentive compatible and envy-free. The characterization that we give lets us compute the set of Pareto-optimal mechanisms that trade off envy freeness for incentive compatibility.

preprint2010arXiv

Stream sampling for variance-optimal estimation of subset sums

From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\mathrm{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality. $\mathrm{VarOpt}_k$ provides {\em variance optimal unbiased estimation of subset sums}. More precisely, if we have seen $n$ items of the stream, then for {\em any} subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of {\em particular} subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
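The reservoir update can be sketched as follows: after each arrival the reservoir holds $k+1$ candidates, a threshold $\tau$ is chosen so that $\sum_i \min(1, w_i/\tau) = k$, one candidate with weight at most $\tau$ is dropped with probability $1 - w_i/\tau$, and the surviving small candidates receive adjusted weight $\tau$. The code below is a simplified illustration that re-sorts the reservoir on every item, so it does not achieve the $O(\log k)$ update time stated in the abstract. The function name and the example stream are assumptions, and weights are assumed to be strictly positive.

```python
import random


def varopt_sample(stream, k, rng=None):
    """Return k (item, adjusted_weight) pairs from a stream of (item, weight)."""
    rng = rng or random.Random()
    reservoir = []
    for item, weight in stream:
        if weight <= 0:
            raise ValueError("weights must be strictly positive")
        reservoir.append((item, float(weight)))
        if len(reservoir) <= k:
            continue
        # k + 1 candidates, heaviest first.
        reservoir.sort(key=lambda pair: pair[1], reverse=True)
        # Grow the set of "large" candidates (kept with their own weight)
        # until the heaviest remaining weight is at most tau.
        large = 0
        tau = sum(w for _, w in reservoir) / k
        while reservoir[large][1] > tau:
            large += 1
            tau = sum(w for _, w in reservoir[large:]) / (k - large)
        small = reservoir[large:]
        # Drop probabilities of the small candidates sum to 1.
        drop_probs = [1.0 - w / tau for _, w in small]
        victim = rng.choices(range(len(small)), weights=drop_probs)[0]
        survivors = [(it, tau) for i, (it, _) in enumerate(small) if i != victim]
        reservoir = reservoir[:large] + survivors
    return reservoir


# Hypothetical stream: one heavy item followed by many light ones.
stream = [("big", 1000.0)] + [(i, 1.0 + (i % 7)) for i in range(200)]
sample = varopt_sample(stream, 10, random.Random(0))
```

The weight of any subset is then estimated by summing the adjusted weights of the sampled items that fall in the subset. In this sketch the adjusted weights always add up to the total stream weight, and items much heavier than the threshold stay in the reservoir with their original weight.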

26 published item(s)

Stochastic Matching via Local Sparsification

On the Robustness of CountSketch to Adaptive Inputs

Tricking the Hashing Trick: A Tight Lower Bound on the Robustness of CountSketch to Adaptive Inputs

Graph Learning with Loss-Guided Training

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Distance-Based Influence in Networks: Computation and Maximization

Greedy Maximization Framework for Graph-based Influence Functions

Reverse Ranking by Graph Structure: Model and Scalable Algorithms

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees

Stream Sampling for Frequency Cap Statistics

Computing Classic Closeness Centrality, at Scale

Distance Queries from Sampled Data: Accurate and Efficient

Estimation for Monotone Sampling: Competitiveness and Customization

Probe Scheduling for Efficient Detection of Silent Failures

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Variance Competitiveness for Monotone Estimation: Tightening the Bounds

A Labeling Approach to Incremental Cycle Detection

On the Tradeoff between Stability and Fit

What you can do with Coordinated Samples

Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

Structure-Aware Sampling: Flexible and Accurate Summarization

Truth and Envy in Capacitated Allocation Games

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

On the Interplay between Incentive Compatibility and Envy Freeness

Stream sampling for variance-optimal estimation of subset sums