Source author record

Sergei Vassilvitskii

Sergei Vassilvitskii appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Data Structures and Algorithms; Machine Learning; Computer Science and Game Theory; Cryptography and Security; Databases; Computational Engineering, Finance, and Science; Computational Geometry; Distributed, Parallel, and Cluster Computing; Information Retrieval; Information Theory; math.IT

Catalog footprint

What is connected

16 works

11 topics

4 close collaborators


preprint, 2023, arXiv

Differentially Private Continual Releases of Streaming Frequency Moment Estimations

The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible. Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting. To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters and counting low frequency elements, to the simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem. We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
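The abstract reduces its primitives to counting/summing under continual release. As a rough illustration of that base problem only, the sketch below uses the well-known binary (dyadic tree) mechanism for a private running count. It is an assumption that this resembles the counting primitive the paper builds on; the function name `binary_mechanism_counts` and the injectable `noise` parameter are hypothetical and exist only for this example.

```python
import random


def laplace(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def binary_mechanism_counts(stream, epsilon, noise=laplace):
    """Release a noisy running count after every item of a 0/1 stream.

    Each item contributes to at most `levels` dyadic partial sums, so adding
    Laplace noise of scale levels / epsilon to each partial sum keeps the
    whole sequence of outputs epsilon-differentially private.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    levels = len(stream).bit_length()      # number of dyadic levels needed
    scale = levels / epsilon
    exact = [0] * levels                   # exact dyadic partial sums
    noisy = [0.0] * levels                 # their noisy counterparts
    outputs = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1      # index of the lowest set bit of t
        # Merge all lower levels plus the new item into level i.
        exact[i] = sum(exact[:i]) + x
        for j in range(i):
            exact[j] = 0
            noisy[j] = 0.0
        noisy[i] = exact[i] + noise(scale)
        # The prefix [1, t] is the union of the blocks named by the bits of t.
        outputs.append(sum(noisy[j] for j in range(levels) if (t >> j) & 1))
    return outputs
```

Each released count sums at most `levels` noisy blocks, so the additive error grows polylogarithmically in the stream length, which is the flavor of guarantee the abstract states for its more general problem.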

preprint, 2022, arXiv

Plume: Differential Privacy at Scale

Differential privacy has become the standard for private data analysis, and an extensive literature now offers differentially private solutions to a wide variety of problems. However, translating these solutions into practical systems often requires confronting details that the literature ignores or abstracts away: users may contribute multiple records, the domain of possible records may be unknown, and the eventual system must scale to large volumes of data. Failure to carefully account for all three issues can severely impair a system's quality and usability. We present Plume, a system built to address these problems. We describe a number of sometimes subtle implementation issues and offer practical solutions that, together, make an industrial-scale system for differentially private data analysis possible. Plume is currently deployed at Google and is routinely used to process datasets with trillions of records.

preprint, 2022, arXiv

Scalable Differentially Private Clustering via Hierarchically Separated Trees

We study the private $k$-median and $k$-means clustering problem in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \varepsilon^2)$, where $\varepsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.

preprint, 2020, arXiv

Algorithms with Predictions

We introduce algorithms that use predictions from machine learning applied to the input to circumvent worst-case analysis. We aim for algorithms that have near optimal performance when these predictions are good, but recover the prediction-less worst case behavior when the predictions have large errors.

preprint, 2020, arXiv

Competitive caching with machine learned advice

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors. We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size $k$. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle's predictions, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by $O(\log k)$, which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.

preprint, 2020, arXiv

Fair Hierarchical Clustering

As machine learning has become more prevalent, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering. In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a specific objective. For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, with only a negligible loss in the objective.

preprint, 2016, arXiv

A Field Guide to Personalized Reserve Prices

We study the question of setting and testing reserve prices in single item auctions when the bidders are not identical. At a high level, there are two generalizations of the standard second price auction: in the lazy version we first determine the winner, and then apply reserve prices; in the eager version we first discard the bidders not meeting their reserves, and then determine the winner among the rest. We show that the two versions have dramatically different properties: lazy reserves are easy to optimize, and A/B test in production, whereas eager reserves always lead to higher welfare, but their optimization is NP-complete, and naive A/B testing will lead to incorrect conclusions. Despite their different characteristics, we show that the overall revenue for the two scenarios is always within a factor of 2 of each other, even in the presence of correlated bids. Moreover, we prove that the eager auction dominates the lazy auction on revenue whenever the bidders are independent or symmetric. We complement our theoretical results with simulations on real world data that show that even suboptimally set eager reserve prices are preferred from a revenue standpoint.
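The lazy and eager rules described above can be made concrete with a small sketch. It assumes sealed bids, one personalized reserve per bidder, and a second-price payment floored at the winner's own reserve; the function names `lazy_auction` and `eager_auction` are hypothetical and chosen only for this illustration.

```python
def lazy_auction(bids, reserves):
    """Lazy reserves: pick the highest bidder first, then apply the reserve.

    Returns (winner_index, payment), or (None, 0.0) if the item is unsold.
    """
    if not bids:
        return None, 0.0
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner = order[0]
    if bids[winner] < reserves[winner]:
        return None, 0.0                  # top bidder misses the reserve
    second = bids[order[1]] if len(order) > 1 else 0.0
    return winner, max(reserves[winner], second)


def eager_auction(bids, reserves):
    """Eager reserves: drop bidders below their reserve, then pick the winner.

    Returns (winner_index, payment), or (None, 0.0) if nobody qualifies.
    """
    alive = [i for i in range(len(bids)) if bids[i] >= reserves[i]]
    if not alive:
        return None, 0.0
    alive.sort(key=lambda i: bids[i], reverse=True)
    winner = alive[0]
    second = bids[alive[1]] if len(alive) > 1 else 0.0
    return winner, max(reserves[winner], second)
```

With bids `[10, 8]` and reserves `[12, 5]`, the lazy rule leaves the item unsold because the top bidder misses a reserve of 12, while the eager rule discards that bidder and sells to the second bidder at a price of 5. This is one instance of the two formats producing different outcomes on the same bids.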

preprint, 2016, arXiv

Submodular Optimization over Sliding Windows

Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last $W$ items. In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular, we give a $(\frac{1}{3} - \varepsilon)$-approximation algorithm that uses space polylogarithmic in the spread of the values of the elements, $\Phi$, and linear in the solution size $k$ for any constant $\varepsilon > 0$. At the same time, processing each element only requires a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $(\frac{1}{2} - \varepsilon)$-approximation and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem. We demonstrate the efficacy of the algorithms on a number of real world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a negligible fraction of them.

preprint, 2015, arXiv

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis, especially when the objects being compared are high-dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can't sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data. In this paper, we build these tools for information distances: both the Hellinger distance and the Jensen-Shannon divergence, as well as related measures such as the $\chi^2$ divergence. We first show that they can be sketched efficiently (i.e., up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite dimensionality embedding result for the Jensen-Shannon and $\chi^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen-Shannon, and $\chi^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space). While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.

preprint, 2014, arXiv

Value of Targeting

We undertake a formal study of the value of targeting data to an advertiser. As expected, this value is increasing in the utility difference between realizations of the targeting data and the accuracy of the data, and depends on the distribution of competing bids. However, this value may vary non-monotonically with an advertiser's budget. Similarly, modeling the values as either private or correlated, or allowing other advertisers to also make use of the data, leads to unpredictable changes in the value of data. We address questions related to multiple data sources, show that utility of additional data may be non-monotonic, and provide tradeoffs between the quality and the price of data sources. In a game-theoretic setting, we show that advertisers may be worse off than if the data had not been available at all. We also ask whether a publisher can infer the value an advertiser would place on targeting data from the advertiser's bidding behavior and illustrate that this is impossible.

preprint, 2012, arXiv

Ad Serving Using a Compact Allocation Plan

A large fraction of online display advertising is sold via guaranteed contracts: a publisher guarantees to the advertiser a certain number of user visits satisfying the targeting predicates of the contract. The publisher is then tasked with solving the ad serving problem: given a user visit, which of the thousands of matching contracts should be displayed, so that by the expiration time every contract has obtained the requisite number of user visits. The challenges of the problem come from (1) the sheer size of the problem being solved, with tens of thousands of contracts and billions of user visits, (2) the unpredictability of user behavior, since these contracts are sold months ahead of time, when only a forecast of user visits is available, and (3) the minute amount of resources available online, as an ad server must respond with a matching contract in a fraction of a second. We present a solution to the guaranteed delivery ad serving problem using compact allocation plans. These plans, computed offline, can be efficiently queried by the ad server during an ad call; they are small, using only O(1) space per contract; and are stateless, allowing for distributed serving without any central coordination. We evaluate this approach on a real set of user visits and guaranteed contracts and show that the compact allocation plans are an effective way of solving the guaranteed delivery ad serving problem.

preprint, 2012, arXiv

Densest Subgraph in Streaming and MapReduce

The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any $\epsilon > 0$, our algorithms make $O(\log n / \log(1+\epsilon))$ passes over the input and find a subgraph whose density is guaranteed to be within a factor $2(1+\epsilon)$ of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.
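The abstract does not spell out the procedure. The sketch below assumes the batch-peeling rule usually associated with this result: in each pass, every node whose degree in the remaining graph is at most 2(1 + epsilon) times the current density is removed, and the densest intermediate subgraph is kept. Density is taken to be edges divided by nodes, the graph is held in memory rather than streamed, and the name `densest_subgraph` is hypothetical.

```python
def densest_subgraph(edges, epsilon=0.1):
    """Approximate the densest subgraph of an undirected graph by peeling.

    Returns (node_set, density) for the best subgraph seen across passes.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue                      # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    nodes = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(), 0.0

    while nodes:
        density = num_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
        # The average degree is 2 * density, so at least one node always
        # falls below this threshold and every pass makes progress.
        threshold = 2 * (1 + epsilon) * density
        low = [u for u in nodes if len(adj[u]) <= threshold]
        for u in low:
            for v in adj.pop(u):
                adj[v].discard(u)
                num_edges -= 1
            nodes.discard(u)
    return best_nodes, best_density
```

On a 4-clique with a two-edge tail attached, the first pass strips the tail and the clique (density 1.5) is returned. Each `while` iteration corresponds to one pass over the input in the streaming reading of the algorithm.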

preprint, 2012, arXiv

Scalable K-Means++

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
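A minimal single-machine sketch of the oversampling idea behind k-means|| follows. The abstract gives only the pass counts, so the details here are assumptions: each pass samples every point independently with probability proportional to its squared distance from the current centers, scaled by a hypothetical `oversampling` factor, and the resulting candidates are then weighted and reduced to k centers with weighted k-means++ seeding. All function and parameter names are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeanspp(cands, weights, k, rng):
    """Weighted k-means++ seeding over a small candidate set."""
    chosen = [rng.choices(cands, weights=weights)[0]]
    while len(chosen) < min(k, len(cands)):
        scores = [w * min(sq_dist(c, s) for s in chosen)
                  for c, w in zip(cands, weights)]
        if sum(scores) == 0:
            break                         # remaining candidates add nothing
        chosen.append(rng.choices(cands, weights=scores)[0])
    return chosen


def kmeans_parallel_init(points, k, oversampling, rounds, rng=random):
    """Pick k initial centers using a few oversampling passes."""
    centers = [rng.choice(points)]
    for _ in range(rounds):
        # One pass: distance of every point to its nearest current center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        cost = sum(d2)
        if cost == 0:
            break
        # Each point is sampled independently, so a pass parallelizes.
        centers.extend(p for p, d in zip(points, d2)
                       if rng.random() < min(1.0, oversampling * d / cost))
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        weights[nearest] += 1
    return weighted_kmeanspp(centers, weights, k, rng)
```

For example, `kmeans_parallel_init(points, k=2, oversampling=4, rounds=5, rng=random.Random(0))` returns two of the input points to use as starting centers for Lloyd's iterations. The number of passes is `rounds` rather than k, which is the property the abstract emphasizes.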

preprint, 2012, arXiv

SHALE: An Efficient Algorithm for Allocation of Guaranteed Display Advertising

Motivated by the problem of optimizing allocation in guaranteed display advertising, we develop an efficient, lightweight method of generating a compact allocation plan that can be used to guide ad server decisions. The plan itself uses just O(1) state per guaranteed contract, is robust to noise, and allows us to serve (provably) nearly optimally. The optimization method we develop is scalable, has a small in-memory footprint, and works in linear time per iteration. It is also "stop-anytime", meaning that time-critical applications can stop early and still get a good serving solution. Thus, it is particularly useful for optimizing the large problems arising in the context of display advertising. We demonstrate the effectiveness of our algorithm using actual Yahoo! data.

preprint, 2011, arXiv

Factorization-based Lossless Compression of Inverted Indices

Many large-scale Web applications that require ranked top-k retrieval, such as Web search and online advertising, are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document association. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space that minimizes the resulting index size as a matrix factorization problem, and prove that finding the optimal factorization is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is an increase in the number of terms in the index, which may negatively affect query evaluation performance. To eliminate this effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach. Our experimental evaluation demonstrates that our approach achieves an index size reduction of 20%, while maintaining the same query response times. Higher compression ratios of up to 35% are achievable, but at the cost of slightly longer query response times. Furthermore, combining our approach with other lossless compression techniques, namely variable-byte encoding, leads to an index size reduction of up to 50%.

preprint, 2010, arXiv

Inventory Allocation for Online Graphical Display Advertising

We discuss a multi-objective/goal programming model for the allocation of inventory of graphical advertisements. The model considers two types of campaigns: guaranteed delivery (GD), which are sold months in advance, and non-guaranteed delivery (NGD), which are sold using real-time auctions. We investigate various advertiser and publisher objectives such as (a) revenue from the sale of impressions, clicks and conversions, (b) future revenue from the sale of NGD inventory, and (c) "fairness" of allocation. While the first two objectives are monetary, the third is not. This combination of demand types and objectives leads to potentially many variations of our model, which we delineate and evaluate. Our experimental results, which are based on optimization runs using real data sets, demonstrate the effectiveness and flexibility of the proposed model.
