Source author record

Jonathan Ullman

Jonathan Ullman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Cryptography and Security Computer Science and Game Theory Information Theory math.IT Computational Complexity Databases

Catalog footprint

What is connected

33works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Testable and Actionable Calibration for Full Swap Regret

AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.

preprint2022arXiv

A Private and Computationally-Efficient Estimator for Unbounded Gaussians

We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $\mathcal{N}(μ,Σ)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $μ$ and $Σ$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0,Σ)$ and returns a matrix $A$ such that $A ΣA^T$ has constant condition number.

preprint2022arXiv

Fair and Useful Cohort Selection

A challenge in fair algorithm design is that, while there are compelling notions of individual fairness, these notions typically do not satisfy desirable composition properties, and downstream applications based on fair classifiers might not preserve fairness. To study fairness under composition, Dwork and Ilvento introduced an archetypal problem called fair-cohort-selection problem, where a single fair classifier is composed with itself to select a group of candidates of a given size, and proposed a solution to this problem. In this work we design algorithms for selecting cohorts that not only preserve fairness, but also maximize the utility of the selected cohort under two notions of utility that we introduce and motivate. We give optimal (or approximately optimal) polynomial-time algorithms for this problem in both an offline setting, and an online setting where candidates arrive one at a time and are classified as they arrive.

preprint2022arXiv

How to Combine Membership-Inference Attacks on Multiple Updated Models

A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage over attacks on standalone models, but also compared to a prior MI attack that takes advantage of model updates in a related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. Our code is available at https://www.github.com/stanleykywu/model-updates.

preprint2022arXiv

Multitask Learning via Shared Features: Algorithms and Hardness

We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k \ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $γ$, which is based on a simultaneous boosting technique and requires only $\textrm{poly}(k/γ)$ samples-per-task and $\textrm{poly}(k\log(d)/γ)$ samples in total. In addition, we prove a computational separation, showing that assuming there exists a concept class that cannot be learned in the attribute-efficient model, we can construct another concept class such that can be learned in the attribute-efficient model, but cannot be multitask learned efficiently -- multitask learning this concept class either requires super-polynomial time complexity or a much larger total number of samples.

preprint2022arXiv

Private Identity Testing for High-Dimensional Distributions

In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/α^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.

preprint2021arXiv

Private Mean Estimation of Heavy-Tailed Distributions

We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$-th moments. Roughly speaking, in the univariate case, we show that $n = Θ\left(\frac{1}{α^2} + \frac{1}{α^{\frac{k}{k-1}}\varepsilon}\right)$ samples are necessary and sufficient to estimate the mean to $α$-accuracy under $\varepsilon$-differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case.

preprint2020arXiv

A Primer on Private Statistics

Differentially private statistical estimation has seen a flurry of developments over the last several years. Study has been divided into two schools of thought, focusing on empirical statistics versus population statistics. We suggest that these two lines of work are more similar than different by giving examples of methods that were initially framed for empirical statistics, but can be applied just as well to population statistics. We also provide a thorough coverage of recent work in this area.

preprint2020arXiv

Auditing Differentially Private Machine Learning: How Private is Private SGD?

We investigate whether Differentially Private SGD offers better privacy in practice than what is guaranteed by its state-of-the-art analysis. We do so via novel data poisoning attacks, which we show correspond to realistic privacy attacks. While previous work (Ma et al., arXiv 2019) proposed this connection between differential privacy and data poisoning as a defense against data poisoning, our use as a tool for understanding the privacy of a specific mechanism is new. More generally, our work takes a quantitative, empirical approach to understanding the privacy afforded by specific implementations of differentially private algorithms that we believe has the potential to complement and influence analytical work on differential privacy.

preprint2020arXiv

Efficient Private Algorithms for Learning Large-Margin Halfspaces

We present new differentially private algorithms for learning a large-margin halfspace. In contrast to previous algorithms, which are based on either differentially private simulations of the statistical query model or on private convex optimization, the sample complexity of our algorithms depends only on the margin of the data, and not on the dimension. We complement our results with a lower bound, showing that the dependence of our upper bounds on the margin is optimal.

preprint2020arXiv

Private Query Release Assisted by Public Data

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. First, we show that we can solve the problem for any query class $\mathcal{H}$ of finite VC-dimension using only $d/α$ public samples and $\sqrt{p}d^{3/2}/α^2$ private samples, where $d$ and $p$ are the VC-dimension and dual VC-dimension of $\mathcal{H}$, respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size $d/α^2$ is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on $p$ and $α$. For the class of decision stumps, we give a lower bound of $\sqrt{p}/α$ on the private sample complexity whenever the public sample size is less than $1/α^2$. Given our upper bounds, this shows that the dependence on $\sqrt{p}$ is necessary in the private sample complexity. We also give a lower bound of $1/α$ on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in $α$.

preprint2016arXiv

Make Up Your Mind: The Price of Online Queries in Differential Privacy

We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set Q of allowable queries in one of three ways, which we list in order from easiest to hardest to answer: Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch. Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query. Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries. Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.

preprint2016arXiv

Some Pairs Problems

A common form of MapReduce application involves discovering relationships between certain pairs of inputs. Similarity joins serve as a good example of this type of problem, which we call a "some-pairs" problem. In the framework of Afrati et al. (VLDB 2013), algorithms are measured by the tradeoff between reducer size (maximum number of inputs a reducer can handle) and the replication rate (average number of reducers to which an input must be sent. There are two obvious approaches to solving some-pairs problems in general. We show that no general-purpose MapReduce algorithm can beat both of these two algorithms in the worst case. We then explore a recursive algorithm for solving some-pairs problems and heuristics for beating the lower bound on common instances of the some-pairs class of problems.

preprint2016arXiv

Space Lower Bounds for Itemset Frequency Sketches

Given a database, computing the fraction of rows that contain a query itemset or determining whether this fraction is above some threshold are fundamental operations in data mining. A uniform sample of rows is a good sketch of the database in the sense that all sufficiently frequent itemsets and their approximate frequencies are recoverable from the sample, and the sketch size is independent of the number of rows in the original database. For many seemingly similar problems there are better sketching algorithms than uniform sampling. In this paper we show that for itemset frequency sketching this is not the case. That is, we prove that there exist classes of databases for which uniform sampling is a space optimal sketch for approximate itemset frequency analysis, up to constant or iterated-logarithmic factors.

preprint2016arXiv

Strong Hardness of Privacy from Weak Traitor Tracing

Despite much study, the computational complexity of differential privacy remains poorly understood. In this paper we consider the computational complexity of accurately answering a family $Q$ of statistical queries over a data universe $X$ under differential privacy. A statistical query on a dataset $D \in X^n$ asks "what fraction of the elements of $D$ satisfy a given predicate $p$ on $X$?" Dwork et al. (STOC'09) and Boneh and Zhandry (CRYPTO'14) showed that if both $Q$ and $X$ are of polynomial size, then there is an efficient differentially private algorithm that accurately answers all the queries, and if both $Q$ and $X$ are exponential size, then under a plausible assumption, no efficient algorithm exists. We show that, under the same assumption, if either the number of queries or the data universe is of exponential size, and the other has size at least $\tilde{O}(n^7)$, then there is no differentially private algorithm that answers all the queries. In both cases, the result is nearly quantitatively tight, since there is an efficient differentially private algorithm that answers $\tildeΩ(n^2)$ queries on an exponential size data universe, and one that answers exponentially many queries on a data universe of size $\tildeΩ(n^2)$. Our proofs build on the connection between hardness results in differential privacy and traitor-tracing schemes (Dwork et al., STOC'09; Ullman, STOC'13). We prove our hardness result for a polynomial size query set (resp., data universe) by showing that they follow from the existence of a special type of traitor-tracing scheme with very short ciphertexts (resp., secret keys), but very weak security guarantees, and then constructing such a scheme.

preprint2015arXiv

Algorithmic Stability for Adaptive Data Analysis

Adaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution $\mathbf{P}$ and a set of $n$ independent samples $\mathbf{x}$ is drawn from $\mathbf{P}$. We seek an algorithm that, given $\mathbf{x}$ as input, accurately answers a sequence of adaptively chosen queries about the unknown distribution $\mathbf{P}$. How many samples $n$ must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions: (i) We give upper bounds on the number of samples $n$ that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015, NIPS, 2015). (ii) We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries. As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stability implies low generalization error. We also study weaker stability guarantees such as bounded KL divergence and total variation distance.

preprint2015arXiv

Between Pure and Approximate Differential Privacy

We show a new lower bound on the sample complexity of $(\varepsilon, δ)$-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter $δ$, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy ($δ> 0$) and pure differential privacy ($δ= 0$). Specifically, we consider a database $D \in \{\pm1\}^{n \times d}$ and its \emph{one-way marginals}, which are the $d$ queries of the form "What fraction of individual records have the $i$-th bit set to $+1$?" We show that in order to answer all of these queries to within error $\pm α$ (on average) while satisfying $(\varepsilon, δ)$-differential privacy, it is necessary that $$ n \geq Ω\left( \frac{\sqrt{d \log(1/δ)}}{α\varepsilon} \right), $$ which is optimal up to constant factors. To prove our lower bound, we build on the connection between \emph{fingerprinting codes} and lower bounds in differential privacy (Bun, Ullman, and Vadhan, STOC'14). In addition to our lower bound, we give new purely and approximately differentially private algorithms for answering arbitrary statistical queries that improve on the sample complexity of the standard Laplace and Gaussian mechanisms for achieving worst-case accuracy guarantees by a logarithmic factor.

preprint2015arXiv

Inducing Approximately Optimal Flow Using Truthful Mediators

We revisit a classic coordination problem from the perspective of mechanism design: how can we coordinate a social welfare maximizing flow in a network congestion game with selfish players? The classical approach, which computes tolls as a function of known demands, fails when the demands are unknown to the mechanism designer, and naively eliciting them does not necessarily yield a truthful mechanism. Instead, we introduce a weak mediator that can provide suggested routes to players and set tolls as a function of reported demands. However, players can choose to ignore or misreport their type to this mediator. Using techniques from differential privacy, we show how to design a weak mediator such that it is an asymptotic ex-post Nash equilibrium for all players to truthfully report their types to the mediator and faithfully follow its suggestion, and that when they do, they end up playing a nearly optimal flow. Notably, our solution works in settings of incomplete information even in the absence of a prior distribution on player types. Along the way, we develop new techniques for privately solving convex programs which may be of independent interest.

preprint2015arXiv

Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery

We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given $n$ samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently studied by Dwork et al., who showed how to answer $\tildeΩ(n^2)$ queries efficiently, and also by Hardt and Ullman, who showed that answering $\tilde{O}(n^3)$ queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given $n$ samples from an unknown distribution, can give valid answers to $O(n^2)$ adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code. In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

preprint2015arXiv

Mechanism Design in Large Games: Incentives and Privacy

We study the problem of implementing equilibria of complete information games in settings of incomplete information, and address this problem using "recommender mechanisms." A recommender mechanism is one that does not have the power to enforce outcomes or to force participation, rather it only has the power to suggestion outcomes on the basis of voluntary participation. We show that despite these restrictions, recommender mechanisms can implement equilibria of complete information games in settings of incomplete information under the condition that the game is large---i.e. that there are a large number of players, and any player's action affects any other's payoff by at most a small amount. Our result follows from a novel application of differential privacy. We show that any algorithm that computes a correlated equilibrium of a complete information game while satisfying a variant of differential privacy---which we call joint differential privacy---can be used as a recommender mechanism while satisfying our desired incentive properties. Our main technical result is an algorithm for computing a correlated equilibrium of a large game while satisfying joint differential privacy. Although our recommender mechanisms are designed to satisfy game-theoretic properties, our solution ends up satisfying a strong privacy property as well. No group of players can learn "much" about the type of any player outside the group from the recommendations of the mechanism, even if these players collude in an arbitrary way. As such, our algorithm is able to implement equilibria of complete information games, without revealing information about the realized types.

preprint2015arXiv

More General Queries and Less Generalization Error in Adaptive Data Analysis

Adaptivity is an important feature of data analysis---typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC '15) and Hardt and Ullman (FOCS '14) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution $\mathcal{P}$ and a set of $n$ independent samples $x$ is drawn from $\mathcal{P}$. We seek an algorithm that, given $x$ as input, "accurately" answers a sequence of adaptively chosen "queries" about the unknown distribution $\mathcal{P}$. How many samples $n$ must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions towards resolving this question: *We give upper bounds on the number of samples $n$ that are needed to answer statistical queries that improve over the bounds of Dwork et al. *We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and the important class of convex risk minimization queries. As in Dwork et al., our algorithms are based on a connection between differential privacy and generalization error, but we feel that our analysis is simpler and more modular, which may be useful for studying these questions in the future.

preprint2015arXiv

Private Multiplicative Weights Beyond Linear Queries

A wide variety of fundamental data analyses in machine learning, such as linear and logistic regression, require minimizing a convex function defined by the data. Since the data may contain sensitive information about individuals, and these analyses can leak that sensitive information, it is important to be able to solve convex minimization in a privacy-preserving way. A series of recent results show how to accurately solve a single convex minimization problem in a differentially private manner. However, the same data is often analyzed repeatedly, and little is known about solving multiple convex minimization problems with differential privacy. For simpler data analyses, such as linear queries, there are remarkable differentially private algorithms such as the private multiplicative weights mechanism (Hardt and Rothblum, FOCS 2010) that accurately answer exponentially many distinct queries. In this work, we extend these results to the case of convex minimization and show how to give accurate and differentially private solutions to *exponentially many* convex minimization problems on a sensitive dataset.

preprint2015arXiv

Robust Mediators in Large Games

A mediator is a mechanism that can only suggest actions to players, as a function of all agents' reported types, in a given game of incomplete information. We study what is achievable by two kinds of mediators, "strong" and "weak." Players can choose to opt-out of using a strong mediator but cannot misrepresent their type if they opt-in. Such a mediator is "strong" because we can view it as having the ability to verify player types. Weak mediators lack this ability--- players are free to misrepresent their type to a weak mediator. We show a striking result---in a prior-free setting, assuming only that the game is large and players have private types, strong mediators can implement approximate equilibria of the complete-information game. If the game is a congestion game, then the same result holds using only weak mediators. Our result follows from a novel application of differential privacy, in particular, a variant we propose called joint differential privacy.

preprint2015arXiv

Watch and Learn: Optimizing from Revealed Preferences Feedback

A Stackelberg game is played between a leader and a follower. The leader first chooses an action, then the follower plays his best response. The goal of the leader is to pick the action that will maximize his payoff given the follower's best response. In this paper we present an approach to solving for the leader's optimal strategy in certain Stackelberg games where the follower's utility function (and thus the subsequent best response of the follower) is unknown. Stackelberg games capture, for example, the following interaction between a producer and a consumer. The producer chooses the prices of the goods he produces, and then a consumer chooses to buy a utility maximizing bundle of goods. The goal of the seller here is to set prices to maximize his profit---his revenue, minus the production cost of the purchased bundle. It is quite natural that the seller in this example should not know the buyer's utility function. However, he does have access to revealed preference feedback---he can set prices, and then observe the purchased bundle and his own profit. We give algorithms for efficiently solving, in terms of both computational and query complexity, a broad class of Stackelberg games in which the follower's utility function is unknown, using only "revealed preference" access to it. This class includes in particular the profit maximization problem, as well as the optimal tolling problem in nonatomic congestion games, when the latency functions are unknown. Surprisingly, we are able to solve these problems even though the optimization problems are non-convex in the leader's actions.

preprint2015arXiv

When Can Limited Randomness Be Used in Repeated Games?

The central result of classical game theory states that every finite normal form game has a Nash equilibrium, provided that players are allowed to use randomized (mixed) strategies. However, in practice, humans are known to be bad at generating random-like sequences, and true random bits may be unavailable. Even if the players have access to enough random bits for a single instance of the game their randomness might be insufficient if the game is played many times. In this work, we ask whether randomness is necessary for equilibria to exist in finitely repeated games. We show that for a large class of games containing arbitrary two-player zero-sum games, approximate Nash equilibria of the $n$-stage repeated version of the game exist if and only if both players have $Ω(n)$ random bits. In contrast, we show that there exists a class of games for which no equilibrium exists in pure strategies, yet the $n$-stage repeated version of the game has an exact Nash equilibrium in which each player uses only a constant number of random bits. When the players are assumed to be computationally bounded, if cryptographic pseudorandom generators (or, equivalently, one-way functions) exist, then the players can base their strategies on "random-like" sequences derived from only a small number of truly random bits. We show that, in contrast, in repeated two-player zero-sum games, if pseudorandom generators \emph{do not} exist, then $Ω(n)$ random bits remain necessary for equilibria to exist.

preprint2014arXiv

An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring

We study infinitely repeated games in settings of imperfect monitoring. We first prove a family of theorems that show that when the signals observed by the players satisfy a condition known as $(ε, γ)$-differential privacy, that the folk theorem has little bite: for values of $ε$ and $γ$ sufficiently small, for a fixed discount factor, any equilibrium of the repeated game involve players playing approximate equilibria of the stage game in every period. Next, we argue that in large games ($n$ player games in which unilateral deviations by single players have only a small impact on the utility of other players), many monitoring settings naturally lead to signals that satisfy $(ε,γ)$-differential privacy, for $ε$ and $γ$ tending to zero as the number of players $n$ grows large. We conclude that in such settings, the set of equilibria of the repeated game collapse to the set of equilibria of the stage game.

preprint2014arXiv

Faster Algorithms for Privately Releasing Marginals

We study the problem of releasing $k$-way marginals of a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of $D$'s records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS '07). We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases a private summary capable of answering any $k$-way marginal query with at most $\pm .01$ error on every query as long as $n \geq d^{O(\sqrt{k})}$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of $k$-way marginal queries, which is $d^{Θ(k)}$ (for $k \ll d$).

preprint2014arXiv

Preventing False Discovery in Interactive Data Analysis is Hard

We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given $n$ samples from an unknown distribution can give valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution. Our result stands in stark contrast to the well known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, a recent work by Dwork et al. shows how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm; and how to answer a quadratic number of adaptive queries via a computationally efficient algorithm. The latter result implies that our result is tight up to a linear factor in $n.$ Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in Statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent.

preprint2013arXiv

Answering n^{2+o(1)} Counting Queries with Differential Privacy is Hard

A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database D in ({0,1}^d)^n, and k = n^{2+o(1)} arbitrary efficiently computable counting queries, runs in time poly(d, n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant depth (AC-0) circuits. Our result is almost tight in the sense that nearly n^2 counting queries can be answered efficiently while satisfying differential privacy. Moreover, super-polynomially many queries can be answered in exponential time. We prove our results by extending the connection between differentially private counting query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.

preprint2013arXiv

Faster Private Release of Marginals on Small Databases

We study the problem of answering \emph{$k$-way marginal} queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$ \min{\exp(d^{1-Ω(1/\sqrt{k})}), \exp(d / \log^{.99} d)\} $$ per query and answers any (possibly superpolynomially long and adaptively chosen) sequence of $k$-way marginal queries up to error at most $\pm .01$ on every query, provided $n \gtrsim d^{.51} $. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee on a database of size $\poly(d, k)$ in time $\exp(o(d))$. Our algorithms are a variant of the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10), but using a different low-weight representation of the database. We derive our low-weight representation using approximations to the OR function by low-degree polynomials with coefficients of bounded $L_1$-norm. We also prove a strong limitation on our approach that is of independent approximation-theoretic interest. Specifically, we show that for any $k = o(\log d)$, any polynomial with coefficients of $L_1$-norm $poly(d)$ that pointwise approximates the $d$-variate OR function on all inputs of Hamming weight at most $k$ must have degree $d^{1-O(1/\sqrt{k})}$.

preprint2011arXiv

Iterative Constructions and Private Data Release

In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|^{1.5}). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

preprint2011arXiv

On the Zero-Error Capacity Threshold for Deletion Channels

We consider the zero-error capacity of deletion channels. Specifically, we consider the setting where we choose a codebook ${\cal C}$ consisting of strings of $n$ bits, and our model of the channel corresponds to an adversary who may delete up to $pn$ of these bits for a constant $p$. Our goal is to decode correctly without error regardless of the actions of the adversary. We consider what values of $p$ allow non-zero capacity in this setting. We suggest multiple approaches, one of which makes use of the natural connection between this problem and the problem of finding the expected length of the longest common subsequence of two random sequences.

preprint2011arXiv

Privately Releasing Conjunctions and the Statistical Query Barrier

Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? + We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern. + We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to the first algorithm that efficiently releases differentially private answers to of all Boolean conjunctions with 1% average error. This presents significant progress on a key open problem in privacy-preserving data analysis. Our first result on the other hand gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms.

Jonathan Ullman

What is connected

Connect this record

See the researcher in context

Building this map preview

33 published item(s)

Testable and Actionable Calibration for Full Swap Regret

A Private and Computationally-Efficient Estimator for Unbounded Gaussians

Fair and Useful Cohort Selection

How to Combine Membership-Inference Attacks on Multiple Updated Models

Multitask Learning via Shared Features: Algorithms and Hardness

Private Identity Testing for High-Dimensional Distributions

Private Mean Estimation of Heavy-Tailed Distributions

A Primer on Private Statistics

Auditing Differentially Private Machine Learning: How Private is Private SGD?

Efficient Private Algorithms for Learning Large-Margin Halfspaces

Private Query Release Assisted by Public Data

Make Up Your Mind: The Price of Online Queries in Differential Privacy

Some Pairs Problems

Space Lower Bounds for Itemset Frequency Sketches

Strong Hardness of Privacy from Weak Traitor Tracing

Algorithmic Stability for Adaptive Data Analysis

Between Pure and Approximate Differential Privacy

Inducing Approximately Optimal Flow Using Truthful Mediators

Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery

Mechanism Design in Large Games: Incentives and Privacy

More General Queries and Less Generalization Error in Adaptive Data Analysis

Private Multiplicative Weights Beyond Linear Queries

Robust Mediators in Large Games

Watch and Learn: Optimizing from Revealed Preferences Feedback

When Can Limited Randomness Be Used in Repeated Games?

An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring

Faster Algorithms for Privately Releasing Marginals

Preventing False Discovery in Interactive Data Analysis is Hard

Answering n^{2+o(1)} Counting Queries with Differential Privacy is Hard

Faster Private Release of Marginals on Small Databases

Iterative Constructions and Private Data Release

On the Zero-Error Capacity Threshold for Deletion Channels

Privately Releasing Conjunctions and the Statistical Query Barrier