Source author record

Ilias Diakonikolas

Ilias Diakonikolas appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning math.ST Statistics Theory Computational Complexity math.PR Information Theory math.IT Computer Science and Game Theory Discrete Mathematics Computational Geometry Databases math.OC

Catalog footprint

What is connected

56works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Cryptographic Hardness of Learning Halfspaces with Massart Noise

We study the complexity of PAC learning halfspaces in the presence of Massart noise. In this problem, we are given i.i.d. labeled examples $(\mathbf{x}, y) \in \mathbb{R}^N \times \{ \pm 1\}$, where the distribution of $\mathbf{x}$ is arbitrary and the label $y$ is a Massart corruption of $f(\mathbf{x})$, for an unknown halfspace $f: \mathbb{R}^N \to \{ \pm 1\}$, with flipping probability $η(\mathbf{x}) \leq η< 1/2$. The goal of the learner is to compute a hypothesis with small 0-1 error. Our main result is the first computational hardness result for this learning problem. Specifically, assuming the (widely believed) subexponential-time hardness of the Learning with Errors (LWE) problem, we show that no polynomial-time Massart halfspace learner can achieve error better than $Ω(η)$, even if the optimal 0-1 error is small, namely $\mathrm{OPT} = 2^{-\log^{c} (N)}$ for any universal constant $c \in (0, 1)$. Prior work had provided qualitatively similar evidence of hardness in the Statistical Query model. Our computational hardness result essentially resolves the polynomial PAC learnability of Massart halfspaces, by showing that known efficient learning algorithms for the problem are nearly best possible.

preprint2022arXiv

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapstoσ(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $σ:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=ε$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(σ(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbb{w})=C\, ε$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/ε)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\, \polylog(1/ε))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tildeΩ(d/ε)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.

preprint2022arXiv

Near-Optimal Bounds for Testing Histogram Distributions

We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, $k$-histograms over $[n]$, are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\varepsilon$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde Θ(\sqrt{nk} / \varepsilon + k / \varepsilon^2 + \sqrt{n} / \varepsilon^2)$.

preprint2022arXiv

Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models

We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $ε$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(ε\sqrt{\log(1/ε)})$. Similarly, we show that no efficient SQ algorithm with access to an $ε$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(ε\log(1/ε))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.

preprint2022arXiv

ReLU Regression with Massart Noise

We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $η< 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.

preprint2021arXiv

Agnostic Proper Learning of Halfspaces under Gaussian Marginals

We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other non-linear activations, yielding in particular the first proper agnostic algorithm for ReLU regression.

preprint2021arXiv

Outlier-Robust Learning of Ising Models Under Dobrushin's Condition

We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-$2$ polynomials of Ising models that may be of independent interest.

preprint2021arXiv

The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals

We study the problem of agnostic learning under the Gaussian distribution. We develop a method for finding hard families of examples for a wide class of problems by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-regression algorithm is essentially best possible, and therefore that the computational difficulty of agnostically learning a concept class is closely related to the polynomial degree required to approximate any function from the class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids.

preprint2020arXiv

Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks

We study the problem of PAC learning one-hidden-layer ReLU networks with $k$ hidden units on $\mathbb{R}^d$ under Gaussian marginals in the presence of additive label noise. For the case of positive coefficients, we give the first polynomial-time algorithm for this learning problem for $k$ up to $\tilde{O}(\sqrt{\log d})$. Previously, no polynomial time algorithm was known, even for $k=3$. This answers an open question posed by~\cite{Kliv17}. Importantly, our algorithm does not require any assumptions about the rank of the weight matrix and its complexity is independent of its condition number. On the negative side, for the more general task of PAC learning one-hidden-layer ReLU networks with arbitrary real coefficients, we prove a Statistical Query lower bound of $d^{Ω(k)}$. Thus, we provide a separation between the two classes in terms of efficient learnability. Our upper and lower bounds are general, extending to broader families of activation functions.

preprint2020arXiv

Efficient Algorithms for Multidimensional Segmented Regression

We study the fundamental problem of fixed design {\em multidimensional segmented regression}: Given noisy samples from a function $f$, promised to be piecewise linear on an unknown set of $k$ rectangles, we want to recover $f$ up to a desired accuracy in mean-squared error. We provide the first sample and computationally efficient algorithm for this problem in any fixed dimension. Our algorithm relies on a simple iterative merging approach, which is novel in the multidimensional setting. Our experimental evaluation on both synthetic and real datasets shows that our algorithm is competitive and in some cases outperforms state-of-the-art heuristics. Code of our implementation is available at \url{https://github.com/avoloshinov/multidimensional-segmented-regression}.

preprint2020arXiv

Efficiently Learning Adversarially Robust Halfspaces with Noise

We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to any $\ell_p$-perturbation.

preprint2020arXiv

High-Dimensional Robust Mean Estimation via Gradient Descent

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks.

preprint2020arXiv

Learning Halfspaces with Massart Noise Under Structured Distributions

We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: We identify a smooth {\em non-convex} surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace. Given this structural result, we can use SGD to solve the underlying learning problem.

preprint2020arXiv

Learning Halfspaces with Tsybakov Noise

We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flipping probabilities to be arbitrarily close to $1/2$ for a fraction of the samples. Our main result is the first non-trivial PAC learning algorithm for this problem under a broad family of structured distributions -- satisfying certain concentration and (anti-)anti-concentration properties -- including log-concave distributions. Specifically, we given an algorithm that achieves misclassification error $ε$ with respect to the true halfspace, with quasi-polynomial runtime dependence in $1/\epsilin$. The only previous upper bound for this problem -- even for the special case of log-concave distributions -- was doubly exponential in $1/ε$ (and follows via the naive reduction to agnostic learning). Our approach relies on a novel computationally efficient procedure to certify whether a candidate solution is near-optimal, based on semi-definite programming. We use this certificate procedure as a black-box and turn it into an efficient learning algorithm by searching over the space of halfspaces via online convex optimization.

preprint2020arXiv

List-Decodable Mean Estimation via Iterative Multi-Filtering

We study the problem of {\em list-decodable mean estimation} for bounded covariance distributions. Specifically, we are given a set $T$ of points in $\mathbb{R}^d$ with the promise that an unknown $α$-fraction of points in $T$, where $0< α< 1/2$, are drawn from an unknown mean and bounded covariance distribution $D$, and no assumptions are made on the remaining points. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the mean of $D$. We give the first practically viable estimator for this problem. In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error. While the only prior algorithm for this setting inherently relied on the ellipsoid method, our algorithm is iterative and only uses spectral techniques. Our main technical innovation is the design of a soft outlier removal procedure for high-dimensional heavy-tailed datasets with a majority of outliers.

preprint2020arXiv

Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals

We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals. In the former problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\mathbb{R}^d \times \{ \pm 1\}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with 0-1 loss $\mathrm{OPT}+ε$, where $\mathrm{OPT}$ is the 0-1 loss of the best-fitting halfspace. In the latter problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\mathbb{R}^d \times \mathbb{R}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with square loss $\mathrm{OPT}+ε$, where $\mathrm{OPT}$ is the square loss of the best-fitting ReLU. We prove Statistical Query (SQ) lower bounds of $d^{\mathrm{poly}(1/ε)}$ for both of these problems. Our SQ lower bounds provide strong evidence that current upper bounds for these tasks are essentially best possible.

preprint2020arXiv

Non-Convex SGD Learns Halfspaces with Adversarial Label Noise

We study the problem of agnostically learning homogeneous halfspaces in the distribution-specific PAC model. For a broad family of structured distributions, including log-concave distributions, we show that non-convex SGD efficiently converges to a solution with misclassification error $O(\opt)+\eps$, where $\opt$ is the misclassification error of the best-fitting halfspace. In sharp contrast, we show that optimizing any convex surrogate inherently leads to misclassification error of $ω(\opt)$, even under Gaussian marginals.

preprint2020arXiv

Optimal Testing of Discrete Distributions with High Probability

We study the problem of testing discrete distributions with a focus on the high probability regime. Specifically, given samples from one or more discrete distributions, a property $\mathcal{P}$, and parameters $0< ε, δ<1$, we want to distinguish {\em with probability at least $1-δ$} whether these distributions satisfy $\mathcal{P}$ or are $ε$-far from $\mathcal{P}$ in total variation distance. Most prior work in distribution testing studied the constant confidence case (corresponding to $δ= Ω(1)$), and provided sample-optimal testers for a range of properties. While one can always boost the confidence probability of any such tester by black-box amplification, this generic boosting method typically leads to sub-optimal sample bounds. Here we study the following broad question: For a given property $\mathcal{P}$, can we {\em characterize} the sample complexity of testing $\mathcal{P}$ as a function of all relevant problem parameters, including the error probability $δ$? Prior to this work, uniformity testing was the only statistical task whose sample complexity had been characterized in this setting. As our main results, we provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors, as a function of all relevant parameters. We also show matching information-theoretic lower bounds on the sample complexity of these problems. Our techniques naturally extend to give optimal testers for related problems. To illustrate the generality of our methods, we give optimal algorithms for testing collections of distributions and testing closeness with unequal sized samples.

preprint2020arXiv

Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how to accelerate queries involving aggregates on a real dataset with speedups of up to 124x over traditional AQP-with-guarantees and more than 1000x over exact methods.

preprint2020arXiv

Robustly Learning any Clusterable Mixture of Gaussians

We study the efficient learnability of high-dimensional Gaussian mixtures in the outlier-robust setting, where a small constant fraction of the data is adversarially corrupted. We resolve the polynomial learnability of this problem when the components are pairwise separated in total variation distance. Specifically, we provide an algorithm that, for any constant number of components $k$, runs in polynomial time and learns the components of an $ε$-corrupted $k$-mixture within information theoretically near-optimal error of $\tilde{O}(ε)$, under the assumption that the overlap between any pair of components $P_i, P_j$ (i.e., the quantity $1-TV(P_i, P_j)$) is bounded by $\mathrm{poly}(ε)$. Our separation condition is the qualitatively weakest assumption under which accurate clustering of the samples is possible. In particular, it allows for components with arbitrary covariances and for components with identical means, as long as their covariances differ sufficiently. Ours is the first polynomial time algorithm for this problem, even for $k=2$. Our algorithm follows the Sum-of-Squares based proofs to algorithms approach. Our main technical contribution is a new robust identifiability proof of clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system. The key ingredients of this proof are a novel use of SoS-certifiable anti-concentration and a new characterization of pairs of Gaussians with small (dimension-independent) overlap in terms of their parameter distance.

preprint2020arXiv

Testing Bayesian Networks

This work initiates a systematic investigation of testing high-dimensional structured distributions by focusing on testing Bayesian networks -- the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other non-descendant nodes once its parents are fixed. Specifically, we study the properties of identity testing and closeness testing of Bayesian networks. Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. For a wide range of parameter settings, our testing algorithms have sample complexity sublinear in the dimension and are sample-optimal, up to constant factors.

preprint2020arXiv

The Complexity of Adversarially Robust Proper Learning of Halfspaces with Agnostic Noise

We study the computational complexity of adversarially robust proper learning of halfspaces in the distribution-independent agnostic PAC model, with a focus on $L_p$ perturbations. We give a computationally efficient learning algorithm and a nearly matching computational hardness result for this problem. An interesting implication of our findings is that the $L_{\infty}$ perturbations case is provably computationally harder than the case $2 \leq p < \infty$.

preprint2020arXiv

The Sample Complexity of Robust Covariance Testing

We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber's contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-ε) X + εB$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, Σ)$, $B$ is a fixed but unknown noise distribution, and $ε>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $Σ$ is the identity matrix versus $γ$-far from the identity in Frobenius norm. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $Ω(d^2)$ for $ε$ an arbitrarily small constant and $γ= 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation.

preprint2016arXiv

A New Approach for Testing Properties of Discrete Distributions

In this work, we give a novel general approach for distribution testing. We describe two techniques: our first technique gives sample-optimal testers, while our second technique gives matching sample lower bounds. As a consequence, we resolve the sample complexity of a wide variety of testing problems. Our upper bounds are obtained via a modular reduction-based approach. Our approach yields optimal testers for numerous problems by using a standard $\ell_2$-identity tester as a black-box. Using this recipe, we obtain simple estimators for a wide range of problems, encompassing most problems previously studied in the TCS literature, namely: (1) identity testing to a fixed distribution, (2) closeness testing between two unknown distributions (with equal/unequal sample sizes), (3) independence testing (in any number of dimensions), (4) closeness testing for collections of distributions, and (5) testing histograms. For all of these problems, our testers are sample-optimal, up to constant factors. With the exception of (1), ours are the {\em first sample-optimal testers for the corresponding problems.} Moreover, our estimators are significantly simpler to state and analyze compared to previous results. As an application of our reduction-based technique, we obtain the first {\em nearly instance-optimal} algorithm for testing equivalence between two {\em unknown} distributions. Moreover, our technique naturally generalizes to other metrics beyond the $\ell_1$-distance. Our lower bounds are obtained via a direct information-theoretic approach: Given a candidate hard instance, our proof proceeds by bounding the mutual information between appropriate random variables. While this is a classical method in information theory, prior to our work, it had not been used in distribution property testing.

preprint2016arXiv

Collision-based Testers are Optimal for Uniformity and Closeness

We study the fundamental problems of (i) uniformity testing of a discrete distribution, and (ii) closeness testing between two discrete distributions with bounded $\ell_2$-norm. These problems have been extensively studied in distribution testing and sample-optimal estimators are known for them~\cite{Paninski:08, CDVV14, VV14, DKN:15}. In this work, we show that the original collision-based testers proposed for these problems ~\cite{GRdist:00, BFR+:00} are sample-optimal, up to constant factors. Previous analyses showed sample complexity upper bounds for these testers that are optimal as a function of the domain size $n$, but suboptimal by polynomial factors in the error parameter $ε$. Our main contribution is a new tight analysis establishing that these collision-based testers are information-theoretically optimal, up to constant factors, both in the dependence on $n$ and in the dependence on $ε$.

preprint2016arXiv

Fast Algorithms for Segmented Regression

We study the fixed design segmented regression problem: Given noisy samples from a piecewise linear function $f$, we want to recover $f$ up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample near-linear time algorithms for the problem that -- while not being minimax optimal -- achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of $2$ to $4$, while achieving speedups of three orders of magnitude.

preprint2016arXiv

Near-Optimal Disjoint-Path Facility Location Through Set Cover by Pairs

In this paper we consider two special cases of the "cover-by-pairs" optimization problem that arise when we need to place facilities so that each customer is served by two facilities that reach it by disjoint shortest paths. These problems arise in a network traffic monitoring scheme proposed by Breslau et al. and have potential applications to content distribution. The "set-disjoint" variant applies to networks that use the OSPF routing protocol, and the "path-disjoint" variant applies when MPLS routing is enabled, making better solutions possible at the cost of greater operational expense. Although we can prove that no polynomial-time algorithm can guarantee good solutions for either version, we are able to provide heuristics that do very well in practice on instances with real-world network structure. Fast implementations of the heuristics, made possible by exploiting mathematical observations about the relationship between the network instances and the corresponding instances of the cover-by-pairs problem, allow us to perform an extensive experimental evaluation of the heuristics and what the solutions they produce tell us about the effectiveness of the proposed monitoring scheme. For the set-disjoint variant, we validate our claim of near-optimality via a new lower-bounding integer programming formulation. Although computing this lower bound requires solving the NP-hard Hitting Set problem and can underestimate the optimal value by a linear factor in the worst case, it can be computed quickly by CPLEX, and it equals the optimal solution value for all the instances in our extensive testbed.

preprint2016arXiv

Playing Anonymous Games using Simple Strategies

We investigate the complexity of computing approximate Nash equilibria in anonymous games. Our main algorithmic result is the following: For any $n$-player anonymous game with a bounded number of strategies and any constant $δ>0$, an $O(1/n^{1-δ})$-approximate Nash equilibrium can be computed in polynomial time. Complementing this positive result, we show that if there exists any constant $δ>0$ such that an $O(1/n^{1+δ})$-approximate equilibrium can be computed in polynomial time, then there is a fully polynomial-time approximation scheme for this problem. We also present a faster algorithm that, for any $n$-player $k$-strategy anonymous game, runs in time $\tilde O((n+k) k n^k)$ and computes an $\tilde O(n^{-1/3} k^{11/3})$-approximate equilibrium. This algorithm follows from the existence of simple approximate equilibria of anonymous games, where each player plays one strategy with probability $1-δ$, for some small $δ$, and plays uniformly at random with probability $δ$. Our approach exploits the connection between Nash equilibria in anonymous games and Poisson multinomial distributions (PMDs). Specifically, we prove a new probabilistic lemma establishing the following: Two PMDs, with large variance in each direction, whose first few moments are approximately matching are close in total variation distance. Our structural result strengthens previous work by providing a smooth tradeoff between the variance bound and the number of matching moments.

preprint2016arXiv

Testing Shape Restrictions of Discrete Distributions

We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D\in\mathcal{P}$ and $\ell_1(D,\mathcal{P})>\varepsilon$. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, $t$-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.

preprint2016arXiv

The Fourier Transform of Poisson Multinomial Distributions and its Algorithmic Applications

An $(n, k)$-Poisson Multinomial Distribution (PMD) is a random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent random vectors supported on the set of standard basis vectors in $\mathbb{R}^k.$ In this paper, we obtain a refined structural understanding of PMDs by analyzing their Fourier transform. As our core structural result, we prove that the Fourier transform of PMDs is {\em approximately sparse}, i.e., roughly speaking, its $L_1$-norm is small outside a small set. By building on this result, we obtain the following applications: {\bf Learning Theory.} We design the first computationally efficient learning algorithm for PMDs with respect to the total variation distance. Our algorithm learns an arbitrary $(n, k)$-PMD within variation distance $ε$ using a near-optimal sample size of $\widetilde{O}_k(1/ε^2),$ and runs in time $\widetilde{O}_k(1/ε^2) \cdot \log n.$ Previously, no algorithm with a $\mathrm{poly}(1/ε)$ runtime was known, even for $k=3.$ {\bf Game Theory.} We give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games. For normalized anonymous games with $n$ players and $k$ strategies, our algorithm computes a well-supported $ε$-Nash equilibrium in time $n^{O(k^3)} \cdot (k/ε)^{O(k^3\log(k/ε)/\log\log(k/ε))^{k-1}}.$ The best previous algorithm for this problem had running time $n^{(f(k)/ε)^k},$ where $f(k) = Ω(k^{k^2})$, for any $k>2.$ {\bf Statistics.} We prove a multivariate central limit theorem (CLT) that relates an arbitrary PMD to a discretized multivariate Gaussian with the same mean and covariance, in total variation distance. Our new CLT strengthens the CLT of Valiant and Valiant by completely removing the dependence on $n$ in the error bound.

preprint2015arXiv

Learning Poisson Binomial Distributions

We consider a basic problem in unsupervised learning: learning an unknown \emph{Poisson Binomial Distribution}. A Poisson Binomial Distribution (PBD) over $\{0,1,\dots,n\}$ is the distribution of a sum of $n$ independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 \cite{Poisson:37} and are a natural $n$-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal. We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to $\eps$-accuracy (with respect to the total variation distance) using $\tilde{O}(1/\eps^3)$ samples \emph{independent of $n$}. The running time of the algorithm is \emph{quasilinear} in the size of its input data, i.e., $\tilde{O}(\log(n)/\eps^3)$ bit-operations. (Observe that each draw from the distribution is a $\log(n)$-bit string.) Our second main result is a {\em proper} learning algorithm that learns to $\eps$-accuracy using $\tilde{O}(1/\eps^2)$ samples, and runs in time $(1/\eps)^{\poly (\log (1/\eps))} \cdot \log n$. This is nearly optimal, since any algorithm {for this problem} must use $Ω(1/\eps^2)$ samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.

preprint2015arXiv

Optimal Algorithms and Lower Bounds for Testing Closeness of Structured Distributions

We give a general unified method that can be used for $L_1$ {\em closeness testing} of a wide range of univariate structured distribution families. More specifically, we design a sample optimal and computationally efficient algorithm for testing the equivalence of two unknown (potentially arbitrary) univariate distributions under the $\mathcal{A}_k$-distance metric: Given sample access to distributions with density functions $p, q: I \to \mathbb{R}$, we want to distinguish between the cases that $p=q$ and $\|p-q\|_{\mathcal{A}_k} \ge ε$ with probability at least $2/3$. We show that for any $k \ge 2, ε>0$, the {\em optimal} sample complexity of the $\mathcal{A}_k$-closeness testing problem is $Θ(\max\{ k^{4/5}/ε^{6/5}, k^{1/2}/ε^2 \})$. This is the first $o(k)$ sample algorithm for this problem, and yields new, simple $L_1$ closeness testers, in most cases with optimal sample complexity, for broad classes of structured distributions.

preprint2015arXiv

Optimal Learning via the Fourier Transform for Sums of Independent Integer Random Variables

We study the structure and learnability of sums of independent integer random variables (SIIRVs). For $k \in \mathbb{Z}_{+}$, a $k$-SIIRV of order $n \in \mathbb{Z}_{+}$ is the probability distribution of the sum of $n$ independent random variables each supported on $\{0, 1, \dots, k-1\}$. We denote by ${\cal S}_{n,k}$ the set of all $k$-SIIRVs of order $n$. In this paper, we tightly characterize the sample and computational complexity of learning $k$-SIIRVs. More precisely, we design a computationally efficient algorithm that uses $\widetilde{O}(k/ε^2)$ samples, and learns an arbitrary $k$-SIIRV within error $ε,$ in total variation distance. Moreover, we show that the {\em optimal} sample complexity of this learning problem is $Θ((k/ε^2)\sqrt{\log(1/ε)}).$ Our algorithm proceeds by learning the Fourier transform of the target $k$-SIIRV in its effective support. Its correctness relies on the {\em approximate sparsity} of the Fourier transform of $k$-SIIRVs -- a structural property that we establish, roughly stating that the Fourier transform of $k$-SIIRVs has small magnitude outside a small set. Along the way we prove several new structural results about $k$-SIIRVs. As one of our main structural contributions, we give an efficient algorithm to construct a sparse {\em proper} $ε$-cover for ${\cal S}_{n,k},$ in total variation distance. We also obtain a novel geometric characterization of the space of $k$-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of $ε$-covers for ${\cal S}_{n,k}$, and is the key ingredient in our tight sample complexity lower bound. Our approach of exploiting the sparsity of the Fourier transform in distribution learning is general, and has recently found additional applications.

preprint2015arXiv

Properly Learning Poisson Binomial Distributions in Almost Polynomial Time

We give an algorithm for properly learning Poisson binomial distributions. A Poisson binomial distribution (PBD) of order $n$ is the discrete probability distribution of the sum of $n$ mutually independent Bernoulli random variables. Given $\widetilde{O}(1/ε^2)$ samples from an unknown PBD $\mathbf{p}$, our algorithm runs in time $(1/ε)^{O(\log \log (1/ε))}$, and outputs a hypothesis PBD that is $ε$-close to $\mathbf{p}$ in total variation distance. The previously best known running time for properly learning PBDs was $(1/ε)^{O(\log(1/ε))}$. As one of our main contributions, we provide a novel structural characterization of PBDs. We prove that, for all $ε>0,$ there exists an explicit collection $\cal{M}$ of $(1/ε)^{O(\log \log (1/ε))}$ vectors of multiplicities, such that for any PBD $\mathbf{p}$ there exists a PBD $\mathbf{q}$ with $O(\log(1/ε))$ distinct parameters whose multiplicities are given by some element of ${\cal M}$, such that $\mathbf{q}$ is $ε$-close to $\mathbf{p}$. Our proof combines tools from Fourier analysis and algebraic geometry. Our approach to the proper learning problem is as follows: Starting with an accurate non-proper hypothesis, we fit a PBD to this hypothesis. More specifically, we essentially start with the hypothesis computed by the computationally efficient non-proper learning algorithm in our recent work~\cite{DKS15}. Our aforementioned structural characterization allows us to reduce the corresponding fitting problem to a collection of $(1/ε)^{O(\log \log(1/ε))}$ systems of low-degree polynomial inequalities. We show that each such system can be solved in time $(1/ε)^{O(\log \log(1/ε))}$, which yields the overall running time of our algorithm.

preprint2015arXiv

Sample-Optimal Density Estimation in Nearly-Linear Time

We design a new, fast algorithm for agnostically learning univariate probability distributions whose densities are well approximated by piecewise polynomial functions. Let $f$ be the density function of an arbitrary univariate distribution, and suppose that $f$ is $\mathrm{OPT}$-close in $L_1$-distance to an unknown piecewise polynomial function with $t$ interval pieces and degree $d$. Our algorithm draws $n = O(t(d+1)/ε^2)$ samples from $f$, runs in time $\tilde{O}(n \cdot \mathrm{poly}(d))$, and with probability at least $9/10$ outputs an $O(t)$-piecewise degree-$d$ hypothesis $h$ that is $4 \cdot \mathrm{OPT} +ε$ close to $f$. Our general algorithm yields (nearly) sample-optimal and nearly-linear time estimators for a wide range of structured distribution families over both continuous and discrete domains in a unified way. For most of our applications, these are the first sample-optimal and nearly-linear time estimators in the literature. As a consequence, our work resolves the sample and computational complexities of a broad class of inference tasks via a single "meta-algorithm". Moreover, we experimentally demonstrate that our algorithm performs very well in practice. Our algorithm consists of three "levels": (i) At the top level, we employ an iterative greedy algorithm for finding a good partition of the real line into the pieces of a piecewise polynomial. (ii) For each piece, we show that the sub-problem of finding a good polynomial fit on the current interval can be solved efficiently with a separation oracle method. (iii) We reduce the task of finding a separating hyperplane to a combinatorial problem and give an efficient algorithm for this problem. Combining these three procedures gives a density estimation algorithm with the claimed guarantees.

preprint2014arXiv

Learning $k$-Modal Distributions via Testing

A $k$-modal probability distribution over the discrete domain $\{1,...,n\}$ is one whose histogram has at most $k$ "peaks" and "valleys." Such distributions are natural generalizations of monotone ($k=0$) and unimodal ($k=1$) probability distributions, which have been intensively studied in probability theory and statistics. In this paper we consider the problem of \emph{learning} (i.e., performing density estimation of) an unknown $k$-modal distribution with respect to the $L_1$ distance. The learning algorithm is given access to independent samples drawn from an unknown $k$-modal distribution $p$, and it must output a hypothesis distribution $\widehat{p}$ such that with high probability the total variation distance between $p$ and $\widehat{p}$ is at most $ε.$ Our main goal is to obtain \emph{computationally efficient} algorithms for this problem that use (close to) an information-theoretically optimal number of samples. We give an efficient algorithm for this problem that runs in time $\mathrm{poly}(k,\log(n),1/ε)$. For $k \leq \tilde{O}(\log n)$, the number of samples used by our algorithm is very close (within an $\tilde{O}(\log(1/ε))$ factor) to being information-theoretically optimal. Prior to this work computationally efficient algorithms were known only for the cases $k=0,1$ \cite{Birge:87b,Birge:97}. A novel feature of our approach is that our learning algorithm crucially uses a new algorithm for \emph{property testing of probability distributions} as a key subroutine. The learning algorithm uses the property tester to efficiently decompose the $k$-modal distribution into $k$ (near-)monotone distributions, which are easier to learn.

preprint2014arXiv

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of {\em density estimation}, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $ε$, we give an algorithm that makes $\tilde{O}(k/ε^2)$ draws from $p$, runs in $\tilde{O}(k/ε^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/ε))$ pieces. With high probability the hypothesis $h$ satisfies $d_{\mathrm{TV}}(p,h) \leq C \cdot \mathrm{opt}_k(p) + ε$, where $d_{\mathrm{TV}}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The "approximation factor" $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $ε$ can achieve $C<2$ regardless of what kind of hypothesis distribution it uses.

preprint2014arXiv

Testing Identity of Structured Distributions

We study the question of identity testing for structured distributions. More precisely, given samples from a {\em structured} distribution $q$ over $[n]$ and an explicit distribution $p$ over $[n]$, we wish to distinguish whether $q=p$ versus $q$ is at least $ε$-far from $p$, in $L_1$ distance. In this work, we present a unified approach that yields new, simple testers, with sample complexity that is information-theoretically optimal, for broad classes of structured distributions, including $t$-flat distributions, $t$-modal distributions, log-concave distributions, monotone hazard rate (MHR) distributions, and mixtures thereof.

preprint2013arXiv

A Polynomial-time Approximation Scheme for Fault-tolerant Distributed Storage

We consider a problem which has received considerable attention in systems literature because of its applications to routing in delay tolerant networks and replica placement in distributed storage systems. In abstract terms the problem can be stated as follows: Given a random variable $X$ generated by a known product distribution over $\{0,1\}^n$ and a target value $0 \leq θ\leq 1$, output a non-negative vector $w$, with $\|w\|_1 \le 1$, which maximizes the probability of the event $w \cdot X \ge θ$. This is a challenging non-convex optimization problem for which even computing the value $\Pr[w \cdot X \ge θ]$ of a proposed solution vector $w$ is #P-hard. We provide an additive EPTAS for this problem which, for constant-bounded product distributions, runs in $ \poly(n) \cdot 2^{\poly(1/\eps)}$ time and outputs an $\eps$-approximately optimal solution vector $w$ for this problem. Our approach is inspired by, and extends, recent structural results from the complexity-theoretic study of linear threshold functions. Furthermore, in spite of the objective function being non-smooth, we give a \emph{unicriterion} PTAS while previous work for such objective functions has typically led to a \emph{bicriterion} PTAS. We believe our techniques may be applicable to get unicriterion PTAS for other non-smooth objective functions.

preprint2013arXiv

A robust Khintchine inequality, and algorithms for computing optimal constants in Fourier analysis and high-dimensional geometry

This paper makes two contributions towards determining some well-studied optimal constants in Fourier analysis \newa{of Boolean functions} and high-dimensional geometry. \begin{enumerate} \item It has been known since 1994 \cite{GL:94} that every linear threshold function has squared Fourier mass at least 1/2 on its degree-0 and degree-1 coefficients. Denote the minimum such Fourier mass by $\w^{\leq 1}[\ltf]$, where the minimum is taken over all $n$-variable linear threshold functions and all $n \ge 0$. Benjamini, Kalai and Schramm \cite{BKS:99} have conjectured that the true value of $\w^{\leq 1}[\ltf]$ is $2/π$. We make progress on this conjecture by proving that $\w^{\leq 1}[\ltf] \geq 1/2 + c$ for some absolute constant $c>0$. The key ingredient in our proof is a "robust" version of the well-known Khintchine inequality in functional analysis, which we believe may be of independent interest. \item We give an algorithm with the following property: given any $η> 0$, the algorithm runs in time $2^{\poly(1/η)}$ and determines the value of $\w^{\leq 1}[\ltf]$ up to an additive error of $\pmη$. We give a similar $2^{\poly(1/η)}$-time algorithm to determine \emph{Tomaszewski's constant} to within an additive error of $\pm η$; this is the minimum (over all origin-centered hyperplanes $H$) fraction of points in $\{-1,1\}^n$ that lie within Euclidean distance 1 of $H$. Tomaszewski's constant is conjectured to be 1/2; lower bounds on it have been given by Holzman and Kleitman \cite{HK92} and independently by Ben-Tal, Nemirovski and Roos \cite{BNR02}. Our algorithms combine tools from anti-concentration of sums of independent random variables, Fourier analysis, and Hermite analysis of linear threshold functions. \end{enumerate}

preprint2013arXiv

Deterministic Approximate Counting for Degree-$2$ Polynomial Threshold Functions

We give a {\em deterministic} algorithm for approximately computing the fraction of Boolean assignments that satisfy a degree-$2$ polynomial threshold function. Given a degree-2 input polynomial $p(x_1,\dots,x_n)$ and a parameter $\eps > 0$, the algorithm approximates \[ \Pr_{x \sim \{-1,1\}^n}[p(x) \geq 0] \] to within an additive $\pm \eps$ in time $\poly(n,2^{\poly(1/\eps)})$. Note that it is NP-hard to determine whether the above probability is nonzero, so any sort of multiplicative approximation is almost certainly impossible even for efficient randomized algorithms. This is the first deterministic algorithm for this counting problem in which the running time is polynomial in $n$ for $\eps= o(1)$. For "regular" polynomials $p$ (those in which no individual variable's influence is large compared to the sum of all $n$ variable influences) our algorithm runs in $\poly(n,1/\eps)$ time. The algorithm also runs in $\poly(n,1/\eps)$ time to approximate $\Pr_{x \sim N(0,1)^n}[p(x) \geq 0]$ to within an additive $\pm \eps$, for any degree-2 polynomial $p$. As an application of our counting result, we give a deterministic FPT multiplicative $(1 \pm \eps)$-approximation algorithm to approximate the $k$-th absolute moment $\E_{x \sim \{-1,1\}^n}[|p(x)^k|]$ of a degree-2 polynomial. The algorithm's running time is of the form $\poly(n) \cdot f(k,1/\eps)$.

preprint2013arXiv

Deterministic Approximate Counting for Juntas of Degree-$2$ Polynomial Threshold Functions

Let $g: \{-1,1\}^k \to \{-1,1\}$ be any Boolean function and $q_1,\dots,q_k$ be any degree-2 polynomials over $\{-1,1\}^n.$ We give a \emph{deterministic} algorithm which, given as input explicit descriptions of $g,q_1,\dots,q_k$ and an accuracy parameter $\eps>0$, approximates \[\Pr_{x \sim \{-1,1\}^n}[g(\sign(q_1(x)),\dots,\sign(q_k(x)))=1]\] to within an additive $\pm \eps$. For any constant $\eps > 0$ and $k \geq 1$ the running time of our algorithm is a fixed polynomial in $n$. This is the first fixed polynomial-time algorithm that can deterministically approximately count satisfying assignments of a natural class of depth-3 Boolean circuits. Our algorithm extends a recent result \cite{DDS13:deg2count} which gave a deterministic approximate counting algorithm for a single degree-2 polynomial threshold function $\sign(q(x)),$ corresponding to the $k=1$ case of our result. Our algorithm and analysis requires several novel technical ingredients that go significantly beyond the tools required to handle the $k=1$ case in \cite{DDS13:deg2count}. One of these is a new multidimensional central limit theorem for degree-2 polynomials in Gaussian random variables which builds on recent Malliavin-calculus-based results from probability theory. We use this CLT as the basis of a new decomposition technique for $k$-tuples of degree-2 Gaussian polynomials and thus obtain an efficient deterministic approximate counting algorithm for the Gaussian distribution. Finally, a third new ingredient is a "regularity lemma" for \emph{$k$-tuples} of degree-$d$ polynomial threshold functions. This generalizes both the regularity lemmas of \cite{DSTW:10,HKM:09} and the regularity lemma of Gopalan et al \cite{GOWZ10}. Our new regularity lemma lets us extend our deterministic approximate counting results from the Gaussian to the Boolean domain.

preprint2013arXiv

Efficient Density Estimation via Piecewise Polynomial Approximation

We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $τ$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t\new{(d+1)}/\eps^2)$ samples from $p$, runs in time $\poly(t,d,1/\eps)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(τ)+\eps)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $τ=0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $\eps$ must use $Ω({\frac {t(d+1)} {\poly(1 + \log(d+1))}} \cdot {\frac 1 {\eps^2}})$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming. We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains. These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.

preprint2013arXiv

How good is the Chord algorithm?

The Chord algorithm is a popular, simple method for the succinct approximation of curves, which is widely used, under different names, in a variety of areas, such as, multiobjective and parametric optimization, computational geometry, and graphics. We analyze the performance of the Chord algorithm, as compared to the optimal approximation that achieves a desired accuracy with the minimum number of points. We prove sharp upper and lower bounds, both in the worst case and average case setting.

preprint2013arXiv

Optimal Algorithms for Testing Closeness of Discrete Distributions

We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions $p$ and $q$ over an $n$-element set, we wish to distinguish whether $p=q$ versus $p$ is at least $\eps$-far from $q$, in either $\ell_1$ or $\ell_2$ distance. Batu et al. gave the first sub-linear time algorithms for these problems, which matched the lower bounds of Valiant up to a logarithmic factor in $n$, and a polynomial factor of $\eps.$ In this work, we present simple (and new) testers for both the $\ell_1$ and $\ell_2$ settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on $n$, and the dependence on $\eps$; for the $\ell_1$ testing problem we establish that the sample complexity is $Θ(\max\{n^{2/3}/\eps^{4/3}, n^{1/2}/\eps^2 \}).$

preprint2012arXiv

Efficiency-Revenue Trade-offs in Auctions

When agents with independent priors bid for a single item, Myerson's optimal auction maximizes expected revenue, whereas Vickrey's second-price auction optimizes social welfare. We address the natural question of trade-offs between the two criteria, that is, auctions that optimize, say, revenue under the constraint that the welfare is above a given level. If one allows for randomized mechanisms, it is easy to see that there are polynomial-time mechanisms that achieve any point in the trade-off (the Pareto curve) between revenue and welfare. We investigate whether one can achieve the same guarantees using deterministic mechanisms. We provide a negative answer to this question by showing that this is a (weakly) NP-hard problem. On the positive side, we provide polynomial-time deterministic mechanisms that approximate with arbitrary precision any point of the trade-off between these two fundamental objectives for the case of two bidders, even when the valuations are correlated arbitrarily. The major problem left open by our work is whether there is such an algorithm for three or more bidders with independent valuation distributions.

preprint2012arXiv

Inverse problems in approximate uniform generation

We initiate the study of \emph{inverse} problems in approximate uniform generation, focusing on uniform generation of satisfying assignments of various types of Boolean functions. In such an inverse problem, the algorithm is given uniform random satisfying assignments of an unknown function $f$ belonging to a class $\C$ of Boolean functions, and the goal is to output a probability distribution $D$ which is $ε$-close, in total variation distance, to the uniform distribution over $f^{-1}(1)$. Positive results: We prove a general positive result establishing sufficient conditions for efficient inverse approximate uniform generation for a class $\C$. We define a new type of algorithm called a \emph{densifier} for $\C$, and show (roughly speaking) how to combine (i) a densifier, (ii) an approximate counting / uniform generation algorithm, and (iii) a Statistical Query learning algorithm, to obtain an inverse approximate uniform generation algorithm. We apply this general result to obtain a poly$(n,1/\eps)$-time algorithm for the class of halfspaces; and a quasipoly$(n,1/\eps)$-time algorithm for the class of $\poly(n)$-size DNF formulas. Negative results: We prove a general negative result establishing that the existence of certain types of signature schemes in cryptography implies the hardness of certain inverse approximate uniform generation problems. This implies that there are no {subexponential}-time inverse approximate uniform generation algorithms for 3-CNF formulas; for intersections of two halfspaces; for degree-2 polynomial threshold functions; and for monotone 2-CNF formulas. Finally, we show that there is no general relationship between the complexity of the "forward" approximate uniform generation problem and the complexity of the inverse problem for a class $\C$ -- it is possible for either one to be easy while the other is hard.

preprint2012arXiv

Learning mixtures of structured distributions over discrete domains

Let $\mathfrak{C}$ be a class of probability distributions over the discrete domain $[n] = \{1,...,n\}.$ We show that if $\mathfrak{C}$ satisfies a rather general condition -- essentially, that each distribution in $\mathfrak{C}$ can be well-approximated by a variable-width histogram with few bins -- then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of $k$ unknown distributions from $\mathfrak{C}.$ We analyze several natural types of distributions over $[n]$, including log-concave, monotone hazard rate and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems.

preprint2012arXiv

Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces

The \emph{Chow parameters} of a Boolean function $f: \{-1,1\}^n \to \{-1,1\}$ are its $n+1$ degree-0 and degree-1 Fourier coefficients. It has been known since 1961 (Chow, Tannenbaum) that the (exact values of the) Chow parameters of any linear threshold function $f$ uniquely specify $f$ within the space of all Boolean functions, but until recently (O'Donnell and Servedio) nothing was known about efficient algorithms for \emph{reconstructing} $f$ (exactly or approximately) from exact or approximate values of its Chow parameters. We refer to this reconstruction problem as the \emph{Chow Parameters Problem.} Our main result is a new algorithm for the Chow Parameters Problem which, given (sufficiently accurate approximations to) the Chow parameters of any linear threshold function $f$, runs in time $\tilde{O}(n^2)\cdot (1/\eps)^{O(\log^2(1/\eps))}$ and with high probability outputs a representation of an LTF $f'$ that is $\eps$-close to $f$. The only previous algorithm (O'Donnell and Servedio) had running time $\poly(n) \cdot 2^{2^{\tilde{O}(1/\eps^2)}}.$ As a byproduct of our approach, we show that for any linear threshold function $f$ over $\{-1,1\}^n$, there is a linear threshold function $f'$ which is $\eps$-close to $f$ and has all weights that are integers at most $\sqrt{n} \cdot (1/\eps)^{O(\log^2(1/\eps))}$. This significantly improves the best previous result of Diakonikolas and Servedio which gave a $\poly(n) \cdot 2^{\tilde{O}(1/\eps^{2/3})}$ weight bound, and is close to the known lower bound of $\max\{\sqrt{n},$ $(1/\eps)^{Ω(\log \log (1/\eps))}\}$ (Goldberg, Servedio). Our techniques also yield improved algorithms for related problems in learning theory.

preprint2012arXiv

On the Distribution of the Fourier Spectrum of Halfspaces

Bourgain showed that any noise stable Boolean function $f$ can be well-approximated by a junta. In this note we give an exponential sharpening of the parameters of Bourgain's result under the additional assumption that $f$ is a halfspace.

preprint2012arXiv

The Inverse Shapley Value Problem

For $f$ a weighted voting scheme used by $n$ voters to choose between two candidates, the $n$ \emph{Shapley-Shubik Indices} (or {\em Shapley values}) of $f$ provide a measure of how much control each voter can exert over the overall outcome of the vote. Shapley-Shubik indices were introduced by Lloyd Shapley and Martin Shubik in 1954 \cite{SS54} and are widely studied in social choice theory as a measure of the "influence" of voters. The \emph{Inverse Shapley Value Problem} is the problem of designing a weighted voting scheme which (approximately) achieves a desired input vector of values for the Shapley-Shubik indices. Despite much interest in this problem no provably correct and efficient algorithm was known prior to our work. We give the first efficient algorithm with provable performance guarantees for the Inverse Shapley Value Problem. For any constant $\eps > 0$ our algorithm runs in fixed poly$(n)$ time (the degree of the polynomial is independent of $\eps$) and has the following performance guarantee: given as input a vector of desired Shapley values, if any "reasonable" weighted voting scheme (roughly, one in which the threshold is not too skewed) approximately matches the desired vector of values to within some small error, then our algorithm explicitly outputs a weighted voting scheme that achieves this vector of Shapley values to within error $\eps.$ If there is a "reasonable" voting scheme in which all voting weights are integers at most $\poly(n)$ that approximately achieves the desired Shapley values, then our algorithm runs in time $\poly(n)$ and outputs a weighted voting scheme that achieves the target vector of Shapley values to within error $\eps=n^{-1/8}.$

preprint2011arXiv

Learning transformed product distributions

We consider the problem of learning an unknown product distribution $X$ over $\{0,1\}^n$ using samples $f(X)$ where $f$ is a \emph{known} transformation function. Each choice of a transformation function $f$ specifies a learning problem in this framework. Information-theoretic arguments show that for every transformation function $f$ the corresponding learning problem can be solved to accuracy $\eps$, using $\tilde{O}(n/\eps^2)$ examples, by a generic algorithm whose running time may be exponential in $n.$ We show that this learning problem can be computationally intractable even for constant $\eps$ and rather simple transformation functions. Moreover, the above sample complexity bound is nearly optimal for the general problem, as we give a simple explicit linear transformation function $f(x)=w \cdot x$ with integer weights $w_i \leq n$ and prove that the corresponding learning problem requires $Ω(n)$ samples. As our main positive result we give a highly efficient algorithm for learning a sum of independent unknown Bernoulli random variables, corresponding to the transformation function $f(x)= \sum_{i=1}^n x_i$. Our algorithm learns to $\eps$-accuracy in poly$(n)$ time, using a surprising poly$(1/\eps)$ number of samples that is independent of $n.$ We also give an efficient algorithm that uses $\log n \cdot \poly(1/\eps)$ samples but has running time that is only $\poly(\log n, 1/\eps).$

preprint2011arXiv

Testing $k$-Modal Distributions: Optimal Algorithms via Reductions

We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the L_1 distance between two k-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$. More precisely, we consider the following four problems: given sample access to an unknown k-modal distribution $p$, Testing identity to a known or unknown distribution: 1. Determine whether $p = q$ (for an explicitly given k-modal distribution $q$) versus $p$ is $\eps$-far from $q$; 2. Determine whether $p=q$ (where $q$ is available via sample access) versus $p$ is $\eps$-far from $q$; Estimating $L_1$ distance ("tolerant testing'') against a known or unknown distribution: 3. Approximate $d_{TV}(p,q)$ to within additive $\eps$ where $q$ is an explicitly given k-modal distribution $q$; 4. Approximate $d_{TV}(p,q)$ to within additive $\eps$ where $q$ is available via sample access. For each of these four problems we give sub-logarithmic sample algorithms, that we show are tight up to additive $\poly(k)$ and multiplicative $\polylog\log n+\polylog k$ factors. Thus our bounds significantly improve the previous results of \cite{BKR:04}, which were for testing identity of distributions (items (1) and (2) above) in the special cases k=0 (monotone distributions) and k=1 (unimodal distributions) and required $O((\log n)^3)$ samples. As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way. Roughly speaking, this approach enables us to transform various distribution testing problems for k-modal distributions over $\{1,\dots,n\}$ to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain $\{1,\dots,\ell\}$ where $\ell = O(k \log n).$

preprint2010arXiv

A regularity lemma, and low-weight approximators, for low-degree polynomial threshold functions

We give a "regularity lemma" for degree-d polynomial threshold functions (PTFs) over the Boolean cube {-1,1}^n. This result shows that every degree-d PTF can be decomposed into a constant number of subfunctions such that almost all of the subfunctions are close to being regular PTFs. Here a "regular PTF is a PTF sign(p(x)) where the influence of each variable on the polynomial p(x) is a small fraction of the total influence of p. As an application of this regularity lemma, we prove that for any constants d \geq 1, \eps \geq 0, every degree-d PTF over n variables has can be approximated to accuracy eps by a constant-degree PTF that has integer weights of total magnitude O(n^d). This weight bound is shown to be optimal up to constant factors.

preprint2010arXiv

Bounded Independence Fools Degree-2 Threshold Functions

Let x be a random vector coming from any k-wise independent distribution over {-1,1}^n. For an n-variate degree-2 polynomial p, we prove that E[sgn(p(x))] is determined up to an additive epsilon for k = poly(1/epsilon). This answers an open question of Diakonikolas et al. (FOCS 2009). Using standard constructions of k-wise independent distributions, we obtain a broad class of explicit generators that epsilon-fool the class of degree-2 threshold functions with seed length log(n)*poly(1/epsilon). Our approach is quite robust: it easily extends to yield that the intersection of any constant number of degree-2 threshold functions is epsilon-fooled by poly(1/epsilon)-wise independence. Our results also hold if the entries of x are k-wise independent standard normals, implying for example that bounded independence derandomizes the Goemans-Williamson hyperplane rounding scheme. To achieve our results, we introduce a technique we dub multivariate FT-mollification, a generalization of the univariate form introduced by Kane et al. (SODA 2010) in the context of streaming algorithms. Along the way we prove a generalized hypercontractive inequality for quadratic forms which takes the operator norm of the associated matrix into account. These techniques may be of independent interest.

preprint2010arXiv

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions

Hardness results for maximum agreement problems have close connections to hardness results for proper learning in computational learning theory. In this paper we prove two hardness results for the problem of finding a low degree polynomial threshold function (PTF) which has the maximum possible agreement with a given set of labeled examples in $\R^n \times \{-1,1\}.$ We prove that for any constants $d\geq 1, \eps > 0$, {itemize} Assuming the Unique Games Conjecture, no polynomial-time algorithm can find a degree-$d$ PTF that is consistent with a $(\half + \eps)$ fraction of a given set of labeled examples in $\R^n \times \{-1,1\}$, even if there exists a degree-$d$ PTF that is consistent with a $1-\eps$ fraction of the examples. It is $\NP$-hard to find a degree-2 PTF that is consistent with a $(\half + \eps)$ fraction of a given set of labeled examples in $\R^n \times \{-1,1\}$, even if there exists a halfspace (degree-1 PTF) that is consistent with a $1 - \eps$ fraction of the examples. {itemize} These results immediately imply the following hardness of learning results: (i) Assuming the Unique Games Conjecture, there is no better-than-trivial proper learning algorithm that agnostically learns degree-$d$ PTFs under arbitrary distributions; (ii) There is no better-than-trivial learning algorithm that outputs degree-2 PTFs and agnostically learns halfspaces (i.e. degree-1 PTFs) under arbitrary distributions.

Ilias Diakonikolas

What is connected

Connect this record

See the researcher in context

Building this map preview

56 published item(s)

Cryptographic Hardness of Learning Halfspaces with Massart Noise

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Near-Optimal Bounds for Testing Histogram Distributions

Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models

ReLU Regression with Massart Noise

Agnostic Proper Learning of Halfspaces under Gaussian Marginals

Outlier-Robust Learning of Ising Models Under Dobrushin's Condition

The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals

Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks

Efficient Algorithms for Multidimensional Segmented Regression

Efficiently Learning Adversarially Robust Halfspaces with Noise

High-Dimensional Robust Mean Estimation via Gradient Descent

Learning Halfspaces with Massart Noise Under Structured Distributions

Learning Halfspaces with Tsybakov Noise

List-Decodable Mean Estimation via Iterative Multi-Filtering

Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals

Non-Convex SGD Learns Halfspaces with Adversarial Label Noise

Optimal Testing of Discrete Distributions with High Probability

Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

Robustly Learning any Clusterable Mixture of Gaussians

Testing Bayesian Networks

The Complexity of Adversarially Robust Proper Learning of Halfspaces with Agnostic Noise

The Sample Complexity of Robust Covariance Testing

A New Approach for Testing Properties of Discrete Distributions

Collision-based Testers are Optimal for Uniformity and Closeness

Fast Algorithms for Segmented Regression

Near-Optimal Disjoint-Path Facility Location Through Set Cover by Pairs

Playing Anonymous Games using Simple Strategies

Testing Shape Restrictions of Discrete Distributions

The Fourier Transform of Poisson Multinomial Distributions and its Algorithmic Applications

Learning Poisson Binomial Distributions

Optimal Algorithms and Lower Bounds for Testing Closeness of Structured Distributions

Optimal Learning via the Fourier Transform for Sums of Independent Integer Random Variables

Properly Learning Poisson Binomial Distributions in Almost Polynomial Time

Sample-Optimal Density Estimation in Nearly-Linear Time

Learning $k$-Modal Distributions via Testing

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Testing Identity of Structured Distributions

A Polynomial-time Approximation Scheme for Fault-tolerant Distributed Storage

A robust Khintchine inequality, and algorithms for computing optimal constants in Fourier analysis and high-dimensional geometry

Deterministic Approximate Counting for Degree-$2$ Polynomial Threshold Functions

Deterministic Approximate Counting for Juntas of Degree-$2$ Polynomial Threshold Functions

Efficient Density Estimation via Piecewise Polynomial Approximation

How good is the Chord algorithm?

Optimal Algorithms for Testing Closeness of Discrete Distributions

Efficiency-Revenue Trade-offs in Auctions

Inverse problems in approximate uniform generation

Learning mixtures of structured distributions over discrete domains

Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces

On the Distribution of the Fourier Spectrum of Halfspaces

The Inverse Shapley Value Problem

Learning transformed product distributions

Testing $k$-Modal Distributions: Optimal Algorithms via Reductions

A regularity lemma, and low-weight approximators, for low-degree polynomial threshold functions

Bounded Independence Fools Degree-2 Threshold Functions

Hardness Results for Agnostically Learning Low-Degree Polynomial Threshold Functions