Researcher profile

Ilias Zadik

Ilias Zadik contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
14topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2022arXiv

Almost-Linear Planted Cliques Elude the Metropolis Process

A seminal work of Jerrum (1992) showed that large cliques elude the Metropolis process. More specifically, Jerrum showed that the Metropolis algorithm cannot find a clique of size $k=Θ(n^α), α\in (0,1/2)$, which is planted in the Erdős-Rényi random graph $G(n,1/2)$, in polynomial time. Information theoretically it is possible to find such planted cliques as soon as $k \ge (2+ε) \log n$. Since the work of Jerrum, the computational problem of finding a planted clique in $G(n,1/2)$ was studied extensively and many polynomial time algorithms were shown to find the planted clique if it is of size $k = Ω(\sqrt{n})$, while no polynomial-time algorithm is known to work when $k=o(\sqrt{n})$. Notably, the first evidence of the problem&#39;s algorithmic hardness is commonly attributed to the result of Jerrum from 1992. In this paper we revisit the original Metropolis algorithm suggested by Jerrum. Interestingly, we find that the Metropolis algorithm actually fails to recover a planted clique of size $k=Θ(n^α)$ for any constant $0 \leq α< 1$. Moreover, we strengthen Jerrum&#39;s results in a number of other ways including: Like many results in the MCMC literature, the result of Jerrum shows that there exists a starting state (which may depend on the instance) for which the Metropolis algorithm fails. For a wide range of temperatures, we show that the algorithm fails when started at the most natural initial state, which is the empty clique. This answers an open problem stated in Jerrum (1992). We also show that the simulated tempering version of the Metropolis algorithm, a more sophisticated temperature-exchange variant of it, also fails at the same regime of parameters. Finally, our results confirm recent predictions by Gamarnik and Zadik (2019) and Angelini, Fachin, de Feo (2021).

preprint2022arXiv

Archimedes Meets Privacy: On Privately Estimating Quantiles in High Dimensions Under Minimal Assumptions

The last few years have seen a surge of work on high dimensional statistics under privacy constraints, mostly following two main lines of work: the ``worst case&#39;&#39; line, which does not make any distributional assumptions on the input data; and the ``strong assumptions&#39;&#39; line, which assumes that the data is generated from specific families, e.g., subgaussian distributions. In this work we take a middle ground, obtaining new differentially private algorithms with polynomial sample complexity for estimating quantiles in high-dimensions, as well as estimating and sampling points of high Tukey depth, all working under very mild distributional assumptions. From the technical perspective, our work relies upon deep robustness results in the convex geometry literature, demonstrating how such results can be used in a private context. Our main object of interest is the (convex) floating body (FB), a notion going back to Archimedes, which is a robust and well studied high-dimensional analogue of the interquantile range. We show how one can privately, and with polynomially many samples, (a) output an approximate interior point of the FB -- e.g., ``a typical user&#39;&#39; in a high-dimensional database -- by leveraging the robustness of the Steiner point of the FB; and at the expense of polynomially many more samples, (b) produce an approximate uniform sample from the FB, by constructing a private noisy projection oracle.

preprint2022arXiv

Lattice-Based Methods Surpass Sum-of-Squares in Clustering

Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. &#39;20; Mao, Wein &#39;21; Davis, Diaz, Wang &#39;21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be &#34;closed&#34; by &#34;brittle&#34; polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.

preprint2022arXiv

On the Second Kahn--Kalai Conjecture

For any given graph $H$, we are interested in $p_\mathrm{crit}(H)$, the minimal $p$ such that the Erdős-Rényi graph $G(n,p)$ contains a copy of $H$ with probability at least $1/2$. Kahn and Kalai (2007) conjectured that $p_\mathrm{crit}(H)$ is given up to a logarithmic factor by a simpler &#34;subgraph expectation threshold&#34; $p_\mathrm{E}(H)$, which is the minimal $p$ such that for every subgraph $H&#39;\subseteq H$, the Erdős-Rényi graph $G(n,p)$ contains \emph{in expectation} at least $1/2$ copies of $H&#39;$. It is trivial that $p_\mathrm{E}(H) \le p_\mathrm{crit}(H)$, and the so-called &#34;second Kahn-Kalai conjecture&#34; states that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E}(H) \log e(H)$ where $e(H)$ is the number of edges in $H$. In this article, we present a natural modification $p_\mathrm{E, new}(H)$ of the Kahn--Kalai subgraph expectation threshold, which we show is sandwiched between $p_\mathrm{E}(H)$ and $p_\mathrm{crit}(H)$. The new definition $p_\mathrm{E, new}(H)$ is based on the simple observation that if $G(n,p)$ contains a copy of $H$ and $H$ contains \emph{many} copies of $H&#39;$, then $G(n,p)$ must also contain \emph{many} copies of $H&#39;$. We then show that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E, new}(H) \log e(H)$, thus proving a modification of the second Kahn--Kalai conjecture. The bound follows by a direct application of the set-theoretic &#34;spread&#34; property, which led to recent breakthroughs in the sunflower conjecture by Alweiss, Lovett, Wu and Zhang and the first fractional Kahn--Kalai conjecture by Frankston, Kahn, Narayanan and Park.

preprint2022arXiv

Statistical and Computational Phase Transitions in Group Testing

We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design. Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp &#34;all-or-nothing&#34; phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020). For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.

preprint2021arXiv

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that &#34;fits&#34; a training data set as accurately as possible as quantified by the training error; and study the following question: \emph{does a low training error guarantee that the norm of the output layer (outer norm) itself is small?} We answer affirmatively this question for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs; any such network achieving a small training error on polynomially many data necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in $d$) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm; and (d) require quite mild assumptions on the data (in particular the input vector $X\in\mathbb{R}^d$ need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity class that the network architectures we investigate belong to. Notably, our generalization bounds also have good sample complexity (polynomials in $d$ with a low degree), and are in fact near-linear for some important cases of interest.

preprint2021arXiv

Shapes and recession cones in mixed-integer convex representability

Mixed-integer convex representable (MICP-R) sets are those sets that can be represented exactly through a mixed-integer convex programming formulation. Following up on recent work by Lubin et al. (2017, 2020) we investigate structural geometric properties of MICP-R sets, which strongly differentiate them from the class of mixed-integer linear representable sets (MILP-R). First, we provide an example of an MICP-R set which is the countably infinite union of convex sets with countably infinitely many different recession cones. This is in sharp contrast with MILP-R sets which are at most infinite unions of polyhedra that share the same recession cone. Second, we provide an example of an MICP-R set which is the countably infinite union of polytopes all of which have different shapes (no pair is combinatorially equivalent, which implies they are not affine transformations of each other). Again, this is in sharp contrast with MILP-R sets which are at most infinite unions of polyhedra that are all translations of a finite subset of themselves. Interestingly, we show that a countably infinite union of convex sets sharing the same volume can be MICP-R only if the sets are all translations of a finite subset of themselves (i.e. the natural conceptual analogue to the MILP-R case).

preprint2020arXiv

Free Energy Wells and Overlap Gap Property in Sparse PCA

We study a variant of the sparse PCA (principal component analysis) problem in the &#34;hard&#34; regime, where the inference task is possible yet no polynomial-time algorithm is known to exist. Prior work, based on the low-degree likelihood ratio, has conjectured a precise expression for the best possible (sub-exponential) runtime throughout the hard regime. Following instead a statistical physics inspired point of view, we show bounds on the depth of free energy wells for various Gibbs measures naturally associated to the problem. These free energy wells imply hitting time lower bounds that corroborate the low-degree conjecture: we show that a class of natural MCMC (Markov chain Monte Carlo) methods (with worst-case initialization) cannot solve sparse PCA with less than the conjectured runtime. These lower bounds apply to a wide range of values for two tuning parameters: temperature and sparsity misparametrization. Finally, we prove that the Overlap Gap Property (OGP), a structural property that implies failure of certain local search algorithms, holds in a significant part of the hard regime.

preprint2020arXiv

Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena

In the context of neural network models, overparametrization refers to the phenomena whereby these models appear to generalize well on the unseen data, even though the number of parameters significantly exceeds the sample sizes, and the model perfectly fits the in-training data. A conventional explanation of this phenomena is based on self-regularization properties of algorithms used to train the data. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model where the teacher network is used to generate the predictions and student network is trained on the observed labeled data, and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension and approximation guarantee alone, regardless of the number of internal nodes of either teacher or student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending on the desired accuracy and network depth only. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves a good out-of-sample performance, even when the degree of the tensor is at most two.

preprint2020arXiv

Stationary Points of Shallow Neural Networks with Quadratic Activation Function

We consider the teacher-student setting of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\le m$ is the data dimension. We study the optimization landscape associated with the empirical and the population squared risk of the problem. Under the assumption the planted weights are full-rank we obtain the following results. First, we establish that the landscape of the empirical risk admits an &#34;energy barrier&#34; separating rank-deficient $W$ from $W^*$: if $W$ is rank deficient, then its risk is bounded away from zero by an amount we quantify. We then couple this result by showing that, assuming number $N$ of samples grows at least like a polynomial function of $d$, all full-rank approximate stationary points of the empirical risk are nearly global optimum. These two results allow us to prove that gradient descent, when initialized below the energy barrier, approximately minimizes the empirical risk and recovers the planted weights in polynomial-time. Next, we show that initializing below this barrier is in fact easily achieved when the weights are randomly generated under relatively weak assumptions. We show that provided the network is sufficiently overparametrized, initializing with an appropriate multiple of the identity suffices to obtain a risk below the energy barrier. At a technical level, the last result is a consequence of the semicircle law for the Wishart ensemble and could be of independent interest. Finally, we study the minimizers of the empirical risk and identify a simple necessary and sufficient geometric condition on the training data under which any minimizer has necessarily zero generalization error. We show that as soon as $N\ge N^*=d(d+1)/2$, randomly generated data enjoys this geometric condition almost surely, while that ceases to be true if $N<N^*$.

preprint2020arXiv

The All-or-Nothing Phenomenon in Sparse Tensor PCA

We study the statistical problem of estimating a rank-one sparse tensor corrupted by additive Gaussian noise, a model also known as sparse tensor PCA. We show that for Bernoulli and Bernoulli-Rademacher distributed signals and \emph{for all} sparsity levels which are sublinear in the dimension of the signal, the sparse tensor PCA model exhibits a phase transition called the \emph{all-or-nothing phenomenon}. This is the property that for some signal-to-noise ratio (SNR) $\mathrm{SNR_c}$ and any fixed $ε>0$, if the SNR of the model is below $\left(1-ε\right)\mathrm{SNR_c}$, then it is impossible to achieve any arbitrarily small constant correlation with the hidden signal, while if the SNR is above $\left(1+ε\right)\mathrm{SNR_c}$, then it is possible to achieve almost perfect correlation with the hidden signal. The all-or-nothing phenomenon was initially established in the context of sparse linear regression, and over the last year also in the context of sparse 2-tensor (matrix) PCA, Bernoulli group testing, and generalized linear models. Our results follow from a more general result showing that for any Gaussian additive model with a discrete uniform prior, the all-or-nothing phenomenon follows as a direct outcome of an appropriately defined &#34;near-orthogonality&#34; property of the support of the prior distribution.

preprint2019arXiv

All-or-Nothing Phenomena: From Single-Letter to High Dimensions

We consider the linear regression problem of estimating a $p$-dimensional vector $β$ from $n$ observations $Y = X β+ W$, where $β_j \stackrel{\text{i.i.d.}}{\sim} π$ for a real-valued distribution $π$ with zero mean and unit variance, $X_{ij} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and $W_i\stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, σ^2)$. In the asymptotic regime where $n/p \to δ$ and $ p/ σ^2 \to \mathsf{snr}$ for two fixed constants $δ, \mathsf{snr}\in (0, \infty)$ as $p \to \infty$, the limiting (normalized) minimum mean-squared error (MMSE) has been characterized by the MMSE of an associated single-letter (additive Gaussian scalar) channel. In this paper, we show that if the MMSE function of the single-letter channel converges to a step function, then the limiting MMSE of estimating $β$ in the linear regression problem converges to a step function which jumps from $1$ to $0$ at a critical threshold. Moreover, we establish that the limiting mean-squared error of the (MSE-optimal) approximate message passing algorithm also converges to a step function with a larger threshold, providing evidence for the presence of a computational-statistical gap between the two thresholds.

preprint2019arXiv

The Landscape of the Planted Clique Problem: Dense subgraphs and the Overlap Gap Property

In this paper we study the computational-statistical gap of the planted clique problem, where a clique of size $k$ is planted in an Erdos Renyi graph $G(n,\frac{1}{2})$ resulting in a graph $G\left(n,\frac{1}{2},k\right)$. The goal is to recover the planted clique vertices by observing $G\left(n,\frac{1}{2},k\right)$ . It is known that the clique can be recovered as long as $k \geq \left(2+ε\right)\log n $ for any $ε>0$, but no polynomial-time algorithm is known for this task unless $k=Ω\left(\sqrt{n} \right)$. Following a statistical-physics inspired point of view as an attempt to understand this computational-statistical gap, we study the landscape of the &#34;sufficiently dense&#34; subgraphs of $G$ and their overlap with the planted clique. Using the first moment method, we study the densest subgraph problems for subgraphs with fixed, but arbitrary, overlap size with the planted clique, and provide evidence of a phase transition for the presence of Overlap Gap Property (OGP) at $k=Θ\left(\sqrt{n}\right)$. OGP is a concept introduced originally in spin glass theory and known to suggest algorithmic hardness when it appears. We establish the presence of OGP when $k$ is a small positive power of $n$ by using a conditional second moment method. As our main technical tool, we establish the first, to the best of our knowledge, concentration results for the $K$-densest subgraph problem for the Erdos-Renyi model $G\left(n,\frac{1}{2}\right)$ when $K=n^{0.5-ε}$ for arbitrary $ε>0$. Finally, to study the OGP we employ a certain form of overparametrization, which is conceptually aligned with a large body of recent work in learning theory and optimization.