Source author record

Shahar Mendelson

Shahar Mendelson appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.ST Statistics Theory math.PR Machine Learning math.FA Information Theory math.IT math.NA math.OC q-fin.MF

Catalog footprint

What is connected

35 works

10 topics

4 close collaborators

preprint, 2022, arXiv

Fast metric embedding into the Hamming cube

We consider the problem of embedding a subset of $\mathbb{R}^n$ into a low-dimensional Hamming cube in an almost isometric way. We construct a simple, data-oblivious, and computationally efficient map that achieves this task with high probability: we first apply a specific structured random matrix, which we call the double circulant matrix; using that matrix requires linear storage and matrix-vector multiplication can be performed in near-linear time. We then binarize each vector by comparing each of its entries to a random threshold, selected uniformly at random from a well-chosen interval. We estimate the number of bits required for this encoding scheme in terms of two natural geometric complexity parameters of the set - its Euclidean covering numbers and its localized Gaussian complexity. The estimate we derive turns out to be the best that one can hope for - up to logarithmic terms. The key to the proof is a phenomenon of independent interest: we show that the double circulant matrix mimics the behavior of a Gaussian matrix in two important ways. First, it maps an arbitrary set in $\mathbb{R}^n$ into a set of well-spread vectors. Second, it yields a fast near-isometric embedding of any finite subset of $\ell_2^n$ into $\ell_1^m$. This embedding achieves the same dimension reduction as a Gaussian matrix in near-linear time, under an optimal condition - up to logarithmic factors - on the number of points to be embedded. This improves a well-known construction due to Ailon and Chazelle.

preprint, 2022, arXiv

On Monte-Carlo methods in convex stochastic optimization

We develop a novel procedure for estimating the optimizer of general convex stochastic optimization problems of the form $\min_{x\in\mathcal{X}} \mathbb{E}[F(x,ξ)]$, when the given data is a finite independent sample selected according to $ξ$. The procedure is based on a median-of-means tournament, and is the first procedure that exhibits the optimal statistical performance in heavy tailed situations: we recover the asymptotic rates dictated by the central limit theorem in a non-asymptotic manner once the sample size exceeds some explicitly computable threshold. Additionally, our results apply in the high-dimensional setup, as the threshold sample size exhibits the optimal dependence on the dimension (up to a logarithmic factor). The general setting allows us to recover recent results on multivariate mean estimation and linear regression in heavy-tailed situations and to prove the first sharp, non-asymptotic results for the portfolio optimization problem.

preprint, 2022, arXiv

Random embeddings with an almost Gaussian distortion

Let $X$ be a symmetric, isotropic random vector in $\mathbb{R}^m$ and let $X_1,\ldots,X_n$ be independent copies of $X$. We show that under mild assumptions on $\|X\|_2$ (a suitable thin-shell bound) and on the tail-decay of the marginals $\langle X,u\rangle$, the random matrix $A$, whose columns are $X_i/\sqrt{m}$, exhibits a Gaussian-like behaviour in the following sense: for an arbitrary subset $T\subset \mathbb{R}^n$, the distortion $\sup_{t \in T} | \|At\|_2^2 - \|t\|_2^2 |$ is almost the same as if $A$ were a Gaussian matrix. A simple outcome of our result is that if $X$ is a symmetric, isotropic, log-concave random vector and $n \leq m \leq c_1(α)n^α$ for some $α>1$, then with high probability, the extremal singular values of $A$ satisfy the optimal estimate: $1-c_2(α) \sqrt{n/m} \leq λ_{\rm min} \leq λ_{\rm max} \leq 1+c_2(α) \sqrt{n/m}$.

preprint, 2022, arXiv

Sharp estimates on random hyperplane tessellations

We study the problem of generating a hyperplane tessellation of an arbitrary set $T$ in $\mathbb{R}^n$, ensuring that the Euclidean distance between any two points corresponds to the fraction of hyperplanes separating them up to a pre-specified error $δ$. We focus on random gaussian tessellations with uniformly distributed shifts and derive sharp bounds on the number of hyperplanes $m$ that are required. Surprisingly, our lower estimates falsify the conjecture that $m\sim \ell_*^2(T)/δ^2$, where $\ell_*^2(T)$ is the gaussian width of $T$, is optimal.
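The setup in this abstract can be illustrated with a minimal sketch, assuming standard Gaussian normals and shifts drawn uniformly from $[-\lambda,\lambda]$. The names `random_tessellation`, `side_bits`, `separating_fraction` and the values of `n`, `m`, `lam` are hypothetical choices for illustration; the sketch does not reproduce the paper's bounds on $m$. For two points whose Gaussian marginals stay well inside $[-\lambda,\lambda]$, a single random hyperplane separates them with probability approximately $\sqrt{2/\pi}\,\|x-y\|_2/(2\lambda)$, so the fraction of separating hyperplanes tracks the Euclidean distance.

```python
import random

def random_tessellation(n, m, lam, rng):
    """Draw m hyperplanes {z : <g, z> + tau = 0} with a standard Gaussian
    normal g and a shift tau uniform on [-lam, lam] (illustrative choice)."""
    normals = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    shifts = [rng.uniform(-lam, lam) for _ in range(m)]
    return normals, shifts

def side_bits(x, normals, shifts):
    """One bit per hyperplane: the side of the hyperplane on which x lies."""
    return [
        1 if sum(g_j * x_j for g_j, x_j in zip(g, x)) + tau >= 0 else 0
        for g, tau in zip(normals, shifts)
    ]

def separating_fraction(x, y, normals, shifts):
    """Fraction of hyperplanes separating x from y (normalized Hamming distance)."""
    bits_x = side_bits(x, normals, shifts)
    bits_y = side_bits(y, normals, shifts)
    return sum(a != b for a, b in zip(bits_x, bits_y)) / len(bits_x)

rng = random.Random(0)
n, m, lam = 5, 4000, 6.0
normals, shifts = random_tessellation(n, m, lam, rng)

x = [0.5, 0.0, 0.0, 0.0, 0.0]
y = [-0.5, 0.0, 0.0, 0.0, 0.0]  # Euclidean distance 1.0 from x
z = [0.1, 0.0, 0.0, 0.0, 0.0]   # Euclidean distance 0.4 from x

far = separating_fraction(x, y, normals, shifts)
near = separating_fraction(x, z, normals, shifts)
print(far, near)  # the more distant pair is separated by more hyperplanes
```

The bit vectors returned by `side_bits` are points of the Hamming cube, so the same sketch also shows how a random tessellation induces a one-bit embedding.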

preprint, 2021, arXiv

Column randomization and almost-isometric embeddings

The matrix $A:\mathbb{R}^n \to \mathbb{R}^m$ is $(δ,k)$-regular if for any $k$-sparse vector $x$, $$ \left| \|Ax\|_2^2-\|x\|_2^2\right| \leq δ\sqrt{k} \|x\|_2^2. $$ We show that if $A$ is $(δ,k)$-regular for $1 \leq k \leq 1/δ^2$, then by multiplying the columns of $A$ by independent random signs, the resulting random ensemble $A_ε$ acts on an arbitrary subset $T \subset \mathbb{R}^n$ (almost) as if it were gaussian, and with the optimal probability estimate: if $\ell_*(T)$ is the gaussian mean-width of $T$ and $d_T=\sup_{t \in T} \|t\|_2$, then with probability at least $1-2\exp(-c(\ell_*(T)/d_T)^2)$, $$ \sup_{t \in T} \left| \|A_εt\|_2^2-\|t\|_2^2 \right| \leq C\left(Λd_T δ\ell_*(T)+(δ\ell_*(T))^2 \right), $$ where $Λ=\max\{1,δ^2\log(nδ^2)\}$. This estimate is optimal for $0<δ\leq 1/\sqrt{\log n}$.

preprint, 2020, arXiv

Approximating $L_p$ unit balls via random sampling

Let $X$ be an isotropic random vector in $\mathbb{R}^d$ that satisfies that for every $v \in S^{d-1}$, $\|\langle X,v\rangle\|_{L_q} \leq L \|\langle X,v\rangle\|_{L_p}$ for some $q \geq 2p$. We show that for $0<\varepsilon<1$, a set of $N = c(p,q,\varepsilon) d$ random points, selected independently according to $X$, can be used to construct a $1 \pm \varepsilon$ approximation of the $L_p$ unit ball endowed on $\mathbb{R}^d$ by $X$. Moreover, $c(p,q,\varepsilon) \leq c^p \varepsilon^{-2}\log(2/\varepsilon)$; when $q=2p$ the approximation is achieved with probability at least $1-2\exp(-cN \varepsilon^2/\log^2(2/\varepsilon))$, and if $q$ is much larger than $p$ (say, $q=4p$), the approximation is achieved with probability at least $1-2\exp(-cN \varepsilon^2)$. In particular, when $X$ is a log-concave random vector, this estimate improves the previous state of the art, namely that $N=c^\prime(p,\varepsilon) d^{p/2}\log d$ random points are enough and that the approximation is valid with constant probability.

preprint, 2020, arXiv

Extending the scope of the small-ball method

The small-ball method was introduced as a way of obtaining a high probability, isomorphic lower bound on the quadratic empirical process, under weak assumptions on the indexing class. The key assumption was that class members satisfy a uniform small-ball estimate: that $Pr(|f| \geq κ\|f\|_{L_2}) \geq δ$ for given constants $κ$ and $δ$. Here we extend the small-ball method and obtain a high probability, almost-isometric (rather than isomorphic) lower bound on the quadratic empirical process. The scope of the result is considerably wider than the small-ball method: there is no need for class members to satisfy a uniform small-ball condition, and moreover, motivated by the notion of tournament learning procedures, the result is stable under a `majority vote'.

preprint, 2020, arXiv

Learning bounded subsets of $L_p$

We study learning problems in which the underlying class is a bounded subset of $L_p$ and the target $Y$ belongs to $L_p$. Previously, minimax sample complexity estimates were known under such boundedness assumptions only when $p=\infty$. We present a sharp sample complexity estimate that holds for any $p > 4$. It is based on a learning procedure that is suited for heavy-tailed problems.

preprint, 2020, arXiv

Robust multivariate mean estimation: the optimality of trimmed mean

We consider the problem of estimating the mean of a random vector based on i.i.d. observations and adversarial contamination. We introduce a multivariate extension of the trimmed-mean estimator and show its optimal performance under minimal conditions.
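For orientation only, here is a minimal sketch of the classical one-dimensional trimmed mean, which discards a fixed fraction of the smallest and largest observations before averaging. The abstract does not describe the multivariate extension, so this is not the paper's estimator; the function name `trimmed_mean` and the parameter `trim_fraction` are hypothetical.

```python
def trimmed_mean(sample, trim_fraction):
    """Classical univariate trimmed mean: drop the k smallest and k largest
    values, where k = floor(trim_fraction * len(sample)), then average."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must lie in [0, 0.5)")
    ordered = sorted(sample)
    k = int(trim_fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Two corrupted observations (-50 and 900) among eight data points.
sample = [3.0, 900.0, 1.0, 4.0, -50.0, 2.0, 6.0, 5.0]
print(sum(sample) / len(sample))   # plain mean, pulled far away by the outliers
print(trimmed_mean(sample, 0.25))  # mean of the middle four values: 3.5
```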

preprint, 2017, arXiv

Regularization and the small-ball method I: sparse recovery

We obtain bounds on estimation error rates for regularization procedures of the form \begin{equation*} \hat f \in {\rm argmin}_{f\in F}\left(\frac{1}{N}\sum_{i=1}^N\left(Y_i-f(X_i)\right)^2+λΨ(f)\right) \end{equation*} when $Ψ$ is a norm and $F$ is convex. Our approach gives a common framework that may be used in the analysis of learning problems and regularization problems alike. In particular, it sheds some light on the role various notions of sparsity have in regularization and on their connection with the size of subdifferentials of $Ψ$ in a neighbourhood of the true minimizer. As `proof of concept' we extend the known estimates for the LASSO, SLOPE and trace norm regularization.
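One concrete instance of the displayed procedure is the LASSO, where $F$ is the class of linear functionals and $Ψ$ is the $\ell_1$ norm. The sketch below computes that minimizer with proximal gradient descent (ISTA), a standard solver swapped in here for illustration; the abstract concerns statistical error rates and does not prescribe an algorithm. The names `lasso_ista` and `soft_threshold`, the step-size bound and the toy data are assumptions.

```python
def soft_threshold(value, threshold):
    """Proximal map of threshold * |.|: shrink value towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_ista(X, Y, lam, num_iters=500):
    """Minimize (1/N) * sum_i (Y_i - <X_i, beta>)^2 + lam * ||beta||_1
    by proximal gradient descent with a fixed step size."""
    N, d = len(X), len(X[0])
    # The gradient of the squared-loss term is Lipschitz with constant
    # (2/N) * lambda_max(X^T X), bounded above by (2/N) * sum of squared entries.
    lipschitz_bound = (2.0 / N) * sum(v * v for row in X for v in row)
    step = 1.0 / lipschitz_bound
    beta = [0.0] * d
    for _ in range(num_iters):
        residual = [
            sum(x_j * b_j for x_j, b_j in zip(row, beta)) - y
            for row, y in zip(X, Y)
        ]
        grad = [
            (2.0 / N) * sum(res * row[j] for res, row in zip(residual, X))
            for j in range(d)
        ]
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta

# Toy orthogonal design, so each coordinate can be checked by hand:
# the minimizer soft-thresholds Y at level lam, giving (2, 0).
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [3.0, 0.1]
beta_hat = lasso_ista(X, Y, lam=1.0)
print(beta_hat)
```

The small second coordinate is set exactly to zero, which is the sparsity-inducing effect of the $\ell_1$ penalty.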

preprint, 2016, arXiv

Learning subgaussian classes: Upper and minimax bounds

We obtain sharp oracle inequalities for the empirical risk minimization procedure in the regression model under the assumption that the target Y and the model F are subgaussian. The bound we obtain is sharp in the minimax sense if F is convex. Moreover, under mild assumptions on F, the error rate of ERM remains optimal even if the procedure is allowed to perform with constant probability. A part of our analysis is a new proof of minimax results for the gaussian regression model.

preprint, 2016, arXiv

On multiplier processes under weak moment assumptions

We show that if $V \subset \mathbb{R}^n$ satisfies a certain symmetry condition (closely related to unconditionality) and if $X$ is an isotropic random vector for which $\|\langle X,t\rangle\|_{L_p} \leq L \sqrt{p}$ for every $t \in S^{n-1}$ and $p \lesssim \log n$, then the corresponding empirical and multiplier processes indexed by $V$ behave as if $X$ were $L$-subgaussian.

preprint, 2016, arXiv

Performance of empirical risk minimization in linear aggregation

We study conditions under which, given a dictionary $F=\{f_1,\ldots ,f_M\}$ and an i.i.d. sample $(X_i,Y_i)_{i=1}^N$, the empirical minimizer in $\operatorname {span}(F)$ relative to the squared loss, satisfies that with high probability \[R\bigl(\tilde{f}^{\mathrm{ERM}}\bigr)\leq\inf_{f\in\operatorname {span}(F)}R(f)+r_N(M),\] where $R(\cdot)$ is the squared risk and $r_N(M)$ is of the order of $M/N$. Among other results, we prove that a uniform small-ball estimate for functions in $\operatorname {span}(F)$ is enough to achieve that goal when the noise is independent of the design.

preprint, 2016, arXiv

Regularization and the small-ball method II: complexity dependent error rates

For a convex class of functions $F$, a regularization function $Ψ(\cdot)$ and given the random data $(X_i, Y_i)_{i=1}^N$, we study estimation properties of regularization procedures of the form \begin{equation*} \hat f \in {\rm argmin}_{f\in F}\Big(\frac{1}{N}\sum_{i=1}^N\big(Y_i-f(X_i)\big)^2+λΨ(f)\Big) \end{equation*} for some well chosen regularization parameter $λ$. We obtain bounds on the $L_2$ estimation error rate that depend on the complexity of the "true model" $F^*:=\{f\in F: Ψ(f)\leqΨ(f^*)\}$, where $f^*\in {\rm argmin}_{f\in F}\mathbb{E}(Y-f(X))^2$ and the $(X_i,Y_i)$'s are independent and distributed as $(X,Y)$. Our estimate holds under weak stochastic assumptions -- one of which being a small-ball condition satisfied by $F$ -- and for rather flexible choices of regularization functions $Ψ(\cdot)$. Moreover, the result holds in the learning theory framework: we do not assume any a-priori connection between the output $Y$ and the input $X$. As a proof of concept, we apply our general estimation bound to various choices of $Ψ$, for example, the $\ell_p$ and $S_p$-norms (for $p\geq1$), weak-$\ell_p$, atomic norms, max-norm and SLOPE. In many cases, the estimation rate almost coincides with the minimax rate in the class $F^*$.

preprint, 2016, arXiv

Risk minimization by median-of-means tournaments

We consider the classical statistical learning/regression problem, when the value of a real random variable Y is to be predicted based on the observation of another random variable X. Given a class of functions F and a sample of independent copies of (X, Y ), one needs to choose a function f from F such that f(X) approximates Y as well as possible, in the mean-squared sense. We introduce a new procedure, the so-called median-of-means tournament, that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.
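The tournament procedure itself is not spelled out in the abstract. As background, the sketch below shows only the basic median-of-means estimate of a mean, the building block the name refers to: split the sample into blocks, average each block, and take the median of the block averages. The function name `median_of_means` and the parameter `num_blocks` are hypothetical, and the way the tournament compares pairs of functions is not reproduced here.

```python
import statistics

def median_of_means(sample, num_blocks):
    """Split the sample into num_blocks blocks of equal size (leftover points
    are dropped), average each block, and return the median of the averages."""
    block_size = len(sample) // num_blocks
    if block_size == 0:
        raise ValueError("need at least one observation per block")
    block_means = [
        statistics.fmean(sample[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)

# A single wild observation ruins the empirical mean but corrupts only one block.
sample = [0.0] * 19 + [1000.0]
print(statistics.fmean(sample))    # 50.0
print(median_of_means(sample, 5))  # 0.0
```

With real data the sample would normally be shuffled or drawn i.i.d. before blocking; the fixed ordering here just keeps the example deterministic.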

preprint, 2015, arXiv

`local' vs. `global' parameters -- breaking the gaussian complexity barrier

We show that if $F$ is a convex class of functions that is $L$-subgaussian, the error rate of learning problems generated by independent noise is equivalent to a fixed point determined by `local' covering estimates of the class, rather than by the gaussian averages. To that end, we establish new sharp upper and lower estimates on the error rate for such problems.

preprint, 2015, arXiv

On aggregation for heavy-tailed classes

We introduce an alternative to the notion of `fast rate' in Learning Theory, which coincides with the optimal error rate when the given class happens to be convex and regular in some sense. While it is well known that such a rate cannot always be attained by a learning procedure (i.e., a procedure that selects a function in the given class), we introduce an aggregation procedure that attains that rate under rather minimal assumptions -- for example, that the $L_q$ and $L_2$ norms are equivalent on the linear span of the class for some $q>2$, and the target random variable is square-integrable.

preprint, 2015, arXiv

Sparse recovery under weak moment assumptions

We prove that iid random vectors that satisfy a rather weak moment assumption can be used as measurement vectors in Compressed Sensing, and the number of measurements required for exact reconstruction is the same as the best possible estimate -- exhibited by a random gaussian matrix. We also prove that this moment condition is necessary, up to a $\log \log $ factor. Applications to the Compatibility Condition and the Restricted Eigenvalue Condition in the noisy setup and to properties of neighbourly random polytopes are also discussed.

preprint, 2015, arXiv

Upper bounds on product and multiplier empirical processes

We study two empirical processes of special structure: firstly, the centred multiplier process indexed by a class $F$, $f \to \left|\sum_{i=1}^N (ξ_i f(X_i) - \mathbb{E} ξf)\right|$, where the i.i.d. multipliers $(ξ_i)_{i=1}^N$ need not be independent of $(X_i)_{i=1}^N$, and secondly, $(f,h) \to \left|\sum_{i=1}^N (f(X_i)h(X_i)-\mathbb{E} f h) \right|$, the centred product process indexed by the classes $F$ and $H$. We use chaining methods to obtain high probability upper bounds on the suprema of the two processes using a natural variation of Talagrand's $γ$-functionals.

preprint, 2014, arXiv

Dvoretzky type theorems for subgaussian coordinate projections

Given a class of functions $F$ on a probability space $(Ω,μ)$, we study the structure of a typical coordinate projection of the class, defined by $\{(f(X_i))_{i=1}^N : f \in F\}$, where $X_1,...,X_N$ are independent, selected according to $μ$. This notion of projection generalizes the standard linear random projection used in Asymptotic Geometric Analysis. We show that when $F$ is a subgaussian class of functions, a typical coordinate projection satisfies a Dvoretzky type theorem.

preprint, 2014, arXiv

Learning without Concentration

We obtain sharp bounds on the performance of Empirical Risk Minimization performed in a convex class and with respect to the squared loss, without assuming that class members and the target are bounded functions or have rapidly decaying tails. Rather than resorting to a concentration-based argument, the method used here relies on a `small-ball' assumption and thus holds for classes consisting of heavy-tailed functions and for heavy-tailed targets. The resulting estimates scale correctly with the `noise level' of the problem, and when applied to the classical, bounded scenario, always improve the known bounds.

preprint, 2014, arXiv

Learning without Concentration for General Loss Functions

We study prediction and estimation problems using empirical risk minimization, relative to a general convex loss function. We obtain sharp error rates even when concentration is false or is very restricted, for example, in heavy-tailed scenarios. Our results show that the error rate depends on two parameters: one captures the intrinsic complexity of the class, and essentially leads to the error rate in a noise-free (or realizable) problem; the other measures interactions between class members, the target and the loss, and is dominant when the problem is far from realizable. We also explain how one may deal with outliers by choosing the loss in a way that is calibrated to the intrinsic complexity of the class and to the noise-level of the problem (the latter is measured by the distance between the target and the class).

preprint, 2014, arXiv

Necessary moment conditions for exact reconstruction via basis pursuit

Let $X=(x_1,...,x_n)$ be a random vector that satisfies a weak small ball property and whose coordinates $x_i$ satisfy that $\|x_i\|_{L_p} \lesssim \sqrt{p} \|x_i\|_{L_2}$ for $p \sim \log n$. In \cite{LM_compressed}, it was shown that $N$ independent copies of $X$ can be used as measurement vectors in Compressed Sensing (using the basis pursuit algorithm) to reconstruct any $d$-sparse vector with the optimal number of measurements $N\gtrsim d \log\big(e n/d\big)$. In this note we show that the result is almost optimal. We construct a random vector $X$ with iid, mean-zero, variance one coordinates that satisfies the same weak small ball property and whose coordinates satisfy that $\|x_i\|_{L_p} \lesssim \sqrt{p} \|x_i\|_{L_2}$ for $p \sim (\log n)/(\log N)$, but the basis pursuit algorithm fails to recover even $1$-sparse vectors. The construction shows that `spiky' measurement vectors may lead to a poor performance by the basis pursuit algorithm, but on the other hand may still perform in an optimal way if one chooses a different reconstruction algorithm (like $\ell_0$-minimization). This exhibits the fact that the convex relaxation of $\ell_0$-minimization comes at a significant cost when using `spiky' measurement vectors.

preprint, 2013, arXiv

A remark on the diameter of random sections of convex bodies

We obtain a new upper estimate on the Euclidean diameter of the intersection of the kernel of a random matrix with iid rows with a given convex body. The proof is based on a small-ball argument rather than on concentration and thus the estimate holds for relatively general matrix ensembles.

preprint, 2013, arXiv

Bounding the smallest singular value of a random matrix without concentration

Given $X$ a random vector in $\mathbb{R}^n$, set $X_1,\ldots,X_N$ to be independent copies of $X$ and let $Γ=\frac{1}{\sqrt{N}}\sum_{i=1}^N \langle X_i,\cdot\rangle e_i$ be the matrix whose rows are $\frac{X_1}{\sqrt{N}},\dots, \frac{X_N}{\sqrt{N}}$. We obtain sharp probabilistic lower bounds on the smallest singular value $λ_{\min}(Γ)$ in a rather general situation, and in particular, under the assumption that $X$ is an isotropic random vector for which $\sup_{t\in S^{n-1}}\mathbb{E}|\langle t,X\rangle|^{2+η} \leq L$ for some $L,η>0$. Our results imply that a Bai-Yin type lower bound holds for $η>2$, and, up to a log-factor, for $η=2$ as well. The bounds hold without any additional assumptions on the Euclidean norm $\|X\|_{\ell_2^n}$. Moreover, we establish a nontrivial lower bound even without any higher moment assumptions (corresponding to the case $η=0$), if the linear forms satisfy a weak `small ball' property.

preprint, 2013, arXiv

Minimax rate of convergence and the performance of ERM in phase recovery

We study the performance of Empirical Risk Minimization in noisy phase retrieval problems, indexed by subsets of $\mathbb{R}^n$ and relative to subgaussian sampling; that is, when the given data is $y_i=\langle a_i,x_0\rangle^2+w_i$ for a subgaussian random vector $a$, independent noise $w$ and a fixed but unknown $x_0$ that belongs to a given subset of $\mathbb{R}^n$. We show that ERM produces $\hat{x}$ whose Euclidean distance to either $x_0$ or $-x_0$ depends on the gaussian mean-width of the indexing set and on the signal-to-noise ratio of the problem. The bound coincides with the one for linear regression when $\|x_0\|_2$ is of the order of a constant. In addition, we obtain a minimax lower bound for the problem and identify sets for which ERM is a minimax procedure. As examples, we study the class of $d$-sparse vectors in $\mathbb{R}^n$ and the unit ball in $\ell_1^n$.
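The data model in this abstract can be simulated with a short sketch, assuming standard Gaussian measurement vectors (one example of subgaussian sampling), Gaussian noise and the squared loss. The names `simulate_measurements` and `empirical_risk` and all numerical values are hypothetical. The sketch shows why recovery is only possible up to a global sign: $x_0$ and $-x_0$ produce identical quadratic measurements, hence identical empirical risk.

```python
import random

def inner(a, x):
    """Euclidean inner product <a, x>."""
    return sum(a_j * x_j for a_j, x_j in zip(a, x))

def simulate_measurements(x0, num_samples, noise_std, rng):
    """Generate pairs (a_i, y_i) with y_i = <a_i, x0>^2 + w_i."""
    measurements = []
    for _ in range(num_samples):
        a = [rng.gauss(0.0, 1.0) for _ in x0]
        w = rng.gauss(0.0, noise_std)
        measurements.append((a, inner(a, x0) ** 2 + w))
    return measurements

def empirical_risk(x, measurements):
    """Average squared loss of the candidate x on the quadratic measurements."""
    return sum((y - inner(a, x) ** 2) ** 2 for a, y in measurements) / len(measurements)

rng = random.Random(1)
x0 = [1.0, -2.0, 0.5]
data = simulate_measurements(x0, num_samples=200, noise_std=0.1, rng=rng)

risk_plus = empirical_risk(x0, data)
risk_minus = empirical_risk([-v for v in x0], data)
risk_zero = empirical_risk([0.0, 0.0, 0.0], data)
print(risk_plus, risk_minus, risk_zero)  # the first two coincide
```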

preprint, 2013, arXiv

On the optimality of the aggregate with exponential weights for low temperatures

Given a finite class of functions $F$, the problem of aggregation is to construct a procedure with a risk as close as possible to the risk of the best element in the class. A classical procedure (PAC-Bayesian statistical learning theory (2004) Paris 6, Statistical Learning Theory and Stochastic Optimization (2001) Springer, Ann. Statist. 28 (2000) 75-87) is the aggregate with exponential weights (AEW), defined by \[\tilde{f}^{\mathrm{AEW}}=\sum_{f\in F}\hat{θ}(f)f,\qquad \text{where}\quad \hat{θ}(f)=\frac{\exp(-(n/T)R_n(f))}{\sum_{g\in F}\exp(-(n/T)R_n(g))},\] with $T>0$ the temperature parameter and $R_n(\cdot)$ an empirical risk. In this article, we study the optimality of the AEW in the regression model with random design and in the low-temperature regime. We prove three properties of AEW. First, we show that AEW is a suboptimal aggregation procedure in expectation with respect to the quadratic risk when $T\leq c_1$, where $c_1$ is an absolute positive constant (the low-temperature regime), and that it is suboptimal in probability even for high temperatures. Second, we show that as the cardinality of the dictionary grows, the behavior of AEW might deteriorate, namely, that in the low-temperature regime it might concentrate with high probability around elements in the dictionary with risk greater than the risk of the best function in the dictionary by at least an order of $1/\sqrt{n}$. Third, we prove that if a geometric condition on the dictionary (the so-called "Bernstein condition") is assumed, then AEW is indeed optimal both in high probability and in expectation in the low-temperature regime. Moreover, under that assumption, the complexity term is essentially the logarithm of the cardinality of the set of "almost minimizers" rather than the logarithm of the cardinality of the entire dictionary. This result holds for small values of the temperature parameter, thus complementing an analogous result for high temperatures.
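The weight formula in this abstract translates directly into code. The following is a minimal sketch: the names `aew_weights` and `aew_predict` and the toy risk values are hypothetical. Subtracting the smallest empirical risk before exponentiating is a numerical-stability step added here; it leaves the normalized weights unchanged.

```python
import math

def aew_weights(empirical_risks, n, temperature):
    """Exponential weights proportional to exp(-(n / T) * R_n(f)),
    normalized to sum to one over the dictionary."""
    r_min = min(empirical_risks)
    unnormalized = [
        math.exp(-(n / temperature) * (r - r_min)) for r in empirical_risks
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def aew_predict(weights, predictions):
    """Value of the aggregate at one point, given each dictionary element's
    prediction at that point."""
    return sum(w * p for w, p in zip(weights, predictions))

risks = [0.10, 0.12, 0.30]  # empirical risks of three dictionary elements
low_t = aew_weights(risks, n=100, temperature=0.1)
high_t = aew_weights(risks, n=100, temperature=1e6)
print(low_t)   # nearly all mass on the empirical risk minimizer
print(high_t)  # close to uniform weights
print(aew_predict(low_t, [1.0, 2.0, 3.0]))
```

At low temperature the aggregate behaves almost like empirical risk minimization over the dictionary, which is the regime the abstract studies.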

preprint, 2013, arXiv

Suprema of Chaos Processes and the Restricted Isometry Property

We present a new bound for suprema of a special type of chaos processes indexed by a set of matrices, which is based on a chaining method. As applications we show significantly improved estimates for the restricted isometry constants of partial random circulant matrices and time-frequency structured random matrices. In both cases the required condition on the number $m$ of rows in terms of the sparsity $s$ and the vector length $n$ is $m \gtrsim s \log^2 s \log^2 n$.

preprint, 2012, arXiv

General nonexact oracle inequalities for classes with a subexponential envelope

We show that empirical risk minimization procedures and regularized empirical risk minimization procedures satisfy nonexact oracle inequalities in an unbounded framework, under the assumption that the class has a subexponential envelope function. The main novelty, in addition to the boundedness assumption free setup, is that those inequalities can yield fast rates even in situations in which exact oracle inequalities only hold with slower rates. We apply these results to show that procedures based on $\ell_1$ and nuclear norm regularization functions satisfy oracle inequalities with a residual term that decreases like $1/n$ for every $L_q$-loss function ($q\geq2$), while only assuming that the tails of the input and output variables are well behaved. In particular, no RIP type of assumption or "incoherence condition" is needed to obtain fast residual terms in those setups. We also apply these results to the problems of convex aggregation and model selection.

preprint, 2012, arXiv

Phase Retrieval: Stability and Recovery Guarantees

We consider stability and uniqueness in real phase retrieval problems over general input sets. Specifically, we assume the data consists of noisy quadratic measurements of an unknown input $x$ in $\mathbb{R}^n$ that lies in a general set $T$ and study conditions under which $x$ can be stably recovered from the measurements. In the noise-free setting we derive a general expression on the number of measurements needed to ensure that a unique solution can be found in a stable way, that depends on the set $T$ through a natural complexity parameter. This parameter can be computed explicitly for many sets $T$ of interest. For example, for $k$-sparse inputs we show that $O(k\log(n/k))$ measurements are needed, and when $x$ can be any vector in $\mathbb{R}^n$, $O(n)$ measurements suffice. In the noisy case, we show that if one can find a value for which the empirical risk is bounded by a given, computable constant (that depends on the set $T$), then the error with respect to the true input is bounded above by another, closely related complexity parameter of the set. By choosing an appropriate number $N$ of measurements, this bound can be made arbitrarily small, and it decays at a rate faster than $N^{-1/2+δ}$ for any $δ>0$. In particular, for $k$-sparse vectors stable recovery is possible from $O(k\log(n/k)\log k)$ noisy measurements, and when $x$ can be any vector in $\mathbb{R}^n$, $O(n \log n)$ noisy measurements suffice. We also show that the complexity parameter for the quadratic problem is the same as the one used for analyzing stability in linear measurements under very general conditions. Thus, no substantial price has to be paid in terms of stability if there is no knowledge of the phase.

preprint, 2011, arXiv

Discrepancy, chaining and subgaussian processes

We show that for a typical coordinate projection of a subgaussian class of functions, the infimum over signs $\inf_{(ε_i)}{\sup_{f\in F}}|{\sum_{i=1}^kε_i}f(X_i)|$ is asymptotically smaller than the expectation over signs as a function of the dimension $k$, if the canonical Gaussian process indexed by $F$ is continuous. To that end, we establish a bound on the discrepancy of an arbitrary subset of $\mathbb {R}^k$ using properties of the canonical Gaussian process the set indexes, and then obtain quantitative structural information on a typical coordinate projection of a subgaussian class.

preprint, 2011, arXiv

On generic chaining and the smallest singular value of random matrices with heavy tails

We present a very general chaining method which allows one to control the supremum of the empirical process $\sup_{h \in H} |N^{-1}\sum_{i=1}^N h^2(X_i)-\mathbb{E} h^2|$ in rather general situations. We use this method to establish two main results. First, a quantitative (non-asymptotic) version of the classical Bai-Yin Theorem on the singular values of a random matrix with i.i.d. entries that have heavy tails, and second, a sharp estimate on the quadratic empirical process when $H=\{\langle t,\cdot\rangle : t \in T\}$, $T \subset \mathbb{R}^n$ and $μ$ is an isotropic, unconditional, log-concave measure.

preprint, 2011, arXiv

Sharper lower bounds on the performance of the empirical risk minimization algorithm

We present an argument based on the multidimensional and the uniform central limit theorems, proving that, under some geometrical assumptions between the target function $T$ and the learning class $F$, the excess risk of the empirical risk minimization algorithm is lower bounded by \[\frac{\mathbb{E}\sup_{q\in Q}G_q}{\sqrt{n}}δ,\] where $(G_q)_{q\in Q}$ is a canonical Gaussian process associated with $Q$ (a well chosen subset of $F$) and $δ$ is a parameter governing the oscillations of the empirical excess risk function over a small ball in $F$.

preprint, 2010, arXiv

Empirical processes with bounded ψ_1 diameter

We study the empirical process indexed by $F^2=\{f^2 : f \in F\}$, where $F$ is a class of mean-zero functions on a probability space. We present a sharp bound on the supremum of that process which depends on the $ψ_1$ diameter of the class $F$ (rather than on the $ψ_2$ one) and on the complexity parameter $γ_2(F,ψ_2)$. In addition, we present optimal bounds on the random diameters $\sup_{f \in F} \max_{|I|=m} (\sum_{i \in I} f^2(X_i))^{1/2}$ using the same parameters. As applications, we extend several well known results in Asymptotic Geometric Analysis to any isotropic, log-concave ensemble on $\mathbb{R}^n$.

preprint, 2010, arXiv

Regularization in kernel learning

Under mild assumptions on the kernel, we obtain the best known error rates in a regularized learning scenario taking place in the corresponding reproducing kernel Hilbert space (RKHS). The main novelty in the analysis is a proof that one can use a regularization term that grows significantly slower than the standard quadratic growth in the RKHS norm.
