Source author record

Alexandre B. Tsybakov

Alexandre B. Tsybakov appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: math.ST, Statistics Theory, Machine Learning, Applications, Methodology

Catalog footprint

What is connected

21 works

5 topics

4 close collaborators


preprint, 2022, arXiv

Benign overfitting and adaptive nonparametric regression

In the nonparametric regression setting, we construct an estimator which is a continuous function interpolating the data points with high probability, while attaining minimax optimal rates under mean squared risk on the scale of Hölder classes adaptively to the unknown smoothness.

preprint, 2021, arXiv

Variable selection, monotone likelihood ratio and group sparsity

In the pivotal variable selection problem, we derive the exact non-asymptotic minimax selector over the class of all $s$-sparse vectors, which is also the Bayes selector with respect to the uniform prior. While this optimal selector is, in general, not realizable in polynomial time, we show that its tractable counterpart (the scan selector) attains the minimax expected Hamming risk to within factor 2, and is also exact minimax with respect to the probability of wrong recovery. As a consequence, we establish explicit lower bounds under the monotone likelihood ratio property and we obtain a tight characterization of the minimax risk in terms of the best separable selector risk. We apply these general results to derive necessary and sufficient conditions of exact and almost full recovery in the location model with light tail distributions and in the problem of group variable selection under Gaussian noise.

preprint, 2020, arXiv

Adaptive robust estimation in sparse vector model

For the sparse vector model, we consider estimation of the target vector, of its L2-norm and of the noise variance. We construct adaptive estimators and establish the optimal rates of adaptive estimation when adaptation is considered with respect to the triplet "noise level - noise distribution - sparsity". We consider classes of noise distributions with polynomially and exponentially decreasing tails as well as the case of Gaussian noise. The obtained rates turn out to be different from the minimax non-adaptive rates when the triplet is known. A crucial issue is the ignorance of the noise variance. Moreover, knowing or not knowing the noise distribution can also influence the rate. For example, the rates of estimation of the noise variance can differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution. Estimation of noise variance in our setting can be viewed as an adaptive variant of robust estimation of scale in the contamination model, where instead of fixing the "nominal" distribution in advance, we assume that it belongs to some class of distributions.

preprint, 2016, arXiv

Bounds on the prediction error of penalized least squares estimators with convex penalty

This paper considers the penalized least squares estimator with arbitrary convex penalty. When the observation noise is Gaussian, we show that the prediction error is a subgaussian random variable concentrated around its median. We apply this concentration property to derive sharp oracle inequalities for the prediction error of the LASSO, the group LASSO and the SLOPE estimators, both in probability and in expectation. In contrast to the previous work on the LASSO type methods, our oracle inequalities in probability are obtained at any confidence level for estimators with tuning parameters that do not depend on the confidence level. This is also the reason why we are able to establish sparsity oracle bounds in expectation for the LASSO type estimators, while the previously known techniques did not allow for the control of the expected risk. In addition, we show that the concentration rate in the oracle inequalities is better than it was commonly understood before.

preprint, 2016, arXiv

Nuclear norm penalization and optimal rates for noisy low rank matrix completion

This paper deals with the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We propose a new nuclear norm penalized estimator of $A_0$ and establish a general sharp oracle inequality for this estimator for arbitrary values of $n,m_1,m_2$ under the condition of isometry in expectation. Then this method is applied to the matrix completion problem. In this case, the estimator admits a simple explicit form and we prove that it satisfies oracle inequalities with faster rates of convergence than in the previous works. They are valid, in particular, in the high-dimensional setting $m_1m_2\gg n$. We show that the obtained rates are optimal up to logarithmic factors in a minimax sense and also derive, for any fixed matrix $A_0$, a non-minimax lower bound on the rate of convergence of our estimator, which coincides with the upper bound up to a constant factor. Finally, we show that our procedure provides an exact recovery of the rank of $A_0$ with probability close to 1. We also discuss the statistical learning setting where there is no underlying model determined by $A_0$ and the aim is to find the best trace regression model approximating the data.

preprint, 2016, arXiv

Robust Matrix Completion

This paper considers the problem of recovery of a low-rank matrix in the situation when most of its entries are not observed and a fraction of observed entries are corrupted. The observations are noisy realizations of the sum of a low rank matrix, which we wish to recover, with a second matrix having a complementary sparse structure such as element-wise or column-wise sparsity. We analyze a class of estimators obtained by solving a constrained convex optimization problem that combines the nuclear norm and a convex relaxation for a sparse constraint. Our results are obtained for the simultaneous presence of random and deterministic patterns in the sampling scheme. We provide guarantees for recovery of low-rank and sparse components from partial and corrupted observations in the presence of noise and show that the obtained rates of convergence are minimax optimal.

preprint, 2015, arXiv

Minimax estimation of linear and quadratic functionals on sparsity classes

For the Gaussian sequence model, we obtain non-asymptotic minimax rates of estimation of the linear, quadratic and the L2-norm functionals on classes of sparse vectors and construct optimal estimators that attain these rates. The main object of interest is the class of s-sparse vectors for which we also provide completely adaptive estimators (independent of s and of the noise variance) having only logarithmically slower rates than the minimax ones. Furthermore, we obtain the minimax rates on the Lq-balls where 0 < q < 2. This analysis shows that there are, in general, three zones in the rates of convergence that we call the sparse zone, the dense zone and the degenerate zone, while a fourth zone appears for estimation of the quadratic functional. We show that, as opposed to estimation of the vector, the correct logarithmic terms in the optimal rates for the sparse zone scale as log(d/s^2) and not as log(d/s). For the sparse class, the rates of estimation of the linear functional and of the L2-norm have a simple elbow at s = sqrt(d) (boundary between the sparse and the dense zones) and exhibit similar performances, whereas the estimation of the quadratic functional reveals more complex effects and is not possible only on the basis of sparsity described by the sparsity condition on the vector. Finally, we apply our results on estimation of the L2-norm to the problem of testing against sparse alternatives. In particular, we obtain a non-asymptotic analog of the Ingster-Donoho-Jin theory revealing some effects that were not captured by the previous asymptotic analysis.

preprint, 2015, arXiv

Sharp oracle bounds for monotone and convex regression through aggregation

We derive oracle inequalities for the problems of isotonic and convex regression using a combination of the $Q$-aggregation procedure and sparsity pattern aggregation. This improves upon the previous results including the oracle inequalities for the constrained least squares estimator. One of the improvements is that our oracle inequalities are sharp, i.e., with leading constant 1. This allows us to obtain bounds for the minimax regret, thus accounting for model misspecification, which was not possible based on the previous results. Another improvement is that we obtain oracle inequalities both with high probability and in expectation.

preprint, 2014, arXiv

An $\{l_1,l_2,l_{\infty}\}$-Regularization Approach to High-Dimensional Errors-in-variables Models

Several new estimation methods have been recently proposed for the linear regression model with observation error in the design. Different assumptions on the data generating process have motivated different estimators and analyses. In particular, the literature considered (1) observation errors in the design uniformly bounded by some $\bar δ$, and (2) zero mean independent observation errors. Under the first assumption, the rates of convergence of the proposed estimators depend explicitly on $\bar δ$, while the second assumption has been applied when an estimator for the second moment of the observational error is available. This work proposes and studies two new estimators which, compared to other procedures for regression models with errors in the design, exploit an additional $l_{\infty}$-norm regularization. The first estimator is applicable when both (1) and (2) hold but does not require an estimator for the second moment of the observational error. The second estimator is applicable under (2) and requires an estimator for the second moment of the observation error. Importantly, we impose no assumption on the accuracy of this pilot estimator, in contrast to the previously known procedures. As in the recent proposals, we allow the number of covariates to be much larger than the sample size. We establish the rates of convergence of the estimators and compare them with the bounds obtained for related estimators in the literature. These comparisons offer insights into the interplay of the assumptions and the achievable rates of convergence.

preprint, 2013, arXiv

Sparse Estimation by Exponential Weighting

Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks, including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.

preprint, 2012, arXiv

Mirror averaging with sparsity priors

We consider the problem of aggregating the elements of a possibly infinite dictionary for building a decision procedure that aims at minimizing a given criterion. Along with the dictionary, an independent identically distributed training sample is available, on which the performance of a given procedure can be tested. In a fairly general set-up, we establish an oracle inequality for the Mirror Averaging aggregate with any prior distribution. By choosing an appropriate prior, we apply this oracle inequality in the context of prediction under sparsity assumption for the problems of regression with random design, density estimation and binary classification.

preprint, 2011, arXiv

Estimation of high-dimensional low-rank matrices

Suppose that we observe entries or, more generally, linear combinations of entries of an unknown $m\times T$-matrix $A$ corrupted by noise. We are particularly interested in the high-dimensional setting where the number $mT$ of unknown entries can be much larger than the sample size $N$. Motivated by several applications, we consider estimation of matrix $A$ under the assumption that it has small rank. This can be viewed as dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-$p$ quasi-norm penalty term, $p\leq1$. We study these estimators under two possible assumptions: a modified version of the restricted isometry condition and a uniform bound on the ratio "empirical norm induced by the sampling operator/Frobenius norm." The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-$q$ risk of the estimators, where $q\in[p,2]$. The rates that we obtain for the prediction risk are of the form $rm/N$ (for $m=T$), up to logarithmic factors, where $r$ is the rank of $A$. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the $k$th entropy numbers of the quasi-convex Schatten class embeddings $S_p^M\hookrightarrow S_2^M$, $p<1$, which are of independent interest.

preprint, 2011, arXiv

Fast learning rates for plug-in classifiers under the margin condition

It has been recently shown that, under the margin (or low noise) assumption, there exist classifiers attaining fast rates of convergence of the excess Bayes risk, i.e., the rates faster than $n^{-1/2}$. The works on this subject suggested the following two conjectures: (i) the best achievable fast rate is of the order $n^{-1}$, and (ii) the plug-in classifiers generally converge slower than the classifiers based on empirical risk minimization. We show that both conjectures are not correct. In particular, we construct plug-in classifiers that can achieve not only the fast, but also the super-fast rates, i.e., the rates faster than $n^{-1}$. We establish minimax lower bounds showing that the obtained rates cannot be improved.

preprint, 2011, arXiv

Improved Matrix Uncertainty Selector

We consider the regression model with observation error in the design: $y=Xθ^*+e$, $Z=X+N$. Here the random vector $y\in\mathbb{R}^n$ and the random $n\times p$ matrix $Z$ are observed, the $n\times p$ matrix $X$ is unknown, $N$ is an $n\times p$ random noise matrix, $e\in\mathbb{R}^n$ is a random noise vector, and $θ^*$ is a vector of unknown parameters to be estimated. We consider the setting where the dimension $p$ can be much larger than the sample size $n$ and $θ^*$ is sparse. Because of the presence of the noise matrix $N$, the commonly used Lasso and Dantzig selector are unstable. An alternative procedure called the Matrix Uncertainty (MU) selector has been proposed in Rosenbaum and Tsybakov (2010) in order to account for the noise. The properties of the MU selector have been studied in Rosenbaum and Tsybakov (2010) for sparse $θ^*$ under the assumption that the noise matrix $N$ is deterministic and its values are small. In this paper, we propose a modification of the MU selector when $N$ is a random matrix with zero-mean entries having variances that can be estimated. This is, for example, the case in the model where the entries of $X$ are missing at random. We show both theoretically and numerically that, under these conditions, the new estimator called the Compensated MU selector achieves better accuracy of estimation than the original MU selector.

preprint, 2010, arXiv

Detection boundary in sparse regression

We study the problem of detection of a $p$-dimensional sparse vector of parameters in the linear regression model with Gaussian noise. We establish the detection boundary, i.e., the necessary and sufficient conditions for the possibility of successful detection as both the sample size $n$ and the dimension $p$ tend to infinity. Testing procedures that achieve this boundary are also exhibited. Our results encompass the high-dimensional setting ($p\gg n$). The main message is that, under some conditions, the detection boundary phenomenon that has been proved for the Gaussian sequence model extends to high-dimensional linear regression. Finally, we establish the detection boundaries when the variance of the noise is unknown. Interestingly, the detection boundaries sometimes depend on the knowledge of the variance in a high-dimensional setting.

preprint, 2010, arXiv

Oracle Inequalities and Optimal Inference under Group Sparsity

We consider the problem of estimating a sparse linear regression vector $β^*$ under a Gaussian noise model, for the purpose of both prediction and model selection. We assume that prior knowledge is available on the sparsity pattern, namely the set of variables is partitioned into prescribed groups, only few of which are relevant in the estimation process. This group sparsity assumption suggests considering the Group Lasso method as a means to estimate $β^*$. We establish oracle inequalities for the prediction and $\ell_2$ estimation errors of this estimator. These bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger coherence condition, we derive bounds for the estimation error for mixed $(2,p)$-norms with $1\le p\leq \infty$. When $p=\infty$, this result implies that a threshold version of the Group Lasso estimator selects the sparsity pattern of $β^*$ with high probability. Next, we prove that the rate of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor, for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds for the prediction and $\ell_2$ estimation errors of the usual Lasso estimator. Using this result, we demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation properties as compared to the Lasso.

preprint, 2010, arXiv

Simultaneous analysis of Lasso and Dantzig selector

We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1\le p\le 2$ in the linear model when the number of variables can be much larger than the sample size.

preprint, 2010, arXiv

SPADES and mixture models

This paper studies sparse density estimation via $\ell_1$ penalization (SPADES). We focus on estimation in high-dimensional mixture models and nonparametric adaptive density estimation. We show, respectively, that SPADES can recover, with high probability, the unknown components of a mixture of probability densities and that it yields minimax adaptive density estimates. These results are based on a general sparsity oracle inequality that the SPADES estimates satisfy. We offer a data driven method for the choice of the tuning parameter used in the construction of SPADES. The method uses the generalized bisection method first introduced in \cite{bb09}. The suggested procedure bypasses the need for a grid search and offers substantial computational savings. We complement our theoretical results with a simulation study that employs this method for approximations of one and two-dimensional densities with mixtures. The numerical results strongly support our theoretical findings.

preprint, 2010, arXiv

Sparse recovery under matrix uncertainty

We consider the model $$y=Xθ^*+ξ,\quad Z=X+Ξ,$$ where the random vector $y\in\mathbb{R}^n$ and the random $n\times p$ matrix $Z$ are observed, the $n\times p$ matrix $X$ is unknown, $Ξ$ is an $n\times p$ random noise matrix, $ξ\in\mathbb{R}^n$ is a noise independent of $Ξ$, and $θ^*$ is a vector of unknown parameters to be estimated. The matrix uncertainty is in the fact that $X$ is observed with additive error. For dimensions $p$ that can be much larger than the sample size $n$, we consider the estimation of sparse vectors $θ^*$. Under matrix uncertainty, the Lasso and Dantzig selector turn out to be extremely unstable in recovering the sparsity pattern (i.e., the set of nonzero components of $θ^*$), even if the noise level is very small. We suggest new estimators called matrix uncertainty selectors (or, for short, MU-selectors) which are close to $θ^*$ in different norms and in the prediction risk if the restricted eigenvalue assumption on $X$ is satisfied. We also show that under somewhat stronger assumptions, these estimators correctly recover the sparsity pattern.

preprint, 2010, arXiv

Sparse Regression Learning by Aggregation and Langevin Monte-Carlo

We consider the problem of regression learning for deterministic design and independent random errors. We start by proving a sharp PAC-Bayesian type bound for the exponentially weighted aggregate (EWA) under the expected squared empirical loss. For a broad class of noise distributions the presented bound is valid whenever the temperature parameter $β$ of the EWA is larger than or equal to $4σ^2$, where $σ^2$ is the noise variance. A remarkable feature of this result is that it is valid even for unbounded regression functions and the choice of the temperature parameter depends exclusively on the noise level. Next, we apply this general bound to the problem of aggregating the elements of a finite-dimensional linear space spanned by a dictionary of functions $ϕ_1,...,ϕ_M$. We allow $M$ to be much larger than the sample size $n$ but we assume that the true regression function can be well approximated by a sparse linear combination of functions $ϕ_j$. Under this sparsity scenario, we propose an EWA with a heavy tailed prior and we show that it satisfies a sparsity oracle inequality with leading constant one. Finally, we propose several Langevin Monte-Carlo algorithms to approximately compute such an EWA when the number $M$ of aggregated functions can be large. We discuss in some detail the convergence of these algorithms and present numerical experiments that confirm our theoretical findings.

preprint, 2009, arXiv

Taking Advantage of Sparsity in Multi-Task Learning

We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning (Argyriou et al. [2008]), we assume that the regression vectors share the same sparsity pattern. This means that the set of relevant predictor variables is the same across the different equations. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work in Bickel et al. [2007], Lounici [2008]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.
