Source author record

Sara van de Geer

Sara van de Geer appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

math.ST, Statistics Theory, Methodology, Machine Learning, math.PR, Applications, Computation, math.OC

Catalog footprint

What is connected

38 works

8 topics

4 close collaborators

preprint · 2021 · arXiv

Adaptive Rates for Total Variation Image Denoising

We study the theoretical properties of image denoising via total variation penalized least-squares. We define the total variation in terms of the two-dimensional total discrete derivative of the image and show that it gives rise to denoised images that are piecewise constant on rectangular sets. We prove that, if the true image is piecewise constant on just a few rectangular sets, the denoised image converges to the true image at a parametric rate, up to a log factor. More generally, we show that the denoised image enjoys oracle properties, that is, it is almost as good as if some aspects of the true image were known. In other words, image denoising with total variation regularization leads to an adaptive reconstruction of the true image.
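As a rough illustration of the penalty described in this abstract, the sketch below assumes that the two-dimensional total discrete derivative is the mixed first difference f[i][j] - f[i-1][j] - f[i][j-1] + f[i-1][j-1]. The abstract does not spell out the formula, so this is an assumption, and the function names `mixed_total_variation` and `rectangle_image` are hypothetical. The example shows that an image that is constant on one rectangle has nonzero mixed differences only at the four corners of that rectangle.

```python
def mixed_total_variation(image):
    """Sum of absolute mixed first differences of a 2-D list of numbers.

    Assumed form of the penalty; any normalization or margin terms
    used in the paper are left out.
    """
    total = 0.0
    for i in range(1, len(image)):
        for j in range(1, len(image[0])):
            total += abs(image[i][j] - image[i - 1][j]
                         - image[i][j - 1] + image[i - 1][j - 1])
    return total


def rectangle_image(rows, cols, top, left, bottom, right, height=1.0):
    """Image equal to `height` on rows top..bottom-1 and cols left..right-1, else 0."""
    return [[height if top <= i < bottom and left <= j < right else 0.0
             for j in range(cols)] for i in range(rows)]


# One interior rectangle: only its four corners contribute to the penalty.
interior = rectangle_image(8, 8, 2, 3, 5, 7)
tv_interior = mixed_total_variation(interior)

# A constant image has no mixed differences at all.
tv_constant = mixed_total_variation([[3.0] * 4 for _ in range(4)])
```

Under this assumed definition the penalty counts corners rather than edge length, which is consistent with the abstract's statement that the denoised images are piecewise constant on rectangular sets.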

preprint · 2021 · arXiv

Deep ReLU Programming

Feed-forward ReLU neural networks partition their input domain into finitely many "affine regions" of constant neuron activation pattern and affine behaviour. We analyze their mathematical structure and provide algorithmic primitives for an efficient application of linear programming related techniques for iterative minimization of such non-convex functions. In particular, we propose an extension of the Simplex algorithm which iterates on induced vertices but, in addition, is able to change its feasible region to adjacent "affine regions" in a computationally efficient way. This way, we obtain the Barrodale-Roberts algorithm for LAD regression as a special case, but are also able to train the first layer of neural networks with L1 training loss decreasing in every step.
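The notion of an "affine region" can be made concrete with a small sketch. It assumes a network with one hidden ReLU layer and a scalar output; the weights and all function names are made up for illustration and are not taken from the paper. It computes the activation pattern at a point and the affine map that agrees with the network on the region sharing that pattern.

```python
def preactivations(W1, b1, x):
    """Hidden-layer values before the ReLU is applied."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]


def activation_pattern(W1, b1, x):
    """Tuple of 0/1 flags: which hidden neurons are active at x."""
    return tuple(1 if z > 0 else 0 for z in preactivations(W1, b1, x))


def network(W1, b1, w2, b2, x):
    """Output of the one-hidden-layer ReLU network at x."""
    hidden = [max(0.0, z) for z in preactivations(W1, b1, x)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def local_affine_map(W1, b1, w2, b2, pattern):
    """Gradient and offset of the affine function the network equals
    on the region where the activation pattern is `pattern`."""
    n_hidden, n_inputs = len(w2), len(W1[0])
    gradient = [sum(w2[k] * pattern[k] * W1[k][j] for k in range(n_hidden))
                for j in range(n_inputs)]
    offset = b2 + sum(w2[k] * pattern[k] * b1[k] for k in range(n_hidden))
    return gradient, offset


# Hypothetical weights: 2 inputs, 3 hidden neurons, 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
b1 = [0.0, -1.0, 0.5]
w2 = [1.0, -2.0, 3.0]
b2 = 0.25

x = [1.0, 0.5]
pattern = activation_pattern(W1, b1, x)
gradient, offset = local_affine_map(W1, b1, w2, b2, pattern)
affine_value = sum(g * xi for g, xi in zip(gradient, x)) + offset
network_value = network(W1, b1, w2, b2, x)
```

Within one region the pattern is fixed, so minimizing the network output there is a problem with affine structure, which is what makes linear programming techniques applicable region by region.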

preprint · 2021 · arXiv

Tensor denoising with trend filtering

We extend the notion of trend filtering to tensors by considering the $k^{\rm th}$-order Vitali variation, a discretized version of the integral of the absolute value of the $k^{\rm th}$-order total derivative. We prove adaptive $\ell^0$-rates and not-so-slow $\ell^1$-rates for tensor denoising with trend filtering. For $k \in \{1,2,3,4\}$ we prove that the $d$-dimensional margin of a $d$-dimensional tensor can be estimated at the $\ell^0$-rate $n^{-1}$, up to logarithmic terms, if the underlying tensor is a product of $(k-1)^{\rm th}$-order polynomials on a constant number of hyperrectangles. For general $k$ we prove the $\ell^1$-rate of estimation $n^{- \frac{H(d)+2k-1}{2H(d)+2k-1}}$, up to logarithmic terms, where $H(d)$ is the $d^{\rm th}$ harmonic number. Thanks to an ANOVA-type decomposition we can apply these results to the lower dimensional margins of the tensor to prove bounds for denoising the whole tensor. Our tools are interpolating tensors to bound the effective sparsity for $\ell^0$-rates, mesh grids for $\ell^1$-rates and, in the background, the projection arguments by Dalalyan et al.
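To get a feel for the $\ell^1$-rate quoted in this abstract, the exponent $\frac{H(d)+2k-1}{2H(d)+2k-1}$ can be evaluated exactly. The sketch below only does this arithmetic; the function names are hypothetical and logarithmic factors are ignored.

```python
from fractions import Fraction


def harmonic_number(d):
    """H(d) = 1 + 1/2 + ... + 1/d as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, d + 1))


def l1_rate_exponent(d, k):
    """Exponent e such that the stated l1-rate is n**(-e)."""
    h = harmonic_number(d)
    return (h + 2 * k - 1) / (2 * h + 2 * k - 1)


# Exponents for a few dimensions d and orders k.
exponents = {(d, k): l1_rate_exponent(d, k) for d in (1, 2, 3) for k in (1, 2)}
```

For $d = 1$ and $k = 1$ the exponent is $2/3$. It decreases as $d$ grows and increases with $k$, so the rate slows in higher dimensions and improves with the order of the trend filter.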

preprint · 2020 · arXiv

A Framework for the construction of upper bounds on the number of affine linear regions of ReLU feed-forward neural networks

We present a framework to derive upper bounds on the number of regions that feed-forward neural networks with ReLU activation functions are affine linear on. It is based on an inductive analysis that keeps track of the number of such regions per dimensionality of their images within the layers. More precisely, the information about the number of regions per dimensionality is pushed through the layers starting with one region of the input dimension of the neural network and using a recursion based on an analysis of how many regions per output dimensionality a subsequent layer with a certain width can induce on an input region with a given dimensionality. The final bound on the number of regions depends on the number and widths of the layers of the neural network and on some additional parameters that were used for the recursion. It is stated in terms of the $L1$-norm of the last column of a product of matrices and provides a unifying treatment of several previously known bounds: depending on the choice of the recursion parameters that determine these matrices, it is possible to obtain the bounds from Montúfar (2014), (2017) and Serra et al. (2017) as special cases. For the latter, which is the strongest of these bounds, the formulation in terms of matrices provides new insight. In particular, by using explicit formulas for a Jordan-like decomposition of the involved matrices, we achieve new tighter results for the asymptotic setting, where the number of layers of the same fixed width tends to infinity.

preprint · 2020 · arXiv

Logistic regression with total variation regularization

We study logistic regression with total variation penalty on the canonical parameter and show that the resulting estimator satisfies a sharp oracle inequality: the excess risk of the estimator is adaptive to the number of jumps of the underlying signal or an approximation thereof. In particular when there are finitely many jumps, and jumps up are sufficiently separated from jumps down, then the estimator converges with a parametric rate up to a logarithmic term $\log n / n$, provided the tuning parameter is chosen appropriately of order $1/ \sqrt n$. Our results extend earlier results for quadratic loss to logistic loss. We do not assume any a priori known bounds on the canonical parameter but instead only make use of the local curvature of the theoretical risk.

preprint · 2020 · arXiv

Prediction bounds for higher order total variation regularized least squares

We establish adaptive results for trend filtering: least squares estimation with a penalty on the total variation of $(k-1)^{\rm th}$ order differences. Our approach is based on combining a general oracle inequality for the $\ell_1$-penalized least squares estimator with "interpolating vectors" to upper-bound the "effective sparsity". This allows one to show that the $\ell_1$-penalty on the $k^{\text{th}}$ order differences leads to an estimator that can adapt to the number of jumps in the $(k-1)^{\text{th}}$ order differences of the underlying signal or an approximation thereof. We show the result for $k \in \{1,2,3,4\}$ and indicate how it could be derived for general $k\in \mathbb{N}$.
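The penalty in this abstract acts on higher order differences of the signal. The sketch below assumes the plain, unnormalized $\ell_1$-norm of the $k^{\text{th}}$ order differences; any scaling by the sample size used in the paper is left out, and the function names are hypothetical. It shows that a piecewise linear signal with one kink has a single nonzero second order difference.

```python
def differences(signal, order):
    """Apply the first-difference operator `order` times to a list of numbers."""
    values = list(signal)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def trend_filtering_penalty(signal, k):
    """Unnormalized l1-norm of the k-th order differences of `signal`."""
    return sum(abs(d) for d in differences(signal, k))


# Piecewise linear signal with a single kink at index 3.
signal = [0, 1, 2, 3, 2, 1, 0]
second_differences = differences(signal, 2)
penalty_k2 = trend_filtering_penalty(signal, 2)
penalty_k1 = trend_filtering_penalty(signal, 1)
```

With $k = 2$ the penalty only charges for the kink, that is, for the one jump in the first order differences, which is the kind of structure the estimator is said to adapt to.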

preprint · 2019 · arXiv

Oracle inequalities for square root analysis estimators with application to total variation penalties

Through the direct study of the analysis estimator we derive oracle inequalities with fast and slow rates by adapting the arguments involving projections by Dalalyan, Hebiri and Lederer (2017). We then extend the theory to the square root analysis estimator. Finally, we focus on (square root) total variation regularized estimators on graphs and obtain constant-friendly rates, which, up to log-terms, match previous results obtained by entropy calculations. We also obtain an oracle inequality for the (square root) total variation regularized estimator over the cycle graph.

preprint · 2016 · arXiv

Concentration behavior of the penalized least squares estimator

Consider the standard nonparametric regression model and take as estimator the penalized least squares function. In this article, we study the trade-off between closeness to the true function and complexity penalization of the estimator, where complexity is described by a seminorm on a class of functions. First, we present an exponential concentration inequality revealing the concentration behavior of the trade-off of the penalized least squares estimator around a nonrandom quantity, where such quantity depends on the problem under consideration. Then, under some conditions and for the proper choice of the tuning parameter, we obtain bounds for this nonrandom quantity. We illustrate our results with some examples that include the smoothing splines estimator.

preprint · 2016 · arXiv

Honest confidence regions and optimality in high-dimensional precision matrix estimation

We propose methodology for estimation of sparse precision matrices and statistical inference for their low-dimensional parameters in a high-dimensional setting where the number of parameters $p$ can be much larger than the sample size. We show that the novel estimator achieves minimax rates in supremum norm and the low-dimensional components of the estimator have a Gaussian limiting distribution. These results hold uniformly over the class of precision matrices with row sparsity of small order $\sqrt{n}/\log p$ and spectrum uniformly bounded, under a sub-Gaussian tail assumption on the margins of the true underlying distribution. Consequently, our results lead to uniformly valid confidence regions for low-dimensional parameters of the precision matrix. Thresholding the estimator leads to variable selection without imposing irrepresentability conditions. The performance of the method is demonstrated in a simulation study and on real data.

preprint · 2016 · arXiv

On concentration for (regularized) empirical risk minimization

Rates of convergence for empirical risk minimizers have been well studied in the literature. In this paper, we aim to provide a complementary set of results, in particular by showing that after normalization, the risk of the empirical minimizer concentrates on a single point. Such results have been established by Chatterjee (2014) for constrained estimators in the normal sequence model. We first generalize and sharpen this result to regularized least squares with convex penalties, making use of a "direct" argument based on Borell's theorem. We then study generalizations to other loss functions, including the negative log-likelihood for exponential families combined with a strictly convex regularization penalty. The results in this general setting are based on more "indirect" arguments as well as on concentration inequalities for maxima of empirical processes.

preprint · 2016 · arXiv

Sharp Oracle Inequalities for Square Root Regularization

We study a set of regularization methods for high-dimensional linear regression models. These penalized estimators have the square root of the residual sum of squared errors as loss function, and any weakly decomposable norm as penalty function. This fit measure is chosen because of its property that the estimator does not depend on the unknown standard deviation of the noise. On the other hand, a generalized weakly decomposable norm penalty is very useful in being able to deal with different underlying sparsity structures. We can choose a different sparsity inducing norm depending on how we want to interpret the unknown parameter vector $β$. Structured sparsity norms, as defined in Micchelli et al. [18], are special cases of weakly decomposable norms, therefore we also include the square root LASSO (Belloni et al. [3]), the group square root LASSO (Bunea et al. [10]) and a new method called the square root SLOPE (in a similar fashion to the SLOPE from Bogdan et al. [6]). For this collection of estimators our results provide sharp oracle inequalities with the Karush-Kuhn-Tucker conditions. We discuss some examples of estimators. Based on a simulation we illustrate some advantages of the square root SLOPE.
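The remark that the estimator does not depend on the unknown noise standard deviation can be illustrated through the objective itself. The sketch below assumes the $\ell_1$-norm as penalty (the square root LASSO special case) and the normalization $\|Y - Xβ\|_2/\sqrt{n}$ for the loss; the function name and the toy data are hypothetical. It checks that rescaling the response and the coefficients by a constant rescales the objective by the same constant.

```python
import math


def sqrt_lasso_objective(X, y, beta, lam):
    """Square-root loss plus l1 penalty: ||y - X beta||_2 / sqrt(n) + lam * ||beta||_1."""
    n = len(y)
    residuals = [yi - sum(xij * bj for xij, bj in zip(row, beta))
                 for row, yi in zip(X, y)]
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / n) + lam * sum(abs(b) for b in beta)


# Toy design, response and coefficient vector (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 2.0]
beta = [0.5, 1.5]
lam = 0.1

base = sqrt_lasso_objective(X, y, beta, lam)

# Multiply the response (signal and noise alike) and beta by the same scale.
scale = 10.0
scaled = sqrt_lasso_objective(X, [scale * yi for yi in y],
                              [scale * b for b in beta], lam)
```

Because the whole objective is homogeneous of degree one, the minimizer rescales with the data while `lam` stays fixed, so the tuning parameter need not be calibrated to the noise level.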

preprint · 2015 · arXiv

$χ^2$-confidence sets in high-dimensional regression

We study a high-dimensional regression model. The aim is to construct a confidence set for a given group of regression coefficients, treating all other regression coefficients as nuisance parameters. We apply a one-step procedure with the square-root Lasso as initial estimator and a multivariate square-root Lasso for constructing a surrogate Fisher information matrix. The multivariate square-root Lasso is based on nuclear norm loss with $\ell_1$-penalty. We show that this procedure leads to an asymptotically $χ^2$-distributed pivot, with a remainder term depending only on the $\ell_1$-error of the initial estimator. We show that under $\ell_1$-sparsity conditions on the regression coefficients $β^0$ the square-root Lasso produces a consistent estimator of the noise variance and we establish sharp oracle inequalities which show that the remainder term is small under further sparsity conditions on $β^0$ and compatibility conditions on the design.

preprint · 2015 · arXiv

Confidence intervals for high-dimensional inverse covariance estimation

We propose methodology for statistical inference for low-dimensional parameters of sparse precision matrices in a high-dimensional setting. Our method leads to a non-sparse estimator of the precision matrix whose entries have a Gaussian limiting distribution. Asymptotic properties of the novel estimator are analyzed for the case of sub-Gaussian observations under a sparsity assumption on the entries of the true precision matrix and regularity conditions. Thresholding the de-sparsified estimator gives guarantees for edge selection in the associated graphical model. Performance of the proposed method is illustrated in a simulation study.

preprint · 2015 · arXiv

High-dimensional inference in misspecified linear models

We consider high-dimensional inference when the assumed linear model is misspecified. We describe some correct interpretations and corresponding sufficient assumptions for valid asymptotic inference of the model parameters, which still have a useful meaning when the model is misspecified. We largely focus on the de-sparsified Lasso procedure but we also indicate some implications for (multiple) sample splitting techniques. In view of available methods and software, our results contribute to robustness considerations with respect to model misspecification.

preprint · 2014 · arXiv

Censored linear model in high dimensions

Censored data are quite common in statistics and have been studied in depth in recent years. In this paper we consider censored high-dimensional data. High-dimensional models are in some ways more complex than their low-dimensional versions, so different techniques are required. For the linear case, appropriate estimators based on penalized regression have been developed in recent years. In particular, in sparse contexts the $l_1$-penalised regression (also known as LASSO) performs very well. Little theoretical work has been done to analyse censored linear models in a high-dimensional context. We therefore consider a high-dimensional censored linear model, where the response variable is left-censored. We propose a new estimator, which aims to work with high-dimensional linear censored data. Theoretical non-asymptotic oracle inequalities are derived.

preprint · 2014 · arXiv

Discussion: "A significance test for the lasso"

Discussion of "A significance test for the lasso" by Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].

preprint · 2014 · arXiv

New concentration inequalities for suprema of empirical processes

While effective concentration inequalities for suprema of empirical processes exist under boundedness or strict tail assumptions, no comparable results have been available under considerably weaker assumptions. In this paper, we derive concentration inequalities assuming only low moments for an envelope of the empirical process. These concentration inequalities are beneficial even when the envelope is much larger than the single functions under consideration.

preprint · 2014 · arXiv

On asymptotically optimal confidence regions and tests for high-dimensional models

We propose a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model. It can be easily adjusted for multiplicity taking dependence among tests into account. For linear models, our method is essentially the same as in Zhang and Zhang [J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014) 217-242]: we analyze its asymptotic properties and establish its asymptotic optimality in terms of semiparametric efficiency. Our method naturally extends to generalized linear models with convex loss functions. We develop the corresponding theory which includes a careful analysis for Gaussian, sub-Gaussian and bounded correlated designs.

preprint · 2014 · arXiv

On higher order isotropy conditions and lower bounds for sparse quadratic forms

This study aims at contributing to lower bounds for empirical compatibility constants or empirical restricted eigenvalues. This is of importance in compressed sensing and theory for $\ell_1$-regularized estimators. Let $X$ be an $n \times p$ data matrix with rows being independent copies of a $p$-dimensional random variable. Let $\hat Σ:= X^T X / n$ be the inner product matrix. We show that the quadratic forms $u^T \hat Σu$ are lower bounded by a value converging to one, uniformly over the set of vectors $u$ with $u^T Σ_0 u $ equal to one and $\ell_1$-norm at most $M$. Here $Σ_0 := {\bf E} \hat Σ$ is the theoretical inner product matrix which we assume to exist. The constant $M$ is required to be of small order $\sqrt {n / \log p}$. We assume moreover $m$-th order isotropy for some $m >2$ and sub-exponential tails or moments up to order $\log p$ for the entries in $X$. As a consequence we obtain convergence of the empirical compatibility constant to its theoretical counterpart, and similarly for the empirical restricted eigenvalue. If the data matrix $X$ is first normalized so that its columns all have equal length we obtain lower bounds assuming only isotropy and no further moment conditions on its entries. The isotropy condition is shown to hold for certain martingale situations.

preprint · 2014 · arXiv

Statistical Theory for High-Dimensional Models

These lecture notes consist of three chapters. In the first chapter we present oracle inequalities for the prediction error of the Lasso and square-root Lasso and briefly describe the scaled Lasso. In the second chapter we establish asymptotic linearity of a de-sparsified Lasso. This implies asymptotic normality under certain conditions and therefore can be used to construct confidence intervals for parameters of interest. A similar line of reasoning can be invoked to derive bounds in sup-norm for the Lasso and asymptotic linearity of de-sparsified estimators of a precision matrix. In the third chapter we consider chaining and the more general generic chaining method developed by Talagrand. This allows one to bound suprema of random processes. Concentration inequalities are refined probability inequalities, mostly again for suprema of random processes. We combine the two. We prove a deviation inequality directly using (generic) chaining.

preprint · 2014 · arXiv

The additive model with different smoothness for the components

We consider an additive regression model consisting of two components $f^0$ and $g^0$, where the first component $f^0$ is in some sense "smoother" than the second $g^0$. Smoothness is here described in terms of a semi-norm on the class of regression functions. We use a penalized least squares estimator $(\hat f, \hat g)$ of $(f^0, g^0)$ and show that the rate of convergence for $\hat f $ is faster than the rate of convergence for $\hat g$. In fact, both rates are generally as fast as in the case where one of the two components is known. The theory is illustrated by a simulation study. Our proofs rely on recent results from empirical process theory.

preprint · 2014 · arXiv

Worst possible sub-directions in high-dimensional models

We examine the rate of convergence of the Lasso estimator of lower dimensional components of the high-dimensional parameter. Under bounds on the $\ell_1$-norm on the worst possible sub-direction these rates are of order $\sqrt {|J| \log p / n }$ where $p$ is the total number of parameters, $J \subset \{ 1, \ldots, p \}$ represents a subset of the parameters and $n$ is the number of observations. We also derive rates in sup-norm in terms of the rate of convergence in $\ell_1$-norm. The irrepresentable condition on a set $J$ requires that the $\ell_1$-norm of the worst possible sub-direction is sufficiently smaller than one. In that case sharp oracle results can be obtained. Moreover, if the coefficients in $J$ are small enough the Lasso will put these coefficients to zero. This extends known results which say that the irrepresentable condition on the inactive set (the set where coefficients are exactly zero) implies no false positives. We further show that by de-sparsifying one obtains fast rates in supremum norm without conditions on the worst possible sub-direction. The main assumption here is that approximate sparsity is of order $o (\sqrt n / \log p )$. The results are extended to M-estimation with $\ell_1$-penalty for generalized linear models and exponential families for example. For the graphical Lasso this leads to an extension of known results to the case where the precision matrix is only approximately sparse. The bounds we provide are non-asymptotic but we also present asymptotic formulations for ease of interpretation.

preprint · 2013 · arXiv

$\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs

We consider the problem of regularized maximum likelihood estimation for the structure and parameters of a high-dimensional, sparse directed acyclic graphical (DAG) model with Gaussian distribution, or equivalently, of a Gaussian structural equation model. We show that the $\ell_0$-penalized maximum likelihood estimator of a DAG has about the same number of edges as the minimal-edge I-MAP (a DAG with minimal number of edges representing the distribution), and that it converges in Frobenius norm. We allow the number of nodes p to be much larger than sample size n but assume a sparsity condition and that any representation of the true DAG has at least a fixed proportion of its nonzero edge weights above the noise level. Our results do not rely on the faithfulness assumption nor on the restrictive strong faithfulness condition which are required for methods based on conditional independence testing such as the PC-algorithm.

preprint · 2013 · arXiv

Confidence sets in sparse regression

The problem of constructing confidence sets in the high-dimensional linear model with $n$ response variables and $p$ parameters, possibly $p\ge n$, is considered. Full honest adaptive inference is possible if the rate of sparse estimation does not exceed $n^{-1/4}$, otherwise sparse adaptive confidence sets exist only over strict subsets of the parameter spaces for which sparse estimators exist. Necessary and sufficient conditions for the existence of confidence sets that adapt to a fixed sparsity level of the parameter vector are given in terms of minimal $\ell^2$-separation conditions on the parameter space. The design conditions cover common coherence assumptions used in models for sparsity, including (possibly correlated) sub-Gaussian designs.

preprint · 2013 · arXiv

On the uniform convergence of empirical norms and inner products, with application to causal inference

Uniform convergence of empirical norms - empirical measures of squared functions - is a topic which has received considerable attention in the literature on empirical processes. The results are relevant as empirical norms occur due to symmetrization. They also play a prominent role in statistical applications. The contraction inequality has been a main tool but recently other approaches have been shown to lead to better results in important cases. We present an overview including the linear (anisotropic) case, and give new results for inner products of functions. Our main application will be the estimation of the parental structure in a directed acyclic graph. As an intermediate result we establish convergence of the least squares estimator when the model is wrong.

preprint · 2013 · arXiv

Quasi-Likelihood and/or Robust Estimation in High Dimensions

We consider the theory for the high-dimensional generalized linear model with the Lasso. After a short review of theoretical results in the literature, we present an extension of the oracle results to the case of quasi-likelihood loss. We prove bounds for the prediction error and $\ell_1$-error. The results are derived under fourth moment conditions on the error distribution. The case of robust loss is also given. We moreover show that under an irrepresentable condition, the $\ell_1$-penalized quasi-likelihood estimator has no false positives.

preprint · 2013 · arXiv

The partial linear model in high dimensions

Partial linear models have been widely used as a flexible method for modelling linear components in conjunction with non-parametric ones. Despite the presence of the non-parametric part, the linear, parametric part can under certain conditions be estimated with parametric rate. In this paper, we consider a high-dimensional linear part. We show that it can be estimated with oracle rates, using the LASSO penalty for the linear part and a smoothness penalty for the nonparametric part.

preprint · 2012 · arXiv

Correlated variables in regression: clustering and sparse estimation

We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel and bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where cluster-representatives and using the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

preprint · 2012 · arXiv

Generic chaining and the l1-penalty

We address the choice of the tuning parameter $λ$ in $\ell_1$-penalized M-estimation. Our main concern is models which are highly nonlinear, such as the Gaussian mixture model. The number of parameters $p$ is moreover large, possibly larger than the number of observations $n$. The generic chaining technique of Talagrand [2005] is tailored for this problem. It leads to the choice $λ \asymp \sqrt {\log p / n}$, as in the standard Lasso procedure (which concerns the linear model and least squares loss).
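The order of the tuning parameter, $\sqrt{\log p / n}$, is easy to tabulate. In the sketch below the multiplicative constant is an assumption, since the abstract only gives the order, and the function name is hypothetical.

```python
import math


def tuning_parameter(n, p, constant=1.0):
    """lambda = constant * sqrt(log(p) / n); `constant` is an assumed placeholder."""
    if n <= 0 or p < 1:
        raise ValueError("need n > 0 and p >= 1")
    return constant * math.sqrt(math.log(p) / n)


# More parameters than observations: lambda grows only logarithmically in p.
lam_small_p = tuning_parameter(n=200, p=50)
lam_large_p = tuning_parameter(n=200, p=5000)
lam_more_data = tuning_parameter(n=800, p=5000)
```

Multiplying $p$ by 100 raises the value by a factor of roughly 1.5 here, whereas quadrupling $n$ halves it, which is why the choice remains usable when $p$ exceeds $n$.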

preprint · 2012 · arXiv

L1-Penalization for Mixture Regression Models

We consider a finite mixture of regressions (FMR) model for high-dimensional inhomogeneous data where the number of covariates may be much larger than sample size. We propose an l1-penalized maximum likelihood estimator in an appropriate parameterization. This kind of estimation belongs to a class of problems where optimization and theory for non-convex functions is needed. This distinguishes itself very clearly from high-dimensional estimation with convex loss- or objective functions, as for example with the Lasso in linear or generalized linear models. Mixture models represent a prime and important example where non-convexity arises. For FMR models, we develop an efficient EM algorithm for numerical optimization with provable convergence properties. Our penalized estimator is numerically better posed (e.g., boundedness of the criterion function) than unpenalized maximum likelihood estimation, and it allows for effective statistical regularization including variable selection. We also present some asymptotic theory and oracle inequalities: due to non-convexity of the negative log-likelihood function, different mathematical arguments are needed than for problems with convex losses. Finally, we apply the new method to both simulated and real data.

preprint · 2012 · arXiv

Weakly decomposable regularization penalties and structured sparsity

It has been shown in the literature that the Lasso estimator, or l1-penalized least squares estimator, enjoys good oracle properties. This paper examines which special properties of the l1-penalty allow for sharp oracle results, and then extends the situation to general norm-based penalties that satisfy a weak decomposability condition.

preprint · 2011 · arXiv

The Bernstein-Orlicz norm and deviation inequalities

We introduce two new concepts designed for the study of empirical processes. First, we introduce a new Orlicz norm which we call the Bernstein-Orlicz norm. This new norm interpolates sub-Gaussian and sub-exponential tail behavior. In particular, we show how this norm can be used to simplify the derivation of deviation inequalities for suprema of collections of random variables. Secondly, we introduce chaining and generic chaining along a tree. These simplify the well-known concepts of chaining and generic chaining. The supremum of the empirical process is then studied as a special case. We show that chaining along a tree can be done using entropy with bracketing. Finally, we establish a deviation inequality for the empirical process for the unbounded case.

preprint · 2011 · arXiv

The Lasso, correlated design, and improved oracle inequalities

We study high-dimensional linear models and the $\ell_1$-penalized least squares estimator, also known as the Lasso estimator. In the literature, oracle inequalities have been derived under restricted eigenvalue or compatibility conditions. In this paper, we complement this with entropy conditions which allow one to improve the dual norm bound, and demonstrate how this leads to new oracle inequalities. The new oracle inequalities show that a smaller choice for the tuning parameter and a trade-off between $\ell_1$-norms and small compatibility constants are possible. This implies, in particular for correlated design, improved bounds for the prediction error of the Lasso estimator as compared to the methods based on restricted eigenvalue or compatibility conditions only.

preprint · 2010 · arXiv

Estimation for High-Dimensional Linear Mixed-Effects Models Using $\ell_1$-Penalization

We propose an $\ell_1$-penalized estimation procedure for high-dimensional linear mixed-effects models. The models are useful whenever there is a grouping structure among high-dimensional observations, i.e. for clustered data. We prove a consistency and an oracle optimality result and we develop an algorithm with provable numerical convergence. Furthermore, we demonstrate the performance of the method on simulated and a real high-dimensional data set.

preprint · 2010 · arXiv

Oracle Inequalities and Optimal Inference under Group Sparsity

We consider the problem of estimating a sparse linear regression vector $β^*$ under a Gaussian noise model, for the purpose of both prediction and model selection. We assume that prior knowledge is available on the sparsity pattern, namely the set of variables is partitioned into prescribed groups, only few of which are relevant in the estimation process. This group sparsity assumption suggests considering the Group Lasso method as a means to estimate $β^*$. We establish oracle inequalities for the prediction and $\ell_2$ estimation errors of this estimator. These bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger coherence condition, we derive bounds for the estimation error for mixed $(2,p)$-norms with $1\le p\leq \infty$. When $p=\infty$, this result implies that a threshold version of the Group Lasso estimator selects the sparsity pattern of $β^*$ with high probability. Next, we prove that the rate of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor, for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds for the prediction and $\ell_2$ estimation errors of the usual Lasso estimator. Using this result, we demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation properties as compared to the Lasso.
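The mixed $(2,p)$-norms in this abstract combine an $\ell_2$-norm within each group with an $\ell_p$-norm across groups. The sketch below assumes the usual definition $\big(\sum_g \|β_g\|_2^p\big)^{1/p}$, with the maximum over groups for $p = \infty$; the function names, the toy vector and the threshold value are hypothetical.

```python
import math


def group_l2_norms(beta, groups):
    """Euclidean norm of the coefficients in each group (groups are index lists)."""
    return [math.sqrt(sum(beta[i] ** 2 for i in group)) for group in groups]


def mixed_norm(beta, groups, p):
    """Mixed (2, p)-norm: l_p norm of the vector of group-wise l_2 norms."""
    norms = group_l2_norms(beta, groups)
    if p == math.inf:
        return max(norms)
    return sum(value ** p for value in norms) ** (1.0 / p)


def selected_groups(beta, groups, threshold):
    """Indices of groups whose l_2 norm exceeds `threshold` (thresholding step)."""
    return [g for g, value in enumerate(group_l2_norms(beta, groups))
            if value > threshold]


# Toy coefficient vector with three groups, the middle one inactive.
beta = [3.0, 4.0, 0.0, 0.0, 5.0, 12.0]
groups = [[0, 1], [2, 3], [4, 5]]

norm_2_1 = mixed_norm(beta, groups, 1)
norm_2_inf = mixed_norm(beta, groups, math.inf)
active = selected_groups(beta, groups, threshold=1.0)
```

A bound on the $(2,\infty)$-norm of the estimation error controls every group at once, which is why thresholding the group norms can recover the sparsity pattern.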

preprint2010arXiv

The adaptive and the thresholded Lasso for potentially misspecified models

We revisit the adaptive Lasso as well as the thresholded Lasso with refitting, in a high-dimensional linear model, and study prediction error, $\ell_q$-error ($q \in \{1, 2\}$), and number of false positive selections. Our theoretical results for the two methods are, at a rather fine scale, comparable. The differences only show up in terms of the (minimal) restricted and sparse eigenvalues, favoring thresholding over the adaptive Lasso. As regards prediction and estimation, the difference is virtually negligible, but our bound for the number of false positives is larger for the adaptive Lasso than for thresholding. Moreover, both these two-stage methods add value to the one-stage Lasso in the sense that, under appropriate restricted and sparse eigenvalue conditions, they have similar prediction and estimation error as the one-stage Lasso, but substantially fewer false positives.

preprint2009arXiv

Nemirovski's Inequalities Revisited

Moment inequalities for sums of independent random vectors are an important tool for statistical research. Nemirovski and coworkers (1983, 2000) derived one particular type of such inequalities: For certain Banach spaces $(\mathbb{B},\|\cdot\|)$ there exists a constant $K = K(\mathbb{B},\|\cdot\|)$ such that for arbitrary independent and centered random vectors $X_1, X_2, \ldots, X_n \in \mathbb{B}$, their sum $S_n$ satisfies the inequality $E \|S_n\|^2 \le K \sum_{i=1}^n E \|X_i\|^2$. We present and compare three different approaches to obtain such inequalities: Nemirovski's results are based on deterministic inequalities for norms. Another possible vehicle is type and cotype inequalities, a tool from probability theory on Banach spaces. Finally, we use a truncation argument plus Bernstein's inequality to obtain another version of the moment inequality above. Interestingly, all three approaches have their own merits.
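As a small numerical illustration of the moment inequality above (not taken from the paper), the sketch below computes $E\|S_n\|^2$ exactly for $S_n = \sum_i \varepsilon_i v_i$, where the $v_i$ are fixed vectors and the $\varepsilon_i$ are independent random signs, by enumerating all sign patterns. In the Euclidean norm the cross terms vanish, so the ratio $E\|S_n\|^2 / \sum_i E\|X_i\|^2$ equals 1, i.e. $K = 1$ suffices there. For the sup norm the ratio is only reported; no value of $K$ from the paper is assumed. The vectors and function names are hypothetical.

```python
from itertools import product


def l2_norm(x):
    return sum(t * t for t in x) ** 0.5


def sup_norm(x):
    return max(abs(t) for t in x)


def second_moment_of_sum(vectors, norm):
    """Exact E||S_n||^2 for S_n = sum_i eps_i * v_i with independent random signs."""
    n = len(vectors)
    dim = len(vectors[0])
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        s = [sum(e * v[j] for e, v in zip(signs, vectors)) for j in range(dim)]
        total += norm(s) ** 2
    return total / 2 ** n  # every sign pattern has probability 2^(-n)


def moment_ratio(vectors, norm):
    """Ratio of E||S_n||^2 to sum_i E||X_i||^2, where X_i = eps_i * v_i."""
    lhs = second_moment_of_sum(vectors, norm)
    rhs = sum(norm(v) ** 2 for v in vectors)  # ||eps_i * v_i|| = ||v_i||
    return lhs / rhs


# Hypothetical example vectors in R^3.
vectors = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [3.0, 1.0, -1.0], [0.0, 2.0, 2.0]]
ratio_l2 = moment_ratio(vectors, l2_norm)
ratio_sup = moment_ratio(vectors, sup_norm)
print("l2 ratio:", ratio_l2)
print("sup-norm ratio:", ratio_sup)
```

The enumeration costs $2^n$ evaluations, so it is only meant for tiny examples. Since the sup norm is dominated by the Euclidean norm, the sup-norm ratio in dimension $d$ cannot exceed $d$; the interest of Nemirovski-type inequalities lies in constants that grow much more slowly with the dimension.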

preprint2009arXiv

Taking Advantage of Sparsity in Multi-Task Learning

We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning (Argyriou et al. [2008]), we assume that the regression vectors share the same sparsity pattern. This means that the set of relevant predictor variables is the same across the different equations. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work by Bickel et al. [2007] and Lounici [2008]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.

38 published item(s)

Adaptive Rates for Total Variation Image Denoising

Deep ReLU Programming

Tensor denoising with trend filtering

A Framework for the construction of upper bounds on the number of affine linear regions of ReLU feed-forward neural networks

Logistic regression with total variation regularization

Prediction bounds for higher order total variation regularized least squares

Oracle inequalities for square root analysis estimators with application to total variation penalties

Concentration behavior of the penalized least squares estimator

Honest confidence regions and optimality in high-dimensional precision matrix estimation

On concentration for (regularized) empirical risk minimization

Sharp Oracle Inequalities for Square Root Regularization

$χ^2$-confidence sets in high-dimensional regression

Confidence intervals for high-dimensional inverse covariance estimation

High-dimensional inference in misspecified linear models

Censored linear model in high dimensions

Discussion: "A significance test for the lasso"

New concentration inequalities for suprema of empirical processes

On asymptotically optimal confidence regions and tests for high-dimensional models

On higher order isotropy conditions and lower bounds for sparse quadratic forms

Statistical Theory for High-Dimensional Models

The additive model with different smoothness for the components

Worst possible sub-directions in high-dimensional models

$\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs

Confidence sets in sparse regression

On the uniform convergence of empirical norms and inner products, with application to causal inference

Quasi-Likelihood and/or Robust Estimation in High Dimensions

The partial linear model in high dimensions

Correlated variables in regression: clustering and sparse estimation

Generic chaining and the l1-penalty

L1-Penalization for Mixture Regression Models

Weakly decomposable regularization penalties and structured sparsity

The Bernstein-Orlicz norm and deviation inequalities

The Lasso, correlated design, and improved oracle inequalities

Estimation for High-Dimensional Linear Mixed-Effects Models Using $\ell_1$-Penalization

Oracle Inequalities and Optimal Inference under Group Sparsity

The adaptive and the thresholded Lasso for potentially misspecified models

Nemirovski's Inequalities Revisited

Taking Advantage of Sparsity in Multi-Task Learning