Source author record

Karim Lounici

Karim Lounici appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

math.ST, Statistics Theory, Machine Learning, Computation, Artificial Intelligence, eess.SP, math.OC, math.PR, quant-ph

Catalog footprint

23 works, 9 topics, 4 close collaborators

preprint, 2026, arXiv

Geometric Dictionary Learning of Dynamical Systems with Optimal Transport

Learning dynamical systems through operator-theoretic representations provides a powerful framework for analyzing complex dynamics, as spectral quantities such as eigenvalues and invariant structures encode characteristic time scales and long-term behavior. However, dynamical operators are typically estimated independently for each system, preventing the discovery of shared structure across related dynamics. To address this limitation, we posit that related dynamical systems lie near a low-dimensional manifold in spectral operator space. Based on this hypothesis, we introduce DOODL (Dynamical OperatOr Dictionary Learning), a framework that learns a dictionary of characteristic spectral dynamics whose combinations approximate this manifold and yield compact, interpretable embeddings of individual systems. Beyond representation learning, DOODL enables fast and interpretable operator estimation from short and partially observed trajectories by constraining the estimation to the learned operator manifold. Experiments on metastable Langevin dynamics and turbulent plasma simulations demonstrate that DOODL scales to highly complex multiscale regimes while capturing characteristic spectral structure governing the dynamics rather than merely fitting trajectories, achieving errors one to two orders of magnitude lower than independent operator estimation methods in challenging low-data regimes.

preprint, 2026, arXiv

kooplearn: A Scikit-Learn Compatible Library of Algorithms for Evolution Operator Learning

kooplearn is a machine-learning library that implements linear, kernel, and deep-learning estimators of dynamical operators and their spectral decompositions. kooplearn can model both discrete-time evolution operators (Koopman/Transfer) and continuous-time infinitesimal generators. By learning these operators, users can analyze dynamical systems via spectral methods, derive data-driven reduced-order models, and forecast future states and observables. kooplearn's interface is compliant with the scikit-learn API, facilitating its integration into existing machine learning and data science workflows. Additionally, kooplearn includes curated benchmark datasets to support experimentation, reproducibility, and the fair comparison of learning algorithms. The software is available at https://github.com/Machine-Learning-Dynamical-Systems/kooplearn.
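To make the idea of a linear estimator of a discrete-time evolution operator concrete, here is a minimal NumPy sketch. It does not use kooplearn's API; the function name `fit_linear_evolution_operator`, the ridge parameter `reg` and the toy matrix `A` are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_linear_evolution_operator(X, Y, reg=1e-8):
    """Ridge-regularized least-squares estimate of K such that Y ≈ X @ K.

    X holds states x_t as rows, Y holds the successor states x_{t+1}.
    This is a plain NumPy illustration, not kooplearn's estimator.
    """
    n, d = X.shape
    gram = X.T @ X + n * reg * np.eye(d)
    return np.linalg.solve(gram, X.T @ Y)

# Hypothetical toy system x_{t+1} = A x_t, observed from random initial states.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])
X = rng.normal(size=(500, 2))
Y = X @ A.T                      # one-step evolution of every row of X

K = fit_linear_evolution_operator(X, Y)
# With the row convention Y ≈ X @ K, K estimates A.T, so both share eigenvalues.
eigvals = np.sort(np.linalg.eigvals(K).real)
```

The eigenvalues of the fitted operator are the spectral quantities the abstract refers to; in this noiseless toy case they should sit close to the eigenvalues of `A`.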

preprint, 2022, arXiv

AdaCap: Adaptive Capacity control for Feed-Forward Neural Networks

The capacity of an ML model refers to the range of functions this model can approximate. It impacts both the complexity of the patterns a model can learn and memorization, the ability of a model to fit arbitrary labels. We propose Adaptive Capacity (AdaCap), a training scheme for Feed-Forward Neural Networks (FFNN). AdaCap optimizes the capacity of an FFNN so that it can capture the high-level abstract representations underlying the problem at hand without memorizing the training dataset. AdaCap is the combination of two novel ingredients, the Muddling labels for Regularization (MLR) loss and the Tikhonov operator training scheme. The MLR loss leverages randomly generated labels to quantify the propensity of a model to memorize. We prove that the MLR loss is an accurate in-sample estimator for out-of-sample generalization performance and that it can be used to perform Hyper-Parameter Optimization provided a Signal-to-Noise Ratio condition is met. The Tikhonov operator training scheme modulates the capacity of an FFNN in an adaptive, differentiable and data-dependent manner. We assess the effectiveness of AdaCap in a setting where DNNs are typically prone to memorization, small tabular datasets, and benchmark its performance against popular machine learning methods.

preprint, 2022, arXiv

Meta Representation Learning with Contextual Linear Bandits

Meta-learning seeks to build algorithms that rapidly learn how to solve new learning problems based on previous experience. In this paper we investigate meta-learning in the setting of stochastic linear bandit tasks. We assume that the tasks share a low dimensional representation, which has been partially acquired from previous learning tasks. We aim to leverage this information in order to learn a new downstream bandit task, which shares the same representation. Our principal contribution is to show that if the learned representation estimates the unknown one well, then the downstream task can be efficiently learned by a greedy policy that we propose in this work. We derive an upper bound on the regret of this policy, which is, up to logarithmic factors, of order $r\sqrt{N}(1\vee \sqrt{d/T})$, where $N$ is the horizon of the downstream task, $T$ is the number of training tasks, $d$ the ambient dimension and $r \ll d$ the dimension of the representation. We highlight that our strategy does not need to know $r$. We note that if $T> d$ our bound achieves the same rate as minimax-optimal bandit algorithms using the true underlying representation. Our analysis is inspired by and builds in part upon previous work on meta-learning in the i.i.d. full information setting (Tripuraneni et al., 2021; Boursier et al., 2022). As a separate contribution we show how to relax certain assumptions in those works, thereby improving their representation learning and risk analysis.
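As a worked reading of the regret rate stated in this abstract (logarithmic factors omitted, notation as in the abstract), the maximum splits into two regimes depending on how the number of training tasks $T$ compares with the ambient dimension $d$:

```latex
r\sqrt{N}\,\bigl(1 \vee \sqrt{d/T}\bigr) =
\begin{cases}
r\sqrt{N}, & T \ge d,\\[4pt]
r\sqrt{N d / T}, & T < d.
\end{cases}
```

For instance, with the purely illustrative values $r=5$, $N=10^4$, $d=100$ and $T=25$, we get $\sqrt{d/T}=2$ and a rate of order $5\cdot 100\cdot 2 = 1000$, whereas any $T\ge d$ gives order $500$.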

preprint, 2022, arXiv

Sliding window strategy for convolutional spike sorting with Lasso : Algorithm, theoretical guarantees and complexity

Spike sorting is a class of algorithms used in neuroscience to attribute the time occurrences of particular electric signals, called action potentials or spikes, to neurons. We rephrase this problem as a particular optimization problem: Lasso for convolutional models in high dimension. Lasso (i.e. least absolute shrinkage and selection operator) is a very generic tool in machine learning that helps us look for sparse solutions (here the time occurrences). However, for the size of the problem at hand in this neuroscience context, the classical Lasso solvers fail. We present here a new and much faster algorithm. Making use of biological properties related to neurons, we explain how the particular structure of the problem allows several optimizations, leading to an algorithm with a temporal complexity which grows linearly with respect to the size of the recorded signal and can be performed online. Moreover, the spatial separability of the initial problem allows us to break it into subproblems, further reducing the complexity and making possible its application on the latest recording devices, which comprise a large number of sensors. We provide several mathematical results: the size and numerical complexity of the subproblems can be estimated mathematically by using percolation theory. We also show under reasonable assumptions that the Lasso estimator retrieves the true time occurrences of the spikes with large probability. Finally, the theoretical time complexity of the algorithm is given. Numerical simulations are also provided in order to illustrate the efficiency of our approach.

preprint, 2021, arXiv

Muddling Labels for Regularization, a novel approach to generalization

Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out validation dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, which is based on a new risk measure which directly quantifies a model's tendency to overfit. To fully understand the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y=Xβ+ξ$) where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity, ...). Notably, these procedures concomitantly train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-Net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $β$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.

preprint, 2020, arXiv

Optimizing generalization on the train set: a novel gradient-based framework to train parameters and hyperparameters simultaneously

Generalization is a central problem in Machine Learning. Most prediction methods require careful calibration of hyperparameters carried out on a hold-out validation dataset to achieve generalization. The main goal of this paper is to present a novel approach based on a new measure of risk that allows us to develop novel fully automatic procedures for generalization. We illustrate the pertinence of this new framework in the regression problem. The main advantages of this new approach are: (i) it can simultaneously train the model and perform regularization in a single run of a gradient-based optimizer on all available data without any previous hyperparameter tuning; (ii) this framework can tackle several additional objectives simultaneously (correlation, sparsity, ...) via the introduction of regularization parameters. Notably, our approach transforms hyperparameter tuning as well as feature selection (a combinatorial discrete optimization problem) into a continuous optimization problem that is solvable via classical gradient-based methods; (iii) the computational complexity of our methods is $O(npK)$ where $n,p,K$ denote respectively the number of observations, features and iterations of the gradient descent algorithm. We observe in our experiments a significantly smaller runtime for our methods as compared to benchmark methods for equivalent prediction score. Our procedures are implemented in PyTorch (code is available for replication).

preprint, 2016, arXiv

New asymptotic results in principal component analysis

Let $X$ be a mean zero Gaussian random vector in a separable Hilbert space ${\mathbb H}$ with covariance operator $Σ:={\mathbb E}(X\otimes X).$ Let $Σ=\sum_{r\geq 1}μ_r P_r$ be the spectral decomposition of $Σ$ with distinct eigenvalues $μ_1>μ_2> \dots$ and the corresponding spectral projectors $P_1, P_2, \dots.$ Given a sample $X_1,\dots, X_n$ of size $n$ of i.i.d. copies of $X,$ the sample covariance operator is defined as $\hat Σ_n := n^{-1}\sum_{j=1}^n X_j\otimes X_j.$ The main goal of principal component analysis is to estimate spectral projectors $P_1, P_2, \dots$ by their empirical counterparts $\hat P_1, \hat P_2, \dots$ properly defined in terms of spectral decomposition of the sample covariance operator $\hat Σ_n.$ The aim of this paper is to study asymptotic distributions of important statistics related to this problem, in particular, of statistic $\|\hat P_r-P_r\|_2^2,$ where $\|\cdot\|_2^2$ is the squared Hilbert--Schmidt norm. This is done in a "high-complexity" asymptotic framework in which the so-called effective rank ${\bf r}(Σ):=\frac{{\rm tr}(Σ)}{\|Σ\|_{\infty}}$ (${\rm tr}(\cdot)$ being the trace and $\|\cdot\|_{\infty}$ being the operator norm) of the true covariance $Σ$ is becoming large simultaneously with the sample size $n,$ but ${\bf r}(Σ)=o(n)$ as $n\to\infty.$ In this setting, we prove that, in the case of one-dimensional spectral projector $P_r,$ the properly centered and normalized statistic $\|\hat P_r-P_r\|_2^2$ with data-dependent centering and normalization converges in distribution to a Cauchy type limit. The proofs of this and other related results rely on perturbation analysis and Gaussian concentration.
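The quantities defined in this abstract are easy to compute in finite dimension. The sketch below evaluates the effective rank ${\bf r}(Σ)={\rm tr}(Σ)/\|Σ\|_{\infty}$ and the squared Hilbert--Schmidt error $\|\hat P_1-P_1\|_2^2$ for the leading projector. The spectrum $μ_j = 1/j$, the dimension `p` and the sample size `n` are assumptions chosen only for illustration, and the helper names are hypothetical.

```python
import numpy as np

def effective_rank(sigma):
    """r(Sigma) = tr(Sigma) / ||Sigma||_op for a symmetric PSD matrix."""
    eigvals = np.linalg.eigvalsh(sigma)
    return eigvals.sum() / eigvals.max()

def top_projector(sigma):
    """Rank-one spectral projector onto the leading eigenvector of sigma."""
    _, vecs = np.linalg.eigh(sigma)   # eigenvalues come back in ascending order
    u = vecs[:, -1]
    return np.outer(u, u)

# Hypothetical finite-dimensional setting with distinct eigenvalues mu_j = 1/j.
rng = np.random.default_rng(0)
p, n = 50, 2000
mu = 1.0 / np.arange(1, p + 1)
sigma = np.diag(mu)

# Centered Gaussian sample with covariance sigma, and its sample covariance.
X = rng.normal(size=(n, p)) * np.sqrt(mu)
sigma_hat = X.T @ X / n

r = effective_rank(sigma)
hs_err_sq = np.linalg.norm(top_projector(sigma_hat) - top_projector(sigma), "fro") ** 2
```

Here `r` equals the 50th harmonic number because the operator norm is 1, and `hs_err_sq` is one random draw of the statistic whose limiting distribution the paper studies.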

preprint, 2016, arXiv

Nuclear norm penalization and optimal rates for noisy low rank matrix completion

This paper deals with the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We propose a new nuclear norm penalized estimator of $A_0$ and establish a general sharp oracle inequality for this estimator for arbitrary values of $n,m_1,m_2$ under the condition of isometry in expectation. Then this method is applied to the matrix completion problem. In this case, the estimator admits a simple explicit form and we prove that it satisfies oracle inequalities with faster rates of convergence than in the previous works. They are valid, in particular, in the high-dimensional setting $m_1m_2\gg n$. We show that the obtained rates are optimal up to logarithmic factors in a minimax sense and also derive, for any fixed matrix $A_0$, a non-minimax lower bound on the rate of convergence of our estimator, which coincides with the upper bound up to a constant factor. Finally, we show that our procedure provides an exact recovery of the rank of $A_0$ with probability close to 1. We also discuss the statistical learning setting where there is no underlying model determined by $A_0$ and the aim is to find the best trace regression model approximating the data.
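For intuition about why nuclear norm penalization can yield an explicit estimator, recall a standard fact: the minimizer of $\frac12\|A-M\|_F^2+τ\|A\|_*$ over $A$ is obtained by soft-thresholding the singular values of $M$. The sketch below applies that operation to a fully observed noisy low-rank matrix. It illustrates the shrinkage step only and is not claimed to reproduce the paper's estimator or its tuning; the sizes, noise level and threshold `tau` are assumptions.

```python
import numpy as np

def svd_soft_threshold(M, tau):
    """Shrink every singular value of M by tau, clipping at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# Hypothetical rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
m1, m2, rank = 30, 20, 2
A0 = rng.normal(size=(m1, rank)) @ rng.normal(size=(rank, m2))
noisy = A0 + 0.1 * rng.normal(size=(m1, m2))

# tau is chosen above the typical operator norm of the noise (assumption).
A_hat = svd_soft_threshold(noisy, tau=2.0)
est_rank = np.linalg.matrix_rank(A_hat, tol=1e-8)
```

With a threshold above the noise level, the noise-only singular values are set to zero, which is the mechanism behind rank recovery statements of the kind made in the abstract.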

preprint, 2016, arXiv

Robust Matrix Completion

This paper considers the problem of recovery of a low-rank matrix in the situation when most of its entries are not observed and a fraction of observed entries are corrupted. The observations are noisy realizations of the sum of a low rank matrix, which we wish to recover, with a second matrix having a complementary sparse structure such as element-wise or column-wise sparsity. We analyze a class of estimators obtained by solving a constrained convex optimization problem that combines the nuclear norm and a convex relaxation for a sparse constraint. Our results are obtained for the simultaneous presence of random and deterministic patterns in the sampling scheme. We provide guarantees for recovery of low-rank and sparse components from partial and corrupted observations in the presence of noise and show that the obtained rates of convergence are minimax optimal.

preprint, 2015, arXiv

Asymptotics and Concentration Bounds for Bilinear Forms of Spectral Projectors of Sample Covariance

Let $X,X_1,\dots, X_n$ be i.i.d. Gaussian random variables with zero mean and covariance operator $Σ={\mathbb E}(X\otimes X)$ taking values in a separable Hilbert space ${\mathbb H}.$ Let $$ {\bf r}(Σ):=\frac{{\rm tr}(Σ)}{\|Σ\|_{\infty}} $$ be the effective rank of $Σ,$ ${\rm tr}(Σ)$ being the trace of $Σ$ and $\|Σ\|_{\infty}$ being its operator norm. Let $$\hat Σ_n:=n^{-1}\sum_{j=1}^n (X_j\otimes X_j)$$ be the sample (empirical) covariance operator based on $(X_1,\dots, X_n).$ The paper deals with a problem of estimation of spectral projectors of the covariance operator $Σ$ by their empirical counterparts, the spectral projectors of $\hat Σ_n$ (empirical spectral projectors). The focus is on the problems where both the sample size $n$ and the effective rank ${\bf r}(Σ)$ are large. This framework includes and generalizes well known high-dimensional spiked covariance models. Given a spectral projector $P_r$ corresponding to an eigenvalue $μ_r$ of covariance operator $Σ$ and its empirical counterpart $\hat P_r,$ we derive sharp concentration bounds for bilinear forms of empirical spectral projector $\hat P_r$ in terms of sample size $n$ and effective dimension ${\bf r}(Σ).$ Building upon these concentration bounds, we prove the asymptotic normality of bilinear forms of random operators $\hat P_r -{\mathbb E}\hat P_r$ under the assumptions that $n\to \infty$ and ${\bf r}(Σ)=o(n).$ In a special case of eigenvalues of multiplicity one, these results are rephrased as concentration bounds and asymptotic normality for linear forms of empirical eigenvectors. Other results include bounds on the bias ${\mathbb E}\hat P_r-P_r$ and a method of bias reduction as well as a discussion of possible applications to statistical inference in high-dimensional principal component analysis.

preprint, 2015, arXiv

Estimation of Low-Rank Covariance Function

We consider the problem of estimating a low rank covariance function $K(t,u)$ of a Gaussian process $S(t), t\in [0,1]$ based on $n$ i.i.d. copies of $S$ observed in white noise. We suggest a new estimation procedure adapting simultaneously to the low rank structure and the smoothness of the covariance function. The new procedure is based on nuclear norm penalization and outperforms the sample covariance function by a polynomial factor in the sample size $n$. Other results include a minimax lower bound for estimation of low-rank covariance functions showing that our procedure is optimal as well as a scheme to estimate the unknown noise variance of the Gaussian process.

preprint, 2015, arXiv

Minimax and adaptive estimation of the Wigner function in quantum homodyne tomography with noisy data

In quantum optics, the quantum state of a light beam is represented through the Wigner function, a density on $\mathbb R^2$ which may take negative values but must respect intrinsic positivity constraints imposed by quantum physics. In the framework of noisy quantum homodyne tomography with efficiency parameter $1/2 < η\leq 1$, we study the theoretical performance of a kernel estimator of the Wigner function. We prove that it is minimax efficient, up to a logarithmic factor in the sample size, for the $\mathbb L_\infty$-risk over a class of infinitely differentiable functions. We also compute the lower bound for the $\mathbb L_2$-risk. We construct an adaptive estimator, i.e. one which does not depend on the smoothness parameters, and prove that it attains the minimax rates for the corresponding smoothness class of functions. The finite-sample behaviour of our adaptive procedure is explored through numerical experiments.

preprint, 2015, arXiv

Normal approximation and concentration of spectral projectors of sample covariance

Let $X,X_1,\dots, X_n$ be i.i.d. Gaussian random variables in a separable Hilbert space ${\mathbb H}$ with zero mean and covariance operator $Σ={\mathbb E}(X\otimes X),$ and let $\hat Σ:=n^{-1}\sum_{j=1}^n (X_j\otimes X_j)$ be the sample (empirical) covariance operator based on $(X_1,\dots, X_n).$ Denote by $P_r$ the spectral projector of $Σ$ corresponding to its $r$-th eigenvalue $μ_r$ and by $\hat P_r$ the empirical counterpart of $P_r.$ The main goal of the paper is to obtain tight bounds on $$ \sup_{x\in {\mathbb R}} \left|{\mathbb P}\left\{\frac{\|\hat P_r-P_r\|_2^2-{\mathbb E}\|\hat P_r-P_r\|_2^2}{{\rm Var}^{1/2}(\|\hat P_r-P_r\|_2^2)}\leq x\right\}-Φ(x)\right|, $$ where $\|\cdot\|_2$ denotes the Hilbert--Schmidt norm and $Φ$ is the standard normal distribution function. Such accuracy of normal approximation of the distribution of squared Hilbert--Schmidt error is characterized in terms of so called effective rank of $Σ$ defined as ${\bf r}(Σ)=\frac{{\rm tr}(Σ)}{\|Σ\|_{\infty}},$ where ${\rm tr}(Σ)$ is the trace of $Σ$ and $\|Σ\|_{\infty}$ is its operator norm, as well as another parameter characterizing the size of ${\rm Var}(\|\hat P_r-P_r\|_2^2).$ Other results include non-asymptotic bounds and asymptotic representations for the mean squared Hilbert--Schmidt norm error ${\mathbb E}\|\hat P_r-P_r\|_2^2$ and the variance ${\rm Var}(\|\hat P_r-P_r\|_2^2),$ and concentration inequalities for $\|\hat P_r-P_r\|_2^2$ around its expectation.

preprint, 2014, arXiv

Concentration Inequalities and Moment Bounds for Sample Covariance Operators

Let $X,X_1,\dots, X_n,\dots$ be i.i.d. centered Gaussian random variables in a separable Banach space $E$ with covariance operator $Σ:$ $$ Σ:E^{\ast}\mapsto E,\ \ Σu = {\mathbb E}\langle X,u\rangle X, u\in E^{\ast}. $$ The sample covariance operator $\hat Σ:E^{\ast}\mapsto E$ is defined as $$ \hat Σu := n^{-1}\sum_{j=1}^n \langle X_j,u\rangle X_j, u\in E^{\ast}. $$ The goal of the paper is to obtain concentration inequalities and expectation bounds for the operator norm $\|\hat Σ-Σ\|$ of the deviation of the sample covariance operator from the true covariance operator. In particular, it is shown that $$ {\mathbb E}\|\hat Σ-Σ\|\asymp \|Σ\|\biggl(\sqrt{\frac{{\bf r}(Σ)}{n}}\bigvee \frac{{\bf r}(Σ)}{n}\biggr), $$ where $$ {\bf r}(Σ):=\frac{\Bigl({\mathbb E}\|X\|\Bigr)^2}{\|Σ\|}. $$ Moreover, under the assumption that ${\bf r}(Σ)\lesssim n,$ it is proved that, for all $t\geq 1,$ with probability at least $1-e^{-t}$ \begin{align*} \Bigl|\|\hatΣ- Σ\|-{\mathbb E}\|\hatΣ- Σ\|\Bigr| \lesssim \|Σ\|\biggl(\sqrt{\frac{t}{n}}\bigvee \frac{t}{n}\biggr). \end{align*}

preprint, 2012, arXiv

High-dimensional covariance matrix estimation with missing observations

In this paper, we study the problem of high-dimensional approximately low-rank covariance matrix estimation with missing observations. We propose a simple procedure that is computationally tractable in high dimension and does not require imputation of the missing data. We establish non-asymptotic sparsity oracle inequalities for the estimation of the covariance matrix with the Frobenius and spectral norms, valid for any setting of the sample size and the dimension of the observations. We further establish minimax lower bounds showing that our rates are minimax optimal up to a logarithmic factor.
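One way to see how covariance estimation can avoid imputation is a simple debiasing argument. Assume, purely for illustration, centered data whose coordinates are each observed independently with a known probability `delta` and set to zero otherwise. The zero-filled sample covariance then has expectation $δ\,Σ_{ii}$ on the diagonal and $δ^2 Σ_{ij}$ off the diagonal, so rescaling the two parts separately removes the bias. The sketch below implements only this correction; the missingness model and all names are assumptions, and any further low-rank step of the paper's procedure is not shown.

```python
import numpy as np

def corrected_covariance(Y, delta):
    """Debias the covariance of zero-filled data observed with probability delta.

    Diagonal entries are rescaled by 1/delta, off-diagonal ones by 1/delta**2.
    """
    n = Y.shape[0]
    sigma_delta = Y.T @ Y / n
    diag = np.diag(np.diag(sigma_delta))
    return (1 / delta - 1 / delta**2) * diag + sigma_delta / delta**2

# Hypothetical approximately low-rank covariance and Bernoulli(delta) mask.
rng = np.random.default_rng(0)
n, p, delta = 20000, 5, 0.7
L = rng.normal(size=(p, 2))
sigma = L @ L.T + 0.1 * np.eye(p)

X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
mask = rng.random(size=(n, p)) < delta
Y = X * mask                      # missing entries are replaced by zero

sigma_tilde = corrected_covariance(Y, delta)
err_corrected = np.linalg.norm(sigma_tilde - sigma)
err_naive = np.linalg.norm(Y.T @ Y / n - sigma)
```

Comparing `err_corrected` with `err_naive` shows the effect of the rescaling: the naive zero-filled covariance shrinks every entry, while the corrected version is unbiased under the stated missingness assumption.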

preprint, 2012, arXiv

Sparse Principal Component Analysis with missing observations

In this paper, we study the problem of sparse Principal Component Analysis (PCA) in the high-dimensional setting with missing observations. Our goal is to estimate the first principal component when we only have access to partial observations. Existing estimation techniques are usually derived for fully observed data sets and require prior knowledge of the sparsity of the first principal component in order to achieve good statistical guarantees. Our contribution is threefold. First, we establish the first information-theoretic lower bound for the sparse PCA problem with missing observations. Second, we propose a simple procedure that does not require any prior knowledge of the sparsity of the unknown first principal component or any imputation of the missing observations, adapts to the unknown sparsity of the first principal component and achieves the optimal rate of estimation up to a logarithmic factor. Third, if the covariance matrix of interest admits a sparse first principal component and is in addition approximately low-rank, then we can derive a completely data-driven procedure that is computationally tractable in high dimension, adaptive to the unknown sparsity of the first principal component and statistically optimal (up to a logarithmic factor).

preprint, 2012, arXiv

Variable Selection with Exponential Weights and $l_0$-Penalization

In the context of a linear model with a sparse coefficient vector, exponential weights methods have been shown to achieve oracle inequalities for prediction. We show that such methods also succeed at variable selection and estimation under the necessary identifiability condition on the design matrix, instead of much stronger assumptions required by other methods such as the Lasso or the Dantzig Selector. The same analysis yields consistency results for Bayesian methods and BIC-type variable selection under similar conditions.

preprint, 2011, arXiv

Global uniform risk bounds for wavelet deconvolution estimators

We consider the statistical deconvolution problem where one observes $n$ replications from the model $Y=X+ε$, where $X$ is the unobserved random signal of interest and $ε$ is an independent random error with distribution $ϕ$. Under weak assumptions on the decay of the Fourier transform of $ϕ,$ we derive upper bounds for the finite-sample sup-norm risk of wavelet deconvolution density estimators $f_n$ for the density $f$ of $X$, where $f:\mathbb{R}\to \mathbb{R}$ is assumed to be bounded. We then derive lower bounds for the minimax sup-norm risk over Besov balls in this estimation problem and show that wavelet deconvolution density estimators attain these bounds. We further show that linear estimators adapt to the unknown smoothness of $f$ if the Fourier transform of $ϕ$ decays exponentially and that a corresponding result holds true for the hard thresholding wavelet estimator if $ϕ$ decays polynomially. We also analyze the case where $f$ is a "supersmooth"/analytic density. We finally show how our results and recent techniques from Rademacher processes can be applied to construct global confidence bands for the density $f$.

preprint, 2011, arXiv

Optimal spectral norm rates for noisy low-rank matrix completion

In this paper we consider the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We establish for the nuclear-norm penalized estimator of $A_0$ introduced in \cite{KLT} a general sharp oracle inequality with the spectral norm for arbitrary values of $n,m_1,m_2$ under an incoherence condition on the sampling distribution $Π$ of the observed entries. Then, we apply this method to the matrix completion problem. In this case, we prove that it satisfies an optimal oracle inequality for the spectral norm, thus improving upon the only existing result \cite{KLT} concerning the spectral norm, which assumes that the sampling distribution is uniform. Note that our result is valid, in particular, in the high-dimensional setting $m_1m_2\gg n$. Finally we show that the obtained rate is optimal up to logarithmic factors in a minimax sense.

preprint, 2011, arXiv

Pac-bayesian bounds for sparse regression estimation with exponential weights

We consider the sparse regression model where the number of parameters $p$ is larger than the sample size $n$. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performance. The BIC estimator for instance performs well from the statistical point of view \cite{BTW07} but can only be computed for values of $p$ of at most a few tens. The Lasso estimator is the solution of a convex minimization problem, hence computable for large values of $p$. However, stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov \cite{arnak} propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large $p$ and satisfies nice statistical properties under weak assumptions on the design. However, \cite{arnak} proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of \cite{arnak} but with improved statistical performance. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of the exponential weight estimator. We also propose an MCMC method to compute our estimator for reasonably large values of $p$.

preprint, 2010, arXiv

Oracle Inequalities and Optimal Inference under Group Sparsity

We consider the problem of estimating a sparse linear regression vector $β^*$ under a Gaussian noise model, for the purpose of both prediction and model selection. We assume that prior knowledge is available on the sparsity pattern, namely the set of variables is partitioned into prescribed groups, only few of which are relevant in the estimation process. This group sparsity assumption suggests considering the Group Lasso method as a means to estimate $β^*$. We establish oracle inequalities for the prediction and $\ell_2$ estimation errors of this estimator. These bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger coherence condition, we derive bounds for the estimation error for mixed $(2,p)$-norms with $1\le p\leq \infty$. When $p=\infty$, this result implies that a threshold version of the Group Lasso estimator selects the sparsity pattern of $β^*$ with high probability. Next, we prove that the rate of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor, for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds for the prediction and $\ell_2$ estimation errors of the usual Lasso estimator. Using this result, we demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation properties as compared to the Lasso.
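The way the Group Lasso penalty acts on prescribed groups can be illustrated with its proximal operator, which shrinks each group's Euclidean norm and zeroes out whole groups at once. The sketch below is a generic building block used in proximal-gradient solvers, not the estimator analysis of the paper; the function name, the toy vector `beta` and the threshold `tau` are hypothetical.

```python
import numpy as np

def group_soft_threshold(beta, groups, tau):
    """Proximal operator of tau * sum_g ||beta_g||_2 for disjoint index groups."""
    out = np.zeros_like(beta, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(beta[idx])
        if norm > tau:
            # Shrink the whole group towards zero by a common factor.
            out[idx] = (1.0 - tau / norm) * beta[idx]
        # Otherwise the entire group stays at zero.
    return out

# Hypothetical coefficient vector split into three groups of two variables.
beta = np.array([3.0, 4.0, 0.1, -0.2, 0.0, 1.0])
groups = [[0, 1], [2, 3], [4, 5]]
shrunk = group_soft_threshold(beta, groups, tau=1.0)
```

In this toy case only the first group has norm above `tau`, so it is scaled by 0.8 while the other two groups are set to zero together, which is the group-level selection behaviour the abstract relies on.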

preprint, 2009, arXiv

Taking Advantage of Sparsity in Multi-Task Learning

We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning (Argyriou et al. [2008]), we assume that the regression vectors share the same sparsity pattern. This means that the set of relevant predictor variables is the same across the different equations. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work in Bickel et al. [2007] and Lounici [2008]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.
