Source author record

Pierre Alquier

Pierre Alquier appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.ST Statistics Theory Machine Learning Methodology Computation Artificial Intelligence Applications math.OC math.PR Quantitative Methods

Catalog footprint

What is connected

29works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Estimation of time series by Maximum Mean Discrepancy

We define two minimum distance estimators for dependent data by minimizing some approximated Maximum Mean Discrepancy distances between the true empirical distribution of observations and their assumed (parametric) model distribution. When the latter one is intractable, it is approximated by simulation, allowing to accommodate most dynamic processes with latent variables. We derive the non-asymptotic and the large sample properties of our estimators in the context of absolutely regular/beta-mixing random elements. Our simulation experiments illustrate the robustness of our procedures to model misspecification, particularly in comparison with alternative standard estimation methods.

preprint2026arXiv

Position: agentic AI orchestration should be Bayes-consistent

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.

preprint2022arXiv

Deviation inequalities for stochastic approximation by averaging

We introduce a class of Markov chains, that contains the model of stochastic approximation by averaging and non-averaging. Using martingale approximation method, we establish various deviation inequalities for separately Lipschitz functions of such a chain, with different moment conditions on some dominating random variables of martingale differences.Finally, we apply these inequalities to the stochastic approximation by averaging and empirical risk minimisation.

preprint2022arXiv

Estimation of copulas via Maximum Mean Discrepancy

This paper deals with robust inference for parametric copula models. Estimation using Canonical Maximum Likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the Maximum Mean Discrepancy (MMD) principle. We derive non-asymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on $[0,1]^d$, as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available.

preprint2022arXiv

Optimal quasi-Bayesian reduced rank regression with incomplete response

The aim of reduced rank regression is to connect multiple response variables to multiple predictors. This model is very popular, especially in biostatistics where multiple measurements on individuals can be re-used to predict multiple outputs. Unfortunately, there are often missing data in such datasets, making it difficult to use standard estimation tools. In this paper, we study the problem of reduced rank regression where the response matrix is incomplete. We propose a quasi-Bayesian approach to this problem, in the sense that the likelihood is replaced by a quasi-likelihood. We provide a tight oracle inequality, proving that our method is adaptive to the rank of the coefficient matrix. We describe a Langevin Monte Carlo algorithm for the computation of the posterior mean. Numerical comparison on synthetic and real data show that our method are competitive to the state-of-the-art where the rank is chosen by cross validation, and sometimes lead to an improvement.

preprint2022arXiv

Tight Risk Bound for High Dimensional Time Series Completion

Initially designed for independent datas, low-rank matrix completion was successfully applied in many domains to the reconstruction of partially observed high-dimensional time series. However, there is a lack of theory to support the application of these methods to dependent datas. In this paper, we propose a general model for multivariate, partially observed time series. We show that the least-square method with a rank penalty leads to reconstruction error of the same order as for independent datas. Moreover, when the time series has some additional properties such as periodicity or smoothness, the rate can actually be faster than in the independent case.

preprint2021arXiv

A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix

Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data during its entire lifetime. Although major advances have been made in the field, one recurring problem which remains unsolved is that of Catastrophic Forgetting (CF). While the issue has been extensively studied empirically, little attention has been paid from a theoretical angle. In this paper, we show that the impact of CF increases as two tasks increasingly align. We introduce a measure of task similarity called the NTK overlap matrix which is at the core of CF. We analyze common projected gradient algorithms and demonstrate how they mitigate forgetting. Then, we propose a variant of Orthogonal Gradient Descent (OGD) which leverages structure of the data through Principal Component Analysis (PCA). Experiments support our theoretical findings and show how our method can help reduce CF on classical CL datasets.

preprint2021arXiv

Non-exponentially weighted aggregation: regret bounds for unbounded loss functions

We tackle the problem of online optimization with a general, possibly unbounded, loss function. It is well known that when the loss is bounded, the exponentially weighted aggregation strategy (EWA) leads to a regret in $\sqrt{T}$ after $T$ steps. In this paper, we study a generalized aggregation strategy, where the weights no longer depend exponentially on the losses. Our strategy is based on Follow The Regularized Leader (FTRL): we minimize the expected losses plus a regularizer, that is here a $ϕ$-divergence. When the regularizer is the Kullback-Leibler divergence, we obtain EWA as a special case. Using alternative divergences enables unbounded losses, at the cost of a worst regret bound in some cases.

preprint2021arXiv

Understanding the population structure correction regression

Although genome-wide association studies (GWAS) on complex traits have achieved great successes, the current leading GWAS approaches simply perform to test each genotype-phenotype association separately for each genetic variant. Curiously, the statistical properties for using these approaches is not known when a joint model for the whole genetic variants is considered. Here we advance in GWAS in understanding the statistical properties of the "population structure correction" (PSC) approach, a standard univariate approach in GWAS. We further propose and analyse a correction to the PSC approach, termed as "corrected population correction" (CPC). Together with the theoretical results, numerical simulations show that CPC is always comparable or better than PSC, with a dramatic improvement in some special cases.

preprint2020arXiv

High dimensional VAR with low rank transition

We propose a vector auto-regressive (VAR) model with a low-rank constraint on the transition matrix. This new model is well suited to predict high-dimensional series that are highly correlated, or that are driven by a small number of hidden factors. We study estimation, prediction, and rank selection for this model in a very general setting. Our method shows excellent performances on a wide variety of simulated datasets. On macro-economic data from Giannone et al. (2015), our method is competitive with state-of-the-art methods in small dimension, and even improves on them in high dimension.

preprint2019arXiv

A Generalization Bound for Online Variational Inference

Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference ? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.

preprint2019arXiv

Exponential inequalities for nonstationary Markov Chains

Exponential inequalities are main tools in machine learning theory. To prove exponential inequalities for non i.i.d random variables allows to extend many learning techniques to these variables. Indeed, much work has been done both on inequalities and learning theory for time series, in the past 15 years. However, for the non independent case, almost all the results concern stationary time series. This excludes many important applications: for example any series with a periodic behavior is non-stationary. In this paper, we extend the basic tools of Dedecker and Fan (2015) to nonstationary Markov chains. As an application, we provide a Bernstein-type inequality, and we deduce risk bounds for the prediction of periodic autoregressive processes with an unknown period.

preprint2019arXiv

Matrix factorization for multivariate time series analysis

Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series in a small set of latent factors. However, little is known on the statistical performances of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d setting to time series. Moreover, we prove that when the series exhibit some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence.

preprint2018arXiv

Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures

Mixture models are widely used in Bayesian statistics and machine learning, in particular in computational biology, natural language processing and many other fields. Variational inference, a technique for approximating intractable posteriors thanks to optimization algorithms, is extremely popular in practice when dealing with complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. We also tackle the problem of model selection for the number of components: we study the approach already used in practice, which consists in maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results by applications to Gaussian and multinomial mixtures.

preprint2015arXiv

A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution

Bayesian methods for low-rank matrix completion with noise have been shown to be very efficient computationally. While the behaviour of penalized minimization methods is well understood both from the theoretical and computational points of view in this problem, the theoretical optimality of Bayesian estimators have not been explored yet. In this paper, we propose a Bayesian estimator for matrix completion under general sampling distribution. We also provide an oracle inequality for this estimator. This inequality proves that, whatever the rank of the matrix to be estimated, our estimator reaches the minimax-optimal rate of convergence (up to a logarithmic factor). We end the paper with a short simulation study.

preprint2015arXiv

Light and Widely Applicable MCMC: Approximate Bayesian Inference for Large Datasets

Light and Widely Applicable (LWA-) MCMC is a novel approximation of the Metropolis-Hastings kernel targeting a posterior distribution defined on a large number of observations. Inspired by Approximate Bayesian Computation, we design a Markov chain whose transition makes use of an unknown but fixed, fraction of the available data, where the random choice of sub-sample is guided by the fidelity of this sub-sample to the observed data, as measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent performance and in some cases a dramatic improvement compared to existing methodologies.

preprint2015arXiv

On the properties of variational approximations of Gibbs posteriors

The PAC-Bayesian approach is a powerful set of techniques to derive non- asymptotic risk bounds for random estimators. The corresponding optimal distribution of estimators, usually called the Gibbs posterior, is unfortunately intractable. One may sample from it using Markov chain Monte Carlo, but this is often too slow for big datasets. We consider instead variational approximations of the Gibbs posterior, which are fast to compute. We undertake a general study of the properties of such approximations. Our main finding is that such a variational approximation has often the same rate of convergence as the original PAC-Bayesian procedure it approximates. We specialise our results to several learning tasks (classification, ranking, matrix completion),discuss how to implement a variational approximation in each case, and illustrate the good properties of said approximation on real datasets.

preprint2014arXiv

Bayesian matrix completion: prior specification

Low-rank matrix estimation from incomplete measurements recently received increased attention due to the emergence of several challenging applications, such as recommender systems; see in particular the famous Netflix challenge. While the behaviour of algorithms based on nuclear norm minimization is now well understood, an as yet unexplored avenue of research is the behaviour of Bayesian algorithms in this context. In this paper, we briefly review the priors used in the Bayesian literature for matrix completion. A standard approach is to assign an inverse gamma prior to the singular values of a certain singular value decomposition of the matrix of interest; this prior is conjugate. However, we show that two other types of priors (again for the singular values) may be conjugate for this model: a gamma prior, and a discrete prior. Conjugacy is very convenient, as it makes it possible to implement either Gibbs sampling or Variational Bayes. Interestingly enough, the maximum a posteriori for these different priors is related to the nuclear norm minimization problems. We also compare all these priors on simulated datasets, and on the classical MovieLens and Netflix datasets.

preprint2014arXiv

PAC-Bayesian AUC classification and scoring

We develop a scoring and classification procedure based on the PAC-Bayesian approach and the AUC (Area Under Curve) criterion. We focus initially on the class of linear score functions. We derive PAC-Bayesian non-asymptotic bounds for two types of prior for the score parameters: a Gaussian prior, and a spike-and-slab prior; the latter makes it possible to perform feature selection. One important advantage of our approach is that it is amenable to powerful Bayesian computational tools. We derive in particular a Sequential Monte Carlo algorithm, as an efficient method which may be used as a gold standard, and an Expectation-Propagation algorithm, as a much faster but approximate method. We also extend our method to a class of non-linear score functions, essentially leading to a nonparametric procedure, by considering a Gaussian process prior.

preprint2013arXiv

Rank penalized estimation of a quantum system

We introduce a new method to reconstruct the density matrix $ρ$ of a system of $n$-qubits and estimate its rank $d$ from data obtained by quantum state tomography measurements repeated $m$ times. The procedure consists in minimizing the risk of a linear estimator $\hatρ$ of $ρ$ penalized by given rank (from 1 to $2^n$), where $\hatρ$ is previously obtained by the moment method. We obtain simultaneously an estimator of the rank and the resulting density matrix associated to this rank. We establish an upper bound for the error of penalized estimator, evaluated with the Frobenius norm, which is of order $dn(4/3)^n /m$ and consistency for the estimator of the rank. The proposed methodology is computationaly efficient and is illustrated with some example states and real experimental data sets.

preprint2012arXiv

Fast rates in learning with dependent observations

In this paper we tackle the problem of fast rates in time series forecasting from a statistical learning perspective. In a serie of papers (e.g. Meir 2000, Modha and Masry 1998, Alquier and Wintenberger 2012) it is shown that the main tools used in learning theory with iid observations can be extended to the prediction of time series. The main message of these papers is that, given a family of predictors, we are able to build a new predictor that predicts the series as well as the best predictor in the family, up to a remainder of order $1/\sqrt{n}$. It is known that this rate cannot be improved in general. In this paper, we show that in the particular case of the least square loss, and under a strong assumption on the time series (phi-mixing) the remainder is actually of order $1/n$. Thus, the optimal rate for iid variables, see e.g. Tsybakov 2003, and individual sequences, see \cite{lugosi} is, for the first time, achieved for uniformly mixing processes. We also show that our method is optimal for aggregating sparse linear combinations of predictors.

preprint2012arXiv

Model selection for weakly dependent time series forecasting

Observing a stationary time series, we propose a two-step procedure for the prediction of the next value of the time series. The first step follows machine learning theory paradigm and consists in determining a set of possible predictors as randomized estimators in (possibly numerous) different predictive models. The second step follows the model selection paradigm and consists in choosing one predictor with good properties among all the predictors of the first steps. We study our procedure for two different types of bservations: causal Bernoulli shifts and bounded weakly dependent processes. In both cases, we give oracle inequalities: the risk of the chosen predictor is close to the best prediction risk in all predictive models that we consider. We apply our procedure for predictive models such as linear predictors, neural networks predictors and non-parametric autoregressive.

preprint2012arXiv

Prediction of quantiles by statistical learning and application to GDP forecasting

In this paper, we tackle the problem of prediction and confidence intervals for time series using a statistical learning approach and quantile loss functions. In a first time, we show that the Gibbs estimator (also known as Exponentially Weighted aggregate) is able to predict as well as the best predictor in a given family for a wide set of loss functions. In particular, using the quantile loss function of Koenker and Bassett (1978), this allows to build confidence intervals. We apply these results to the problem of prediction and confidence regions for the French Gross Domestic Product (GDP) growth, with promising results.

preprint2012arXiv

Prediction of time series by statistical learning: general losses and fast rates

We establish rates of convergences in time series forecasting using the statistical learning approach based on oracle inequalities. A series of papers extends the oracle inequalities obtained for iid observations to time series under weak dependence conditions. Given a family of predictors and $n$ observations, oracle inequalities state that a predictor forecasts the series as well as the best predictor in the family up to a remainder term $Δ_n$. Using the PAC-Bayesian approach, we establish under weak dependence conditions oracle inequalities with optimal rates of convergence. We extend previous results for the absolute loss function to any Lipschitz loss function with rates $Δ_n\sim\sqrt{c(Θ)/ n}$ where $c(Θ)$ measures the complexity of the model. We apply the method for quantile loss functions to forecast the french GDP. Under additional conditions on the loss functions (satisfied by the quadratic loss function) and on the time series, we refine the rates of convergence to $Δ_n \sim c(Θ)/n$. We achieve for the first time these fast rates for uniformly mixing processes. These rates are known to be optimal in the iid case and for individual sequences. In particular, we generalize the results of Dalalyan and Tsybakov on sparse regression estimation to the case of autoregression.

preprint2011arXiv

Generalization of l1 constraints for high dimensional regression problems

We focus on the high dimensional linear regression $Y\sim\mathcal{N}(Xβ^{*},σ^{2}I_{n})$, where $β^{*}\in\mathds{R}^{p}$ is the parameter of interest. In this setting, several estimators such as the LASSO and the Dantzig Selector are known to satisfy interesting properties whenever the vector $β^{*}$ is sparse. Interestingly both of the LASSO and the Dantzig Selector can be seen as orthogonal projections of 0 into $\mathcal{DC}(s)=\{β\in\mathds{R}^{p},\|X'(Y-Xβ)\|_{\infty}\leq s\}$ - using an $\ell_{1}$ distance for the Dantzig Selector and $\ell_{2}$ for the LASSO. For a well chosen $s>0$, this set is actually a confidence region for $β^{*}$. In this paper, we investigate the properties of estimators defined as projections on $\mathcal{DC}(s)$ using general distances. We prove that the obtained estimators satisfy oracle properties close to the one of the LASSO and Dantzig Selector. On top of that, it turns out that these estimators can be tuned to exploit a different sparsity or/and slightly different estimation objectives.

preprint2011arXiv

Pac-bayesian bounds for sparse regression estimation with exponential weights

We consider the sparse regression model where the number of parameters $p$ is larger than the sample size $n$. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performances. The BIC estimator for instance performs well from the statistical point of view \cite{BTW07} but can only be computed for values of $p$ of at most a few tens. The Lasso estimator is solution of a convex minimization problem, hence computable for large value of $p$. However stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov \cite{arnak} propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large $p$ and satisfies nice statistical properties under weak assumptions on the design. However, \cite{arnak} proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of \cite{arnak} but with improved statistical performances. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of exponential weight estimator. We also propose a MCMC method to compute our estimator for reasonably large values of $p$.

preprint2011arXiv

Sparse single-index model

Let $(\bX, Y)$ be a random pair taking values in $\mathbb R^p \times \mathbb R$. In the so-called single-index model, one has $Y=f^{\star}(θ^{\star T}\bX)+\bW$, where $f^{\star}$ is an unknown univariate measurable function, $θ^{\star}$ is an unknown vector in $\mathbb R^d$, and $W$ denotes a random noise satisfying $\mathbb E[\bW|\bX]=0$. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations ("$p$ larger than $n$" paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures.

preprint2011arXiv

Sparsity considerations for dependent observations

The aim of this paper is to provide a comprehensive introduction for the study of L1-penalized estimators in the context of dependent observations. We define a general $\ell_{1}$-penalized estimator for solving problems of stochastic optimization. This estimator turns out to be the LASSO in the regression estimation setting. Powerful theoretical guarantees on the statistical performances of the LASSO were provided in recent papers, however, they usually only deal with the iid case. Here, we study our estimator under various dependence assumptions.

preprint2010arXiv

Transductive versions of the LASSO and the Dantzig Selector

Transductive methods are useful in prediction problems when the training dataset is composed of a large number of unlabeled observations and a smaller number of labeled observations. In this paper, we propose an approach for developing transductive prediction procedures that are able to take advantage of the sparsity in the high dimensional linear regression. More precisely, we define transductive versions of the LASSO and the Dantzig Selector . These procedures combine labeled and unlabeled observations of the training dataset to produce a prediction for the unlabeled observations. We propose an experimental study of the transductive estimators, that shows that they improve the LASSO and Dantzig Selector in many situations, and particularly in high dimensional problems when the predictors are correlated. We then provide non-asymptotic theoretical guarantees for these estimation methods. Interestingly, our theoretical results show that the Transductive LASSO and Dantzig Selector satisfy sparsity inequalities under weaker assumptions than those required for the "original" LASSO.

Pierre Alquier

What is connected

Connect this record

See the researcher in context

Building this map preview

29 published item(s)

Estimation of time series by Maximum Mean Discrepancy

Position: agentic AI orchestration should be Bayes-consistent

Deviation inequalities for stochastic approximation by averaging

Estimation of copulas via Maximum Mean Discrepancy

Optimal quasi-Bayesian reduced rank regression with incomplete response

Tight Risk Bound for High Dimensional Time Series Completion

A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix

Non-exponentially weighted aggregation: regret bounds for unbounded loss functions

Understanding the population structure correction regression

High dimensional VAR with low rank transition

A Generalization Bound for Online Variational Inference

Exponential inequalities for nonstationary Markov Chains

Matrix factorization for multivariate time series analysis

Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures

A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution

Light and Widely Applicable MCMC: Approximate Bayesian Inference for Large Datasets

On the properties of variational approximations of Gibbs posteriors

Bayesian matrix completion: prior specification

PAC-Bayesian AUC classification and scoring

Rank penalized estimation of a quantum system

Fast rates in learning with dependent observations

Model selection for weakly dependent time series forecasting

Prediction of quantiles by statistical learning and application to GDP forecasting

Prediction of time series by statistical learning: general losses and fast rates

Generalization of l1 constraints for high dimensional regression problems

Pac-bayesian bounds for sparse regression estimation with exponential weights

Sparse single-index model

Sparsity considerations for dependent observations

Transductive versions of the LASSO and the Dantzig Selector