Researcher profile

Alexander Rakhlin

Alexander Rakhlin contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
24works
0followers
10topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

24 published item(s)

preprint2026arXiv

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are \emph{not scale-invariant}: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension $d=1$; moreover, in $d=1$ we obtain $O(\log T)$ scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for $d>1$ we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to \emph{doubly-uniform} regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, \emph{``Uniform regret bounds over $\mathbb{R}^d$ for the sequential linear regression problem with the square loss''} (ALT 2019): in $d=1$ we give an explicit algorithm with $O(\log T)$ doubly-uniform regret, whereas for $d>1$ sublinear doubly-uniform regret is impossible. Finally, under a natural \emph{smoothness} condition (bounded Radon--Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for $d>1$ without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales.

preprint2022arXiv

Damped Online Newton Step for Portfolio Selection

We revisit the classic online portfolio selection problem, where at each round a learner selects a distribution over a set of portfolios to allocate its wealth. It is known that for this problem a logarithmic regret with respect to Cover's loss is achievable using the Universal Portfolio Selection algorithm, for example. However, all existing algorithms that achieve a logarithmic regret for this problem have per-round time and space complexities that scale polynomially with the total number of rounds, making them impractical. In this paper, we build on the recent work by Haipeng et al. 2018 and present the first practical online portfolio selection algorithm with a logarithmic regret and whose per-round time and space complexities depend only logarithmically on the horizon. Behind our approach are two key technical novelties of independent interest. We first show that the Damped Online Newton steps can approximate mirror descent iterates well, even when dealing with time-varying regularizers. Second, we present a new meta-algorithm that achieves an adaptive logarithmic regret (i.e. a logarithmic regret on any sub-interval) for mixable losses.

preprint2022arXiv

Intrinsic Dimension Estimation Using Wasserstein Distances

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.

preprint2022arXiv

On the Complexity of Adversarial Decision Making

A central problem in online learning and decision making -- from bandits to reinforcement learning -- is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show -- via new upper and lower bounds -- that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results -- both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy and the Exploration-by-Optimization objective of Lattimore and György.

preprint2022arXiv

Smoothed Online Learning is as Easy as Statistical Learning

Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.

preprint2021arXiv

On the Minimal Error of Empirical Risk Minimization

We study the minimal error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in the random and the fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM for both Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.

preprint2021arXiv

Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy

Motivated by modern applications, such as online advertisement and recommender systems, we study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select $k$ arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$, where $A$ is the total number of arms and $\mathcal{F}$ is the class containing the regression function, while only requiring $\tilde{O}(A)$ computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically-motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of $O(k\sqrt{(\log A-k+1)T \log (|\mathcal{F}|T)})$. Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.

preprint2020arXiv

Adaptive Online Learning

We propose a general framework for studying adaptive regret bounds in the online learning framework, including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound we ask, "Does there exist some algorithm achieving this bound?" We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes. Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem that holds for countably infinite sets.

preprint2020arXiv

Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.

preprint2020arXiv

Data Driven Estimation of Stochastic Switched Linear Systems of Unknown Order

We address the problem of learning the parameters of a mean square stable switched linear systems (SLS) with unknown latent space dimension, or \textit{order}, from its noisy input--output data. In particular, we focus on learning a good lower order approximation of the underlying model allowed by finite data. Motivated by subspace-based algorithms in system theory, we construct a Hankel-like matrix from finite noisy data using ordinary least squares. Such a formulation circumvents the non-convexities that arise in system identification, and allows for accurate estimation of the underlying SLS as data size increases. Since the model order is unknown, the key idea of our approach is model order selection based on purely data dependent quantities. We construct Hankel-like matrices from data of dimension obtained from the order selection procedure. By exploiting tools from theory of model reduction for SLS, we obtain suitable approximations via singular value decomposition (SVD) and show that the system parameter estimates are close to a balanced truncated realization of the underlying system with high probability.

preprint2020arXiv

Fast Mixing of Multi-Scale Langevin Dynamics under the Manifold Hypothesis

Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, the Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.

preprint2020arXiv

Learning nonlinear dynamical systems from a single trajectory

We introduce algorithms for learning nonlinear dynamical systems of the form $x_{t+1}=σ(Θ^{\star}x_t)+\varepsilon_t$, where $Θ^{\star}$ is a weight matrix, $σ$ is a nonlinear link function, and $\varepsilon_t$ is a mean-zero noise process. We give an algorithm that recovers the weight matrix $Θ^{\star}$ from a single trajectory with optimal sample complexity and linear running time. The algorithm succeeds under weaker statistical assumptions than in previous work, and in particular i) does not require a bound on the spectral norm of the weight matrix $Θ^{\star}$ (rather, it depends on a generalization of the spectral radius) and ii) enjoys guarantees for non-strictly-increasing link functions such as the ReLU. Our analysis has two key components: i) we give a general recipe whereby global stability for nonlinear dynamical systems can be used to certify that the state-vector covariance is well-conditioned, and ii) using these tools, we extend well-known algorithms for efficiently learning generalized linear models to the dependent setting.

preprint2020arXiv

Nonparametric Finite Time LTI System Identification

We address the problem of learning the parameters of a stable linear time invariant (LTI) system or linear dynamical system (LDS) with unknown latent space dimension, or order, from a single time--series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy.

preprint2020arXiv

On Suboptimality of Least Squares with Application to Estimation of Convex Bodies

We develop a technique for establishing lower bounds on the sample complexity of Least Squares (or, Empirical Risk Minimization) for large classes of functions. As an application, we settle an open problem regarding optimality of Least Squares in estimating a convex set from noisy support function measurements in dimension $d\geq 6$. Specifically, we establish that Least Squares is mimimax sub-optimal, and achieves a rate of $\tildeΘ_d(n^{-2/(d-1)})$ whereas the minimax rate is $Θ_d(n^{-4/(d+3)})$.

preprint2020arXiv

On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels

We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^α$, $α\in(0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

preprint2020arXiv

Optimality of Maximum Likelihood for Log-Concave Density Estimation and Bounded Convex Regression

In this paper, we study two problems: (1) estimation of a $d$-dimensional log-concave distribution and (2) bounded multivariate convex regression with random design with an underlying log-concave density or a compactly supported distribution with a continuous density. First, we show that for all $d \ge 4$ the maximum likelihood estimators of both problems achieve an optimal risk of $Θ_d(n^{-2/(d+1)})$ (up to a logarithmic factor) in terms of squared Hellinger distance and $L_2$ squared distance, respectively. Previously, the optimality of both these estimators was known only for $d\le 3$. We also prove that the $ε$-entropy numbers of the two aforementioned families are equal up to logarithmic factors. We complement these results by proving a sharp bound $Θ_d(n^{-2/(d+4)})$ on the minimax rate (up to logarithmic factors) with respect to the total variation distance. Finally, we prove that estimating a log-concave density - even a uniform distribution on a convex set - up to a fixed accuracy requires the number of samples \emph{at least} exponential in the dimension. We do that by improving the dimensional constant in the best known lower bound for the minimax rate from $2^{-d}\cdot n^{-2/(d+1)}$ to $c\cdot n^{-2/(d+1)}$ (when $d\geq 2$).

preprint2019arXiv

Fisher-Rao Metric, Geometry, and Complexity of Neural Networks

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity --- the Fisher-Rao norm --- that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

preprint2019arXiv

Just Interpolate: Kernel "Ridgeless" Regression Can Generalize

In the absence of explicit regularization, Kernel "Ridgeless" Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.

preprint2017arXiv

Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

preprint2016arXiv

On Detection and Structural Reconstruction of Small-World Random Networks

In this paper, we study detection and fast reconstruction of the celebrated Watts-Strogatz (WS) small-world random graph model \citep{watts1998collective} which aims to describe real-world complex networks that exhibit both high clustering and short average length properties. The WS model with neighborhood size $k$ and rewiring probability probability $β$ can be viewed as a continuous interpolation between a deterministic ring lattice graph and the Erdős-Rényi random graph. We study both the computational and statistical aspects of detecting the deterministic ring lattice structure (or local geographical links, strong ties) in the presence of random connections (or long range links, weak ties), and for its recovery. The phase diagram in terms of $(k,β)$ is partitioned into several regions according to the difficulty of the problem. We propose distinct methods for the various regions.

preprint2015arXiv

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix

The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of (one or multiple non-overlapping) signal submatrix (of magnitude $λ$ and size $k_m \times k_n$) contaminated with a noise matrix (of size $m \times n$), we establish two transition thresholds for the signal to noise $λ/σ$ ratio in terms of $m$, $n$, $k_m$, and $k_n$. The first threshold, $\sf SNR_c$, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the \textit{hidden clique hypothesis}. We introduce adaptive linear time spectral algorithms that identify the submatrix with high probability when the signal strength is above the threshold $\sf SNR_c$. The second threshold, $\sf SNR_s$, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that $\sf SNR_c$ is always significantly larger than $\sf SNR_s$, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.

preprint2015arXiv

Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions

We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an \emph{approximately} log-concave distribution using the Hit-and-Run method, which is shown to have the same $\mathcal{O}^*$ complexity as sampling from log-concave distributions. In addition to extend the analysis for log-concave distributions to approximate log-concave distributions, the implementation of the 1-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm then is based on simulated annealing which does not relies on first order conditions which makes it essentially immune to local minima. We then apply the method to different motivating problems. In the context of zeroth order stochastic convex optimization, the proposed method produces an $ε$-minimizer after $\mathcal{O}^*(n^{7.5}ε^{-2})$ noisy function evaluations by inducing a $\mathcal{O}(ε/n)$-approximately log concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.

preprint2015arXiv

Geometric Inference for General High-Dimensional Linear Inverse Problems

This paper presents a unified geometric framework for the statistical analysis of a general ill-posed linear inverse model which includes as special cases noisy compressed sensing, sign vector recovery, trace regression, orthogonal matrix estimation, and noisy matrix completion. We propose computationally feasible convex programs for statistical inference including estimation, confidence intervals and hypothesis testing. A theoretical framework is developed to characterize the local estimation rate of convergence and to provide statistical inference guarantees. Our results are built based on the local conic geometry and duality. The difficulty of statistical inference is captured by the geometric characterization of the local tangent cone through the Gaussian width and Sudakov minoration estimate.

preprint2015arXiv

Learning with Square Loss: Localization through Offset Rademacher Complexity

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of \citep{RakSriTsy15} for the bounded case while also providing guarantees without the boundedness assumption.