Source author record

Nikita Zhivotovskiy

Nikita Zhivotovskiy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.ST Statistics Theory math.PR Data Structures and Algorithms math.CO Computational Geometry Discrete Mathematics Information Theory math.IT math.MG

Catalog footprint

What is connected

13works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are \emph{not scale-invariant}: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension $d=1$; moreover, in $d=1$ we obtain $O(\log T)$ scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for $d>1$ we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to \emph{doubly-uniform} regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, \emph{``Uniform regret bounds over $\mathbb{R}^d$ for the sequential linear regression problem with the square loss''} (ALT 2019): in $d=1$ we give an explicit algorithm with $O(\log T)$ doubly-uniform regret, whereas for $d>1$ sublinear doubly-uniform regret is impossible. Finally, under a natural \emph{smoothness} condition (bounded Radon--Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for $d>1$ without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales.

preprint2022arXiv

A Regret-Variance Trade-Off in Online Learning

We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and "variance" (i.e., squared difference of learner's predictions and best expert predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a variant of EWA either achieves a negative regret (i.e., the algorithm outperforms the best expert), or guarantees a $O(\log K)$ bound on both variance and regret. Building on this result, we show several examples of how variance of predictions can be exploited in learning. In the online to batch analysis, we show that a large empirical variance allows to stop the online to batch conversion early and outperform the risk of the best predictor in the class. We also recover the optimal rate of model selection aggregation when we do not consider early stopping. In online prediction with corrupted losses, we show that the effect of corruption on the regret can be compensated by a large variance. In online selective sampling, we design an algorithm that samples less when the variance is large, while guaranteeing the optimal regret bound in expectation. In online learning with abstention, we use a similar term as the variance to derive the first high-probability $O(\log K)$ regret bound in this setting. Finally, we extend our results to the setting of online linear regression.

preprint2022arXiv

A remark on Kashin's discrepancy argument and partial coloring in the Komlós conjecture

In this expository note, we discuss an early partial coloring result of B. Kashin [C. R. Acad. Bulgare Sci., 1985]. Although this result only implies Spencer's six standard deviations [Trans. Amer. Math. Soc., 1985] up to a $\log\log n$ factor, Kashin's argument gives a simple proof of the existence of a constant discrepancy partial coloring in the setup of Komlós conjecture.

preprint2022arXiv

Dimension-free Bounds for Sums of Independent Matrices and Simple Tensors via the Variational Principle

We consider the deviation inequalities for the sums of independent $d$ by $d$ random matrices, as well as rank one random tensors. Our focus is on the non-isotropic case and the bounds that do not depend explicitly on the dimension $d$, but rather on the effective rank. In an elementary and unified manner, we show the following results: 1) A deviation bound for the sums of independent positive-semi-definite matrices. This result complements the dimension-free bound of Koltchinskii and Lounici [Bernoulli, 2017] on the sample covariance matrix in the sub-Gaussian case. 2) A new bound for truncated covariance matrices that is used to prove a dimension-free version of the bound of Adamczak, Litvak, Pajor and Tomczak-Jaegermann [Journal Of Amer. Math. Soc., 2010] on the sample covariance matrix in the log-concave case. 3) Dimension-free bounds for the operator norm of the sums of random tensors of rank one formed either by sub-Gaussian or by log-concave random vectors. This complements the result of Guédon and Rudelson [Adv. in Math., 2007]. 4) A non-isotropic version of the result of Alesker [Geom. Asp. of Funct. Anal., 1995] on the deviation of the norm of sub-exponential random vectors. 5) A dimension-free lower tail bound for sums of positive semi-definite matrices with heavy-tailed entries, sharpening the bound of Oliveira [Prob. Th. and Rel. Fields, 2016]. Our approach is based on the duality formula between entropy and moment generating functions. In contrast to the known proofs of dimension-free bounds, we avoid Talagrand's majorizing measure theorem, as well as generic chaining bounds for empirical processes. Some of our tools were pioneered by O. Catoni and co-authors in the context of robust statistical estimation.

preprint2022arXiv

Exponential Savings in Agnostic Active Learning through Abstention

We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss $1/2$ of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under the model misspecification.

preprint2021arXiv

Distribution-Free Robust Linear Regression

We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. In this distribution-free regression setting, we show that boundedness of the conditional second moment of the response given the covariates is a necessary and sufficient condition for achieving nontrivial guarantees. As a starting point, we prove an optimal version of the classical in-expectation bound for the truncated least squares estimator due to Györfi, Kohler, Krzyżak, and Walk. However, we show that this procedure fails with constant probability for some distributions despite its optimal in-expectation performance. Then, combining the ideas of truncated least squares, median-of-means procedures, and aggregation theory, we construct a non-linear estimator achieving excess risk of order $d/n$ with an optimal sub-exponential tail. While existing approaches to linear regression for heavy-tailed distributions focus on proper estimators that return linear functions, we highlight that the improperness of our procedure is necessary for attaining nontrivial guarantees in the distribution-free setting.

preprint2021arXiv

Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors

We study the problem of predicting as well as the best linear predictor in a bounded Euclidean ball with respect to the squared loss. When only boundedness of the data generating distribution is assumed, we establish that the least squares estimator constrained to a bounded Euclidean ball does not attain the classical $O(d/n)$ excess risk rate, where $d$ is the dimension of the covariates and $n$ is the number of samples. In particular, we construct a bounded distribution such that the constrained least squares estimator incurs an excess risk of order $Ω(d^{3/2}/n)$ hence refuting a recent conjecture of Ohad Shamir [JMLR 2015]. In contrast, we observe that non-linear predictors can achieve the optimal rate $O(d/n)$ with no assumptions on the distribution of the covariates. We discuss additional distributional assumptions sufficient to guarantee an $O(d/n)$ excess risk rate for the least squares estimator. Among them are certain moment equivalence assumptions often used in the robust statistics literature. While such assumptions are central in the analysis of unbounded and heavy-tailed settings, our work indicates that in some cases, they also rule out unfavorable bounded distributions.

preprint2020arXiv

Fast Rates for Online Prediction with Abstention

In the setting of sequential prediction of individual $\{0, 1\}$-sequences with expert advice, we show that by allowing the learner to abstain from the prediction by paying a cost marginally smaller than $\frac 12$ (say, $0.49$), it is possible to achieve expected regret bounds that are independent of the time horizon $T$. We exactly characterize the dependence on the abstention cost $c$ and the number of experts $N$ by providing matching upper and lower bounds of order $\frac{\log N}{1-2c}$, which is to be contrasted with the best possible rate of $\sqrt{T\log N}$ that is available without the option to abstain. We also discuss various extensions of our model, including a setting where the sequence of abstention costs can change arbitrarily over time, where we show regret bounds interpolating between the slow and the fast rates mentioned above, under some natural assumptions on the sequence of abstention costs.

preprint2020arXiv

Noise sensitivity of the top eigenvector of a Wigner matrix

We investigate the noise sensitivity of the top eigenvector of a Wigner matrix in the following sense. Let $v$ be the top eigenvector of an $N\times N$ Wigner matrix. Suppose that $k$ randomly chosen entries of the matrix are resampled, resulting in another realization of the Wigner matrix with top eigenvector $v^{[k]}$. We prove that, with high probability, when $k \ll N^{5/3-o(1)}$, then $v$ and $v^{[k]}$ are almost collinear and when $k\gg N^{5/3}$, then $v^{[k]}$ is almost orthogonal to $v$.

preprint2020arXiv

Proper Learning, Helly Number, and an Optimal SVM Bound

The classical PAC sample complexity bounds are stated for any Empirical Risk Minimizer (ERM) and contain an extra logarithmic factor $\log(1/ε)$ which is known to be necessary for ERM in general. It has been recently shown by Hanneke (2016) that the optimal sample complexity of PAC learning for any VC class C is achieved by a particular improper learning algorithm, which outputs a specific majority-vote of hypotheses in C. This leaves the question of when this bound can be achieved by proper learning algorithms, which are restricted to always output a hypothesis from C. In this paper we aim to characterize the classes for which the optimal sample complexity can be achieved by a proper learning algorithm. We identify that these classes can be characterized by the dual Helly number, which is a combinatorial parameter that arises in discrete geometry and abstract convexity. In particular, under general conditions on C, we show that the dual Helly number is bounded if and only if there is a proper learner that obtains the optimal joint dependence on $ε$ and $δ$. As further implications of our techniques we resolve a long-standing open problem posed by Vapnik and Chervonenkis (1974) on the performance of the Support Vector Machine by proving that the sample complexity of SVM in the realizable case is $Θ((n/ε)+(1/ε)\log(1/δ))$, where $n$ is the dimension. This gives the first optimal PAC bound for Halfspaces achieved by a proper learning algorithm, and moreover is computationally efficient.

preprint2020arXiv

Sharper bounds for uniformly stable algorithms

Deriving generalization bounds for stable algorithms is a classical question in learning theory taking its roots in the early works by Vapnik and Chervonenkis (1974) and Rogers and Wagner (1978). In a series of recent breakthrough papers by Feldman and Vondrak (2018, 2019), it was shown that the best known high probability upper bounds for uniformly stable learning algorithms due to Bousquet and Elisseef (2002) are sub-optimal in some natural regimes. To do so, they proved two generalization bounds that significantly outperform the simple generalization bound of Bousquet and Elisseef (2002). Feldman and Vondrak also asked if it is possible to provide sharper bounds and prove corresponding high probability lower bounds. This paper is devoted to these questions: firstly, inspired by the original arguments of Feldman and Vondrak (2019), we provide a short proof of the moment bound that implies the generalization bound stronger than both recent results (Feldman and Vondrak, 2018, 2019). Secondly, we prove general lower bounds, showing that our moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. Our main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.

preprint2020arXiv

When are epsilon-nets small?

In many interesting situations the size of epsilon-nets depends only on $ε$ together with different complexity measures. The aim of this paper is to give a systematic treatment of such complexity measures arising in Discrete and Computational Geometry and Statistical Learning, and to bridge the gap between the results appearing in these two fields. As a byproduct, we obtain several new upper bounds on the sizes of epsilon-nets that generalize/improve the best known general guarantees. In particular, our results work with regimes when small epsilon-nets of size $o(\frac{1}ε)$ exist, which are not usually covered by standard upper bounds. Inspired by results in Statistical Learning we also give a short proof of the Haussler's upper bound on packing numbers.

preprint2016arXiv

Permutational Rademacher Complexity: a New Complexity Measure for Transductive Learning

Transductive learning considers situations when a learner observes $m$ labelled training points and $u$ unlabelled test points with the final goal of giving correct answers for the test points. This paper introduces a new complexity measure for transductive learning called Permutational Rademacher Complexity (PRC) and studies its properties. A novel symmetrization inequality is proved, which shows that PRC provides a tighter control over expected suprema of empirical processes compared to what happens in the standard i.i.d. setting. A number of comparison results are also provided, which show the relation between PRC and other popular complexity measures used in statistical learning theory, including Rademacher complexity and Transductive Rademacher Complexity (TRC). We argue that PRC is a more suitable complexity measure for transductive learning. Finally, these results are combined with a standard concentration argument to provide novel data-dependent risk bounds for transductive learning.

Nikita Zhivotovskiy

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

A Regret-Variance Trade-Off in Online Learning

A remark on Kashin's discrepancy argument and partial coloring in the Komlós conjecture

Dimension-free Bounds for Sums of Independent Matrices and Simple Tensors via the Variational Principle

Exponential Savings in Agnostic Active Learning through Abstention

Distribution-Free Robust Linear Regression

Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors

Fast Rates for Online Prediction with Abstention

Noise sensitivity of the top eigenvector of a Wigner matrix

Proper Learning, Helly Number, and an Optimal SVM Bound

Sharper bounds for uniformly stable algorithms

When are epsilon-nets small?

Permutational Rademacher Complexity: a New Complexity Measure for Transductive Learning