Source author record

Matus Telgarsky

Matus Telgarsky appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.OC Computer Science and Game Theory math.NA math.PR Neural and Evolutionary Computing Data Structures and Algorithms math.ST Numerical Analysis Statistics Theory

Catalog footprint

What is connected

23works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2025arXiv

Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

We introduce \textit{basic inequalities} for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. While related inequalities appear in the literature, we isolate and highlight a specific form and develop it as a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound translates the number of iterations into an effective regularization coefficient in the loss function. We demonstrate this framework through analyses of training dynamics and prediction risk bounds. In addition to revisiting and refining known results on gradient descent, we provide new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.

preprint2022arXiv

Actor-critic is implicitly biased towards high entropy optimal policies

We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like $ε$-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.

preprint2022arXiv

Feature selection with gradient descent on two-layer networks in low-rotation regimes

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely-well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and can not rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.

preprint2022arXiv

Stochastic linear optimization never overfits with quadratically-bounded losses on general data

This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).

preprint2020arXiv

Gradient descent follows the regularization path for general losses

Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an implicit bias. This bias is typically towards a certain regularized solution, and relies upon the details of the learning process, for instance the use of the cross-entropy loss. In this work, we show that for empirical risk minimization over linear predictors with arbitrary convex, strictly decreasing losses, if the risk does not attain its infimum, then the gradient-descent path and the algorithm-independent regularization path converge to the same direction (whenever either converges to a direction). Using this result, we provide a justification for the widely-used exponentially-tailed losses (such as the exponential loss or the logistic loss): while this convergence to a direction for exponentially-tailed losses is necessarily to the maximum-margin direction, other losses such as polynomially-tailed losses may induce convergence to a direction with a poor margin.

preprint2020arXiv

Neural tangent kernels, transportation mappings, and universal approximation

This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization. Concretely, the paper has two main contributions: a generic scheme to approximate functions with the NTK by sampling from transport mappings between the initial weights and their desired values, and the construction of transport mappings via Fourier transforms. Regarding the first contribution, the proof scheme provides another perspective on how the NTK regime arises from rescaling: redundancy in the weights due to resampling allows individual weights to be scaled down. Regarding the second contribution, the most notable transport mapping asserts that roughly $1 / δ^{10d}$ nodes are sufficient to approximate continuous functions, where $δ$ depends on the continuity properties of the target function. By contrast, nearly the same proof yields a bound of $1 / δ^{2d}$ for shallow ReLU networks; this gap suggests a tantalizing direction for future work, separating shallow ReLU networks and their linearization.

preprint2020arXiv

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size $n$, the (inverse) target error $1/ε$, and the (inverse) failure probability $1/δ$. This work shows that $\widetildeΘ(1/ε)$ iterations of gradient descent with $\widetildeΩ(1/ε^2)$ training examples on two-layer ReLU networks of any width exceeding $\mathrm{polylog}(n,1/ε,1/δ)$ suffice to achieve a test misclassification error of $ε$. We also prove that stochastic gradient descent can achieve $ε$ test error with polylogarithmic width and $\widetildeΘ(1/ε)$ samples. The analysis relies upon the separation margin of the limiting kernel, which is guaranteed positive, can distinguish between true labels and random labels, and can give a tight sample-complexity analysis in the infinite-width setting

preprint2016arXiv

Benefits of depth in neural networks

For any positive integer $k$, there exist neural networks with $Θ(k^3)$ layers, $Θ(1)$ nodes per layer, and $Θ(1)$ distinct parameters which can not be approximated by networks with $\mathcal{O}(k)$ layers unless they are exponentially large --- they must possess $Ω(2^k)$ nodes. This result is proved here for a class of nodes termed "semi-algebraic gates" which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees (in this last case with a stronger separation: $Ω(2^{k^3})$ total tree nodes are required).

preprint2016arXiv

Greedy bi-criteria approximations for $k$-medians and $k$-means

This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases clustering cost. In the case of $k$-medians and $k$-means, the key results are as follows. $\bullet$ When the method considers all data points as candidate centers, then selecting $\mathcal{O}(k\log(1/\varepsilon))$ centers achieves cost at most $2+\varepsilon$ times the optimal cost with $k$ centers. $\bullet$ Alternatively, the same guarantees hold if each round samples $\mathcal{O}(k/\varepsilon^5)$ candidate centers proportionally to their cluster cost (as with $\texttt{kmeans++}$, but holding centers fixed). $\bullet$ In the case of $k$-means, considering an augmented set of $n^{\lceil1/\varepsilon\rceil}$ candidate centers gives $1+\varepsilon$ approximation with $\mathcal{O}(k\log(1/\varepsilon))$ centers, the entire algorithm taking $\mathcal{O}(dk\log(1/\varepsilon)n^{1+\lceil1/\varepsilon\rceil})$ time, where $n$ is the number of data points in $\mathbb{R}^d$. $\bullet$ In the case of Euclidean $k$-medians, generating a candidate set via $n^{\mathcal{O}(1/\varepsilon^2)}$ executions of stochastic gradient descent with adaptively determined constraint sets will once again give approximation $1+\varepsilon$ with $\mathcal{O}(k\log(1/\varepsilon))$ centers in $dk\log(1/\varepsilon)n^{\mathcal{O}(1/\varepsilon^2)}$ time. Ancillary results include: guarantees for cluster costs based on powers of metrics; a brief, favorable empirical evaluation against $\texttt{kmeans++}$; data-dependent bounds allowing $1+\varepsilon$ in the first two bullets above, for example with $k$-medians over finite metric spaces.

preprint2016arXiv

Rate of Price Discovery in Iterative Combinatorial Auctions

We study a class of iterative combinatorial auctions which can be viewed as subgradient descent methods for the problem of pricing bundles to balance supply and demand. We provide concrete convergence rates for auctions in this class, bounding the number of auction rounds needed to reach clearing prices. Our analysis allows for a variety of pricing schemes, including item, bundle, and polynomial pricing, and the respective convergence rates confirm that more expressive pricing schemes come at the cost of slower convergence. We consider two models of bidder behavior. In the first model, bidders behave stochastically according to a random utility model, which includes standard best-response bidding as a special case. In the second model, bidders behave arbitrarily (even adversarially), and meaningful convergence relies on properly designed activity rules.

preprint2015arXiv

Convex Risk Minimization and Conditional Probability Estimation

This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, we show the same result holds for \emph{empirical} risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound.

preprint2015arXiv

Representation Benefits of Deep Feedforward Networks

This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.

preprint2014arXiv

Scalable Nonlinear Learning with Adaptive Polynomial Expansions

Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.

preprint2014arXiv

Tensor decompositions for learning latent variable models

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.

preprint2013arXiv

Boosting with the Logistic Loss is Consistent

This manuscript provides optimization guarantees, generalization bounds, and statistical consistency results for AdaBoost variants which replace the exponential loss with the logistic and similar losses (specifically, twice differentiable convex losses which are Lipschitz and tend to zero on one side). The heart of the analysis is to show that, in lieu of explicit regularization and constraints, the structure of the problem is fairly rigidly controlled by the source distribution itself. The first control of this type is in the separable case, where a distribution-dependent relaxed weak learning rate induces speedy convergence with high probability over any sample. Otherwise, in the nonseparable case, the convex surrogate risk itself exhibits distribution-dependent levels of curvature, and consequently the algorithm's output has small norm with high probability.

preprint2013arXiv

Dirichlet draws are sparse with high probability

This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse (most coordinates are small).

preprint2013arXiv

Margins, Shrinkage, and Boosting

This manuscript shows that AdaBoost and its immediate variants can produce approximate maximum margin classifiers simply by scaling step size choices with a fixed small constant. In this way, when the unscaled step size is an optimal choice, these results provide guarantees for Friedman's empirically successful "shrinkage" procedure for gradient boosting (Friedman, 2000). Guarantees are also provided for a variety of other step sizes, affirming the intuition that increasingly regularized line searches provide improved margin guarantees. The results hold for the exponential loss and similar losses, most notably the logistic loss.

preprint2013arXiv

Moment-based Uniform Deviation Bounds for $k$-means and Friends

Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\min\{-1/4, -1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of $k$-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for $k$-means instances possessing some cluster structure.

preprint2012arXiv

A Primal-Dual Convergence Analysis of Boosting

Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: - Weak learnability aids the whole loss family: for any ε>0, O(ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; - The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O(ln(1/ε)); - Arbitrary instances may be decomposed into the above two, granting rate O(1/ε), with a matching lower bound provided for the logistic loss.

preprint2012arXiv

Agglomerative Bregman Clustering

This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

preprint2012arXiv

Statistical Consistency of Finite-dimensional Unregularized Linear Classification

This manuscript studies statistical properties of linear classifiers obtained through minimization of an unregularized convex risk over a finite sample. Although the results are explicitly finite-dimensional, inputs may be passed through feature maps; in this way, in addition to treating the consistency of logistic regression, this analysis also handles boosting over a finite weak learning class with, for instance, the exponential, logistic, and hinge losses. In this finite-dimensional setting, it is still possible to fit arbitrary decision boundaries: scaling the complexity of the weak learning class with the sample size leads to the optimal classification risk almost surely.

preprint2011arXiv

Blackwell Approachability and Minimax Theory

This manuscript investigates the relationship between Blackwell Approachability, a stochastic vector-valued repeated game, and minimax theory, a single-play scalar-valued scenario. First, it is established in a general setting --- one not permitting invocation of minimax theory --- that Blackwell's Approachability Theorem and its generalization due to Hou are still valid. Second, minimax structure grants a result in the spirit of Blackwell's weak-approachability conjecture, later resolved by Vieille, that any set is either approachable by one player, or avoidable by the opponent. This analysis also reveals a strategy for the opponent.

preprint2010arXiv

Central Binomial Tail Bounds

An alternate form for the binomial tail is presented, which leads to a variety of bounds for the central tail. A few can be weakened into the corresponding Chernoff and Slud bounds, which not only demonstrates the quality of the presented bounds, but also provides alternate proofs for the classical bounds.

Matus Telgarsky

What is connected

Connect this record

See the researcher in context

Building this map preview

23 published item(s)

Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

Actor-critic is implicitly biased towards high entropy optimal policies

Feature selection with gradient descent on two-layer networks in low-rotation regimes

Stochastic linear optimization never overfits with quadratically-bounded losses on general data

Gradient descent follows the regularization path for general losses

Neural tangent kernels, transportation mappings, and universal approximation

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Benefits of depth in neural networks

Greedy bi-criteria approximations for $k$-medians and $k$-means

Rate of Price Discovery in Iterative Combinatorial Auctions

Convex Risk Minimization and Conditional Probability Estimation

Representation Benefits of Deep Feedforward Networks

Scalable Nonlinear Learning with Adaptive Polynomial Expansions

Tensor decompositions for learning latent variable models

Boosting with the Logistic Loss is Consistent

Dirichlet draws are sparse with high probability

Margins, Shrinkage, and Boosting

Moment-based Uniform Deviation Bounds for $k$-means and Friends

A Primal-Dual Convergence Analysis of Boosting

Agglomerative Bregman Clustering

Statistical Consistency of Finite-dimensional Unregularized Linear Classification

Blackwell Approachability and Minimax Theory

Central Binomial Tail Bounds