Researcher profile

Vladimir Braverman

Vladimir Braverman contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
19 works
0 followers
9 topics
4 close collaborators

Published work

19 published items

preprint | 2026 | arXiv

AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by the switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S reduces reasoning length by up to 71% with minimal accuracy loss, yielding a markedly better trade-off in token length and inference time while preserving accuracy.

preprint | 2026 | arXiv

Online Learning with Limited Information in the Sliding Window Model

Motivated by recent work on the experts problem in the streaming model, we consider the experts problem in the sliding window model. The sliding window model is a well-studied model that captures applications such as traffic monitoring, epidemic tracking, and automated trading, where recent information is more valuable than older data. Formally, we have $n$ experts, $T$ days, the ability to query the predictions of $q$ experts on each day, and a limited amount of memory, and we should achieve the (near-)optimal regret of $\sqrt{nW}\text{polylog}(nT)$ over any window of the last $W$ days. While it is impossible to achieve such regret with $1$ query, we show that with $2$ queries we can achieve such regret using only $\text{polylog}(nT)$ bits of memory. Not only are our algorithms optimal for sliding windows, but we also show for every interval $\mathcal{I}$ of days that we achieve $\sqrt{n|\mathcal{I}|}\text{polylog}(nT)$ regret with $2$ queries and only $\text{polylog}(nT)$ bits of memory, providing an exponential improvement on the memory of previous interval regret algorithms. Building upon these techniques, we address the bandit problem in data streams, where $q=1$, achieving $n T^{2/3}\text{polylog}(T)$ regret with $\text{polylog}(nT)$ memory, which is the first sublinear regret in the streaming model in the bandit setting with polylogarithmic memory; this can be further improved to the optimal $\mathcal{O}(\sqrt{nT})$ regret if the best expert's losses are in a random order.
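
For orientation, the classical full-information baseline for the experts problem is the Hedge (multiplicative weights) update. The sketch below is that textbook baseline, not the algorithm from the abstract: it reads the losses of all $n$ experts every day and stores $n$ weights, which is the cost that the $2$-query, $\text{polylog}(nT)$-memory result is designed to avoid. The function name `hedge` and the learning rate `eta` are illustrative assumptions.

```python
import math


def hedge(loss_rows, eta):
    """Textbook Hedge over a sequence of per-day loss vectors in [0, 1].

    Assumption: full information (every expert's loss is observed each day).
    Returns (final weights, total expected loss of the learner).
    """
    n = len(loss_rows[0])
    weights = [1.0 / n] * n  # start from the uniform distribution
    total_loss = 0.0
    for losses in loss_rows:
        # Expected loss when an expert is drawn according to the weights.
        total_loss += sum(w * l for w, l in zip(weights, losses))
        # Multiplicative update, then renormalise to keep a distribution.
        scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        norm = sum(scaled)
        weights = [s / norm for s in scaled]
    return weights, total_loss


if __name__ == "__main__":
    days = [[0.0, 1.0]] * 50  # expert 0 is always right, expert 1 always wrong
    final_weights, learner_loss = hedge(days, eta=0.5)
    print(final_weights, learner_loss)
```

Regret here is the learner's total loss minus the best expert's total loss; in the toy run the best expert has loss 0, so the printed loss is the regret.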

preprint | 2022 | arXiv

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/ε^2)$, where $ε>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $ρ>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/ε\cdot (H^3SA / ρ+ H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $ε$-optimal policy for a post-revealed reward with sub-optimality gap at least $ρ$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $ε$. We show that, information-theoretically, this bound is nearly tight for $ρ< Θ(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

preprint | 2022 | arXiv

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

Stochastic gradient descent (SGD) has been shown to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decay, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019). However, a sharp analysis for the last iterate of SGD in the overparameterized setting is still open. In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems. In particular, for last iterate SGD with (tail) geometrically decaying stepsize, we prove nearly matching upper and lower bounds on the excess risk. Moreover, we provide an excess risk lower bound for last iterate SGD with polynomially decaying stepsize and demonstrate the advantage of geometrically decaying stepsize in an instance-wise manner, which complements the minimax rate comparison made in prior works.
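
A minimal sketch of the setup the abstract describes: single-pass SGD on a least-squares objective, a stepsize that stays constant within a phase and shrinks geometrically between phases, and the last iterate returned as the output. The halving factor, the equal phase lengths, and the names `geometric_stepsizes` and `sgd_last_iterate` are assumptions made for illustration; the paper's exact schedule may differ.

```python
def geometric_stepsizes(eta0, total_steps, num_phases):
    """Stepsize eta0 in the first phase, halved at the start of each later phase."""
    phase_len = max(1, total_steps // num_phases)
    schedule = []
    for t in range(total_steps):
        phase = min(t // phase_len, num_phases - 1)
        schedule.append(eta0 / (2 ** phase))
    return schedule


def sgd_last_iterate(data, dim, stepsizes):
    """One pass of SGD on squared loss 0.5 * (w.x - y)^2; returns the last iterate."""
    w = [0.0] * dim
    for (x, y), eta in zip(data, stepsizes):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of the squared loss at this sample is residual * x.
        w = [wi - eta * residual * xi for wi, xi in zip(w, x)]
    return w


if __name__ == "__main__":
    steps = geometric_stepsizes(1.0, 8, 4)
    print(steps)
    # Noiseless one-dimensional toy data with true weight 2.
    samples = [([1.0], 2.0)] * 8
    print(sgd_last_iterate(samples, 1, steps))
```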

preprint | 2022 | arXiv

New Coresets for Projective Clustering and Applications

$(j,k)$-projective clustering is the natural generalization of the family of $k$-clustering and $j$-subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$. Moreover, we give the first strong coreset construction for general $M$-estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$, and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.

preprint | 2022 | arXiv

Pretrained Models for Multilingual Federated Learning

Since the advent of Federated Learning (FL), research has applied these methods to natural language processing (NLP) tasks. Despite a plethora of papers in FL for NLP, no previous works have studied how multilingual text impacts FL algorithms. Furthermore, multilingual text provides an interesting avenue to examine the impact of non-IID text (e.g. different languages) on FL in naturally occurring data. We explore three multilingual language tasks, language modeling, machine translation, and text classification using differing federated and non-federated learning algorithms. Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning, even when using non-IID partitioning.

preprint | 2022 | arXiv

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are for single-pass SGD, which is a less practical variant compared to the commonly used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wise, than GD in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations, and therefore is preferable in terms of computational time.

preprint | 2022 | arXiv

Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations

Continual/lifelong learning from a non-stationary input data stream is a cornerstone of intelligence. Despite their phenomenal performance in a wide variety of applications, deep neural networks are prone to forgetting their previously learned information upon learning new information. This phenomenon is called "catastrophic forgetting" and is deeply rooted in the stability-plasticity dilemma. Overcoming catastrophic forgetting in deep neural networks has become an active field of research in recent years. In particular, gradient projection-based methods have recently shown exceptional performance at overcoming catastrophic forgetting. This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that significantly increase a continual learner's performance over a long sequence of tasks. Our proposed approach builds on the Gradient Projection Memory (GPM) framework. We leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks. In addition, we introduce two new benchmarks for continual learning under distributional shift, namely Continual Swiss Roll and ImageNet SuperDog-40. Lastly, we provide an in-depth analysis of our proposed method and demonstrate a significant performance boost on various benchmark continual learning problems.
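
The k-winner activation mentioned above can be sketched as follows: within a layer, only the `k` largest activations pass through and the rest are set to zero, which enforces a fixed sparsity level. This is a generic k-winners-take-all illustration on a plain Python list; the function name `k_winners` is an assumption, and the paper's layer-wise implementation and its heterogeneous dropout are not reproduced here.

```python
def k_winners(activations, k):
    """Keep the k largest activations and zero out all others."""
    if k <= 0:
        return [0.0] * len(activations)
    # Indices sorted by activation value, largest first.
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    keep = set(ranked[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


if __name__ == "__main__":
    print(k_winners([0.1, 0.9, -0.3, 0.5], k=2))
```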

preprint | 2022 | arXiv

Sublinear Time Spectral Density Estimation

We present a new sublinear time algorithm for approximating the spectral density (eigenvalue distribution) of an $n\times n$ normalized graph adjacency or Laplacian matrix. The algorithm recovers the spectrum up to $ε$ accuracy in the Wasserstein-1 distance in $O(n\cdot \text{poly}(1/ε))$ time given sample access to the graph. This result complements recent work by David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant (2018), which obtains a solution with runtime independent of $n$, but exponential in $1/ε$. We conjecture that the trade-off between dimension dependence and accuracy is inherent. Our method is simple and works well experimentally. It is based on a Chebyshev polynomial moment matching method that employs randomized estimators for the matrix trace. We prove that, for any Hermitian $A$, this moment matching method returns an $ε$ approximation to the spectral density using just $O(1/ε)$ matrix-vector products with $A$. By leveraging stability properties of the Chebyshev polynomial three-term recurrence, we then prove that the method is amenable to the use of coarse approximate matrix-vector products. Our sublinear time algorithm follows from combining this result with a novel sampling algorithm for approximating matrix-vector products with a normalized graph adjacency matrix. Of independent interest, we show a similar result for the widely used \emph{kernel polynomial method} (KPM), proving that this practical algorithm nearly matches the theoretical guarantees of our moment matching method. Our analysis uses tools from Jackson's seminal work on approximation with positive polynomial kernels.
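
The two ingredients named in the abstract, a randomized trace estimator and the Chebyshev three-term recurrence $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$, can be combined to estimate the normalized Chebyshev moments $\frac{1}{n}\mathrm{tr}(T_k(A))$ using only matrix-vector products. The sketch below assumes a real symmetric $A$ with eigenvalues in $[-1, 1]$, uses random sign (Hutchinson-style) probe vectors, and stores $A$ densely; these choices and the name `chebyshev_moments` are illustrative, and the step that turns moments into a density estimate is omitted.

```python
import random


def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def chebyshev_moments(A, num_moments, num_probes=8, seed=0):
    """Estimate tr(T_k(A)) / n for k = 0 .. num_moments - 1."""
    rng = random.Random(seed)
    n = len(A)
    moments = [0.0] * num_moments
    for _ in range(num_probes):
        g = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # random sign probe
        t_prev = g             # T_0(A) g = g
        t_curr = matvec(A, g)  # T_1(A) g = A g
        for k in range(num_moments):
            if k == 0:
                vec = t_prev
            elif k == 1:
                vec = t_curr
            else:
                # Three-term recurrence applied to vectors.
                a_times_curr = matvec(A, t_curr)
                t_next = [2.0 * a - b for a, b in zip(a_times_curr, t_prev)]
                t_prev, t_curr = t_curr, t_next
                vec = t_curr
            # g^T T_k(A) g is an unbiased estimate of tr(T_k(A)).
            moments[k] += sum(gi * vi for gi, vi in zip(g, vec))
    return [m / (num_probes * n) for m in moments]


if __name__ == "__main__":
    A = [[0.5, 0.0], [0.0, -0.5]]
    print(chebyshev_moments(A, num_moments=4))
```

For a diagonal matrix the sign-probe estimate is exact, which makes the toy input a convenient sanity check: the eigenvalues $\pm 0.5$ give $T_2(\pm 0.5) = -0.5$.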

preprint | 2022 | arXiv

The Benefits of Implicit Regularization from SGD in Least Squares Problems

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.

preprint | 2022 | arXiv

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by online SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data. In addition, we show that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.

preprint | 2020 | arXiv

Communication-efficient distributed SGD with Sketching

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that Sketched SGD has favorable convergence rates on several classes of functions. When considering all communication -- both of gradients and of updated model weights -- Sketched SGD reduces the amount of communication required compared to other gradient compression methods from $\mathcal{O}(d)$ or $\mathcal{O}(W)$ to $\mathcal{O}(\log d)$, where $d$ is the number of model parameters and $W$ is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that Sketched SGD scales to at least 256 workers without increasing communication cost or degrading model performance.
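
The idea of communicating sketches instead of full gradients relies on the sketch being linear, so that sketches from different workers can be summed and large coordinates of the summed gradient recovered afterwards. The sketch below assumes a Count Sketch, a standard linear sketch from the streaming literature; the class name, its parameters, and the explicit per-coordinate hash tables (a real implementation would use compact hash functions) are illustrative assumptions rather than the paper's implementation.

```python
import random
import statistics


class CountSketch:
    """Small Count Sketch of a length-`dim` vector using a rows x cols table."""

    def __init__(self, dim, rows, cols, seed=0):
        rng = random.Random(seed)  # workers must share the seed to be mergeable
        self.cols = cols
        # For clarity the bucket and sign of every coordinate are stored explicitly.
        self.buckets = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
        self.signs = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def add(self, vector):
        """Add a vector (for example a local gradient) into the sketch."""
        for r, row in enumerate(self.table):
            for i, value in enumerate(vector):
                row[self.buckets[r][i]] += self.signs[r][i] * value

    def merge(self, other):
        """Linearity: the sum of two sketches is the sketch of the summed vectors."""
        for row, other_row in zip(self.table, other.table):
            for c in range(self.cols):
                row[c] += other_row[c]

    def estimate(self, i):
        """Median-of-rows estimate of coordinate i of the sketched vector."""
        values = [self.signs[r][i] * self.table[r][self.buckets[r][i]]
                  for r in range(len(self.table))]
        return statistics.median(values)


if __name__ == "__main__":
    dim = 16
    grad_a = [0.0] * dim
    grad_b = [0.0] * dim
    grad_a[3], grad_b[3] = 5.0, 7.0  # one heavy coordinate on each worker
    worker_a = CountSketch(dim, rows=3, cols=4)
    worker_b = CountSketch(dim, rows=3, cols=4)
    worker_a.add(grad_a)
    worker_b.add(grad_b)
    worker_a.merge(worker_b)
    print(worker_a.estimate(3))
```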

preprint | 2020 | arXiv

Coresets for Clustering in Excluded-minor Graphs and Beyond

Coresets are modern data-reduction tools that are widely used in data analysis to improve efficiency in terms of running time, space and communication complexity. Our main result is a fast algorithm to construct a small coreset for k-Median in (the shortest-path metric of) an excluded-minor graph. Specifically, we give the first coreset whose size depends only on $k$, $ε$ and the excluded-minor size, and our running time is quasi-linear (in the size of the input graph). The main innovation in our new algorithm is that it is iterative; it first reduces the $n$ input points to roughly $O(\log n)$ reweighted points, then to $O(\log\log n)$, and so forth until the size is independent of $n$. Each step in this iterative size reduction is based on the importance sampling framework of Feldman and Langberg (STOC 2011), with a crucial adaptation that reduces the number of \emph{distinct points}, by employing a terminal embedding (where low distortion is guaranteed only for the distance from every terminal to all other points). Our terminal embedding is technically involved and relies on shortest-path separators, a standard tool in planar and excluded-minor graphs. Furthermore, our new algorithm is applicable also in Euclidean metrics, by simply using a recent terminal embedding result of Narayanan and Nelson (STOC 2019), which extends the Johnson-Lindenstrauss Lemma. We thus obtain an efficient coreset construction in high-dimensional Euclidean spaces, thereby matching and simplifying state-of-the-art results (Sohler and Woodruff, FOCS 2018; Huang and Vishnoi, STOC 2020). In addition, we also employ terminal embedding with additive distortion to obtain small coresets in graphs with bounded highway dimension, and use applications of our coresets to obtain improved approximation schemes, e.g., an improved PTAS for planar k-Median via a new centroid set.

preprint | 2020 | arXiv

Data-Independent Neural Pruning via Coresets

Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources. The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrary new sample. We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy.

preprint | 2020 | arXiv

Data-Independent Structured Pruning of Neural Networks via Coresets

Model compression is crucial for deployment of neural networks on devices with limited computational and memory resources. Many different methods show comparable accuracy of the compressed model and similar compression rates. However, the majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrary new sample. We propose the first efficient structured pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework and it approximates the output of a layer of neurons/filters by a coreset of neurons/filters in the previous layer and discards the rest. We apply this framework in a layer-by-layer fashion from the bottom to the top. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one.

preprint | 2020 | arXiv

Obtaining Adjustable Regularization for Free via Iterate Averaging

Regularization for optimization is a crucial technique to avoid overfitting in machine learning. In order to obtain the best performance, we usually train a model by tuning the regularization parameters. It becomes costly, however, when a single round of training takes a significant amount of time. Very recently, Neu and Rosasco showed that if we run stochastic gradient descent (SGD) on linear regression problems, then by averaging the SGD iterates properly, we obtain a regularized solution. This left open whether the same phenomenon can be achieved for other optimization problems and algorithms. In this paper, we establish an averaging scheme that provably converts the iterates of SGD on an arbitrary strongly convex and smooth objective function to its regularized counterpart with an adjustable regularization parameter. Our approaches can be used for accelerated and preconditioned optimization methods as well. We further show that the same methods work empirically on more general optimization objectives including neural networks. In sum, we obtain adjustable regularization for free for a large class of optimization problems and resolve an open question raised by Neu and Rosasco.

preprint | 2020 | arXiv

On the Noisy Gradient Descent that Generalizes as SGD

The gradient noise of SGD is considered to play a central role in the observed strong generalization abilities of deep learning. While past studies confirm that the magnitude and the covariance structure of gradient noise are critical for regularization, it remains unclear whether or not the class of noise distributions is important. In this work we provide negative results by showing that noises in classes different from the SGD noise can also effectively regularize gradient descent. Our finding is based on a novel observation on the structure of the SGD noise: it is the multiplication of the gradient matrix and a sampling noise that arises from the mini-batch sampling procedure. Moreover, the sampling noises unify two kinds of gradient regularizing noises that belong to the Gaussian class: the one using (scaled) Fisher as covariance and the one using the gradient covariance of SGD as covariance. Finally, thanks to the flexibility of choosing noise class, an algorithm is proposed to perform noisy gradient descent that generalizes well, the variant of which even benefits large batch SGD training without hurting generalization.

preprint | 2020 | arXiv

Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension

Spectral functions of large matrices contain important structural information about the underlying data, and are thus becoming increasingly important. Many times, large matrices representing real-world data are \emph{sparse} or \emph{doubly sparse} (i.e., sparse in both rows and columns), and are accessed as a \emph{stream} of updates, typically organized in \emph{row-order}. In this setting, where space (memory) is the limiting resource, all known algorithms require space that is polynomial in the dimension of the matrix, even for sparse matrices. We address this challenge by providing the first algorithms whose space requirement is \emph{independent of the matrix dimension}, assuming the matrix is doubly-sparse and presented in row-order. Our algorithms approximate the Schatten $p$-norms, which we use in turn to approximate other spectral functions, such as logarithm of the determinant, trace of matrix inverse, and Estrada index. We validate these theoretical performance bounds by numerical experiments on real-world matrices representing social networks. We further prove that multiple passes are unavoidable in this setting, and show extensions of our primary technique, including a trade-off between space requirements and number of passes.
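
As a reminder of the quantity being approximated: the Schatten $p$-norm of $A$ is $\|A\|_{S_p} = \left(\sum_i \sigma_i^p\right)^{1/p}$, where $\sigma_i$ are the singular values of $A$, and for even $p$ it equals $\left(\mathrm{tr}\big((A^\top A)^{p/2}\big)\right)^{1/p}$. The sketch below evaluates that identity exactly on a small dense matrix. It is a reference computation for checking values, not the streaming algorithm from the abstract, and the helper names are assumptions.

```python
def matmul(A, B):
    """Dense product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def schatten_norm_even(A, p):
    """Schatten p-norm for even p via tr((A^T A)^(p/2))."""
    if p <= 0 or p % 2 != 0:
        raise ValueError("this sketch only handles positive even p")
    gram = matmul(transpose(A), A)  # eigenvalues are the squared singular values
    power = gram
    for _ in range(p // 2 - 1):
        power = matmul(power, gram)
    trace = sum(power[i][i] for i in range(len(power)))
    return trace ** (1.0 / p)


if __name__ == "__main__":
    A = [[3.0, 0.0], [0.0, 4.0]]
    print(schatten_norm_even(A, 2))  # Frobenius norm of A
    print(schatten_norm_even(A, 4))
```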

preprint | 2020 | arXiv

Universal Streaming of Subset Norms

Most known algorithms in the streaming model of computation aim to approximate a single function such as an $\ell_p$-norm. In 2009, Nelson [https://sublinear.info, Open Problem 30] asked if it is possible to design \emph{universal algorithms}, that simultaneously approximate multiple functions of the stream. In this paper we answer the question of Nelson for the class of \emph{subset $\ell_0$-norms} in the insertion-only frequency-vector model. Given a family of subsets $\mathcal{S}\subset 2^{[n]}$, we provide a single streaming algorithm that can $(1\pm ε)$-approximate the subset-norm for every $S\in\mathcal{S}$. Here, the subset-$\ell_p$-norm of $v\in \mathbb{R}^n$ with respect to set $S\subseteq [n]$ is the $\ell_p$-norm of vector $v_{|S}$ (which denotes restricting $v$ to $S$, by zeroing all other coordinates). Our main result is a near-tight characterization of the space complexity of every family $\mathcal{S}\subset 2^{[n]}$ of subset-$\ell_0$-norms in insertion-only streams, expressed in terms of the "heavy-hitter dimension" of $\mathcal{S}$, a new combinatorial quantity that is related to the VC-dimension of $\mathcal{S}$. In contrast, we show that the more general turnstile and sliding-window models require a much larger space usage. All these results easily extend to $\ell_1$. In addition, we design algorithms for two other subset-$\ell_p$-norm variants. These can be compared to the Priority Sampling algorithm of Duffield, Lund and Thorup [JACM 2007], which achieves additive approximation $ε\|v\|$ for all possible subsets ($\mathcal{S}=2^{[n]}$) in the entry-wise update model. One of our algorithms extends this algorithm to handle turnstile updates, and another one achieves multiplicative approximation given a family $\mathcal{S}$.
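
The definition of the subset-$\ell_p$-norm above can be made concrete with a direct, non-streaming computation: restrict $v$ to the coordinates in $S$ and take the usual $\ell_p$-norm, with $\ell_0$ counting the nonzero coordinates. The function name `subset_norm` and the 0-based indexing are assumptions for illustration; the streaming algorithms in the abstract approximate this quantity without storing $v$.

```python
def subset_norm(v, subset, p):
    """l_p-norm of v restricted to the indices in `subset` (0-based).

    For p == 0 this returns the number of nonzero coordinates of v inside the subset.
    """
    restricted = [abs(v[i]) for i in subset]
    if p == 0:
        return float(sum(1 for x in restricted if x != 0))
    return sum(x ** p for x in restricted) ** (1.0 / p)


if __name__ == "__main__":
    v = [3.0, 0.0, -4.0, 1.0]
    S = {0, 1, 2}
    print(subset_norm(v, S, 0), subset_norm(v, S, 1), subset_norm(v, S, 2))
```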