Source author record

Vladimir Braverman

Vladimir Braverman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher; unclaimed source record

Data Structures and Algorithms; Machine Learning; math.OC; Artificial Intelligence; Computation and Language; Computer Vision; Distributed, Parallel, and Cluster Computing; math.NA; Numerical Analysis

Catalog footprint

What is connected

33 works

9 topics

4 close collaborators


preprint, 2026, arXiv

AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by a switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S effectively reduces reasoning length by up to 71% with minimal accuracy loss, yielding a markedly better trade-off in token length and inference time while preserving accuracy.

preprint, 2026, arXiv

Online Learning with Limited Information in the Sliding Window Model

Motivated by recent work on the experts problem in the streaming model, we consider the experts problem in the sliding window model. The sliding window model is a well-studied model that captures applications such as traffic monitoring, epidemic tracking, and automated trading, where recent information is more valuable than older data. Formally, we have $n$ experts, $T$ days, the ability to query the predictions of $q$ experts on each day, a limited amount of memory, and should achieve the (near-)optimal regret of $\sqrt{nW}\text{polylog}(nT)$ over any window of the last $W$ days. While it is impossible to achieve such regret with $1$ query, we show that with $2$ queries we can achieve such regret with only $\text{polylog}(nT)$ bits of memory. Not only are our algorithms optimal for sliding windows, but we also show for every interval $\mathcal{I}$ of days that we achieve $\sqrt{n|\mathcal{I}|}\text{polylog}(nT)$ regret with $2$ queries and only $\text{polylog}(nT)$ bits of memory, providing an exponential improvement on the memory of previous interval regret algorithms. Building upon these techniques, we address the bandit problem in data streams, where $q=1$, achieving $n T^{2/3}\text{polylog}(T)$ regret with $\text{polylog}(nT)$ memory, which is the first sublinear regret in the streaming model in the bandit setting with polylogarithmic memory; this can be further improved to the optimal $\mathcal{O}(\sqrt{nT})$ regret if the best expert's losses are in a random order.

preprint, 2022, arXiv

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/ε^2)$, where $ε>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $ρ>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/ε\cdot (H^3SA / ρ+ H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $ε$-optimal policy for a post-revealed reward with sub-optimality gap at least $ρ$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $ε$. We show that, information-theoretically, this bound is nearly tight for $ρ< Θ(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

preprint, 2022, arXiv

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

Stochastic gradient descent (SGD) has been shown to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decay, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019). However, a sharp analysis for the last iterate of SGD in the overparameterized setting is still open. In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems. In particular, for last iterate SGD with (tail) geometrically decaying stepsize, we prove nearly matching upper and lower bounds on the excess risk. Moreover, we provide an excess risk lower bound for last iterate SGD with polynomially decaying stepsize and demonstrate the advantage of geometrically decaying stepsize in an instance-wise manner, which complements the minimax rate comparison made in prior works.
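
As a rough illustration of the stepsize schedule described above (a constant initial stepsize followed by repeated geometric decay), here is a minimal Python sketch. The function name `geometric_stepsizes`, the equal-length phases, and the default halving factor are assumptions made for this example; the abstract does not specify the exact schedule analyzed in the paper.

```python
def geometric_stepsizes(eta0, total_steps, num_phases, decay=0.5):
    """Return one stepsize per iteration.

    Assumed schedule (hypothetical, for illustration): the run is split into
    `num_phases` phases of equal length, and the stepsize is multiplied by
    `decay` at the start of every phase after the first.
    """
    if total_steps <= 0 or num_phases <= 0:
        return []
    phase_len = max(1, total_steps // num_phases)
    schedule = []
    for t in range(total_steps):
        # Leftover iterations (when the split is uneven) stay in the last phase.
        phase = min(t // phase_len, num_phases - 1)
        schedule.append(eta0 * decay ** phase)
    return schedule


steps = geometric_stepsizes(eta0=1.0, total_steps=8, num_phases=4)
```

With these arguments the stepsize is held at 1.0 for two iterations and then halved every two iterations; a last-iterate SGD run would simply consume `steps[t]` at iteration `t` and output the final iterate.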

preprint, 2022, arXiv

New Coresets for Projective Clustering and Applications

$(j,k)$-projective clustering is the natural generalization of the family of $k$-clustering and $j$-subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$. Moreover, we give the first strong coreset construction for general $M$-estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$, and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.

preprint, 2022, arXiv

Pretrained Models for Multilingual Federated Learning

Since the advent of Federated Learning (FL), research has applied these methods to natural language processing (NLP) tasks. Despite a plethora of papers in FL for NLP, no previous works have studied how multilingual text impacts FL algorithms. Furthermore, multilingual text provides an interesting avenue to examine the impact of non-IID text (e.g. different languages) on FL in naturally occurring data. We explore three multilingual language tasks, language modeling, machine translation, and text classification using differing federated and non-federated learning algorithms. Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning, even when using non-IID partitioning.

preprint, 2022, arXiv

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.

preprint, 2022, arXiv

Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations

Continual/lifelong learning from a non-stationary input data stream is a cornerstone of intelligence. Despite their phenomenal performance in a wide variety of applications, deep neural networks are prone to forgetting their previously learned information upon learning new ones. This phenomenon is called "catastrophic forgetting" and is deeply rooted in the stability-plasticity dilemma. Overcoming catastrophic forgetting in deep neural networks has become an active field of research in recent years. In particular, gradient projection-based methods have recently shown exceptional performance at overcoming catastrophic forgetting. This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that significantly increase a continual learner's performance over a long sequence of tasks. Our proposed approach builds on the Gradient Projection Memory (GPM) framework. We leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks. In addition, we introduce two new benchmarks for continual learning under distributional shift, namely Continual Swiss Roll and ImageNet SuperDog-40. Lastly, we provide an in-depth analysis of our proposed method and demonstrate a significant performance boost on various benchmark continual learning problems.
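
The k-winner activation mentioned above can be pictured with a short Python sketch: keep the k largest activations in a layer and zero out the rest. The function name `k_winners` and the tie-breaking rule (earlier index wins) are assumptions for illustration only; the paper's actual layer implementation and its heterogeneous dropout are not reproduced here.

```python
def k_winners(activations, k):
    """Keep the k largest activations and set all others to zero.

    Ties are broken in favor of the earlier index (an arbitrary choice
    made for this sketch).
    """
    n = len(activations)
    if k <= 0:
        return [0.0] * n
    # Indices sorted by activation value, largest first; sorting is stable,
    # so equal values keep their original relative order.
    order = sorted(range(n), key=lambda i: activations[i], reverse=True)
    winners = set(order[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


sparse = k_winners([0.1, 0.9, -0.3, 0.5], k=2)
```

Applied per layer, this yields the layer-wise sparse activation pattern that the abstract describes.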

preprint, 2022, arXiv

Sublinear Time Spectral Density Estimation

We present a new sublinear time algorithm for approximating the spectral density (eigenvalue distribution) of an $n\times n$ normalized graph adjacency or Laplacian matrix. The algorithm recovers the spectrum up to $ε$ accuracy in the Wasserstein-1 distance in $O(n\cdot \text{poly}(1/ε))$ time given sample access to the graph. This result complements recent work by David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant (2018), which obtains a solution with runtime independent of $n$, but exponential in $1/ε$. We conjecture that the trade-off between dimension dependence and accuracy is inherent. Our method is simple and works well experimentally. It is based on a Chebyshev polynomial moment matching method that employs randomized estimators for the matrix trace. We prove that, for any Hermitian $A$, this moment matching method returns an $ε$ approximation to the spectral density using just $O({1}/ε)$ matrix-vector products with $A$. By leveraging stability properties of the Chebyshev polynomial three-term recurrence, we then prove that the method is amenable to the use of coarse approximate matrix-vector products. Our sublinear time algorithm follows from combining this result with a novel sampling algorithm for approximating matrix-vector products with a normalized graph adjacency matrix. Of independent interest, we show a similar result for the widely used \emph{kernel polynomial method} (KPM), proving that this practical algorithm nearly matches the theoretical guarantees of our moment matching method. Our analysis uses tools from Jackson's seminal work on approximation with positive polynomial kernels.
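
To make the ingredients named above concrete, the following Python sketch estimates the normalized Chebyshev moments $\frac{1}{n}\operatorname{tr}\,T_k(A)$ using the three-term recurrence $T_{k+1}(A)g = 2A\,T_k(A)g - T_{k-1}(A)g$ and a randomized trace estimator with random sign vectors. It is a simplified illustration, not the paper's algorithm: it uses exact dense matrix-vector products, assumes a real symmetric $A$ with eigenvalues in $[-1, 1]$, and stops at the moments rather than recovering a density. The names `matvec`, `dot`, and `chebyshev_moments` are hypothetical.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def chebyshev_moments(A, num_moments, num_probes=10, rng=None):
    """Estimate (1/n) tr T_k(A) for k = 0..num_moments.

    Assumes A is symmetric with eigenvalues in [-1, 1]. Each probe g has
    independent +/-1 entries, so E[g^T M g] = tr(M) for any matrix M.
    """
    rng = rng or random.Random()
    n = len(A)
    totals = [0.0] * (num_moments + 1)
    for _ in range(num_probes):
        g = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        prev, curr = g, matvec(A, g)          # T_0(A) g and T_1(A) g
        totals[0] += dot(g, prev)
        if num_moments >= 1:
            totals[1] += dot(g, curr)
        for k in range(2, num_moments + 1):
            a_curr = matvec(A, curr)
            nxt = [2.0 * a - p for a, p in zip(a_curr, prev)]  # three-term recurrence
            totals[k] += dot(g, nxt)
            prev, curr = curr, nxt
    return [t / (num_probes * n) for t in totals]


identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
moments = chebyshev_moments(identity, num_moments=4, rng=random.Random(0))
```

For the identity matrix every eigenvalue is 1 and $T_k(1) = 1$, so all estimated moments equal 1 regardless of the random probes. Each probe costs `num_moments` matrix-vector products, which is the resource that the abstract counts.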

preprint, 2022, arXiv

The Benefits of Implicit Regularization from SGD in Least Squares Problems

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.

preprint, 2022, arXiv

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by online SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data. In addition, we show that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.

preprint, 2020, arXiv

Communication-efficient distributed SGD with Sketching

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that Sketched SGD has favorable convergence rates on several classes of functions. When considering all communication -- both of gradients and of updated model weights -- Sketched SGD reduces the amount of communication required compared to other gradient compression methods from $\mathcal{O}(d)$ or $\mathcal{O}(W)$ to $\mathcal{O}(\log d)$, where $d$ is the number of model parameters and $W$ is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that Sketched SGD scales to at least 256 workers without increasing communication cost or degrading model performance.

preprint, 2020, arXiv

Coresets for Clustering in Excluded-minor Graphs and Beyond

Coresets are modern data-reduction tools that are widely used in data analysis to improve efficiency in terms of running time, space and communication complexity. Our main result is a fast algorithm to construct a small coreset for k-Median in (the shortest-path metric of) an excluded-minor graph. Specifically, we give the first coreset of size that depends only on $k$, $ε$ and the excluded-minor size, and our running time is quasi-linear (in the size of the input graph). The main innovation in our new algorithm is that it is iterative; it first reduces the $n$ input points to roughly $O(\log n)$ reweighted points, then to $O(\log\log n)$, and so forth until the size is independent of $n$. Each step in this iterative size reduction is based on the importance sampling framework of Feldman and Langberg (STOC 2011), with a crucial adaptation that reduces the number of \emph{distinct points}, by employing a terminal embedding (where low distortion is guaranteed only for the distance from every terminal to all other points). Our terminal embedding is technically involved and relies on shortest-path separators, a standard tool in planar and excluded-minor graphs. Furthermore, our new algorithm is applicable also in Euclidean metrics, by simply using a recent terminal embedding result of Narayanan and Nelson (STOC 2019), which extends the Johnson-Lindenstrauss Lemma. We thus obtain an efficient coreset construction in high-dimensional Euclidean spaces, thereby matching and simplifying state-of-the-art results (Sohler and Woodruff, FOCS 2018; Huang and Vishnoi, STOC 2020). In addition, we also employ terminal embedding with additive distortion to obtain small coresets in graphs with bounded highway dimension, and use applications of our coresets to obtain improved approximation schemes, e.g., an improved PTAS for planar k-Median via a new centroid set.

preprint, 2020, arXiv

Data-Independent Neural Pruning via Coresets

Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources. The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90\% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy.

preprint, 2020, arXiv

Data-Independent Structured Pruning of Neural Networks via Coresets

Model compression is crucial for deployment of neural networks on devices with limited computational and memory resources. Many different methods show comparable accuracy of the compressed model and similar compression rates. However, the majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. We propose the first efficient structured pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework and it approximates the output of a layer of neurons/filters by a coreset of neurons/filters in the previous layer and discards the rest. We apply this framework in a layer-by-layer fashion from the bottom to the top. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one.

preprint, 2020, arXiv

Obtaining Adjustable Regularization for Free via Iterate Averaging

Regularization for optimization is a crucial technique to avoid overfitting in machine learning. In order to obtain the best performance, we usually train a model by tuning the regularization parameters. It becomes costly, however, when a single round of training takes a significant amount of time. Very recently, Neu and Rosasco showed that if we run stochastic gradient descent (SGD) on linear regression problems, then by averaging the SGD iterates properly, we obtain a regularized solution. This left open whether the same phenomenon can be achieved for other optimization problems and algorithms. In this paper, we establish an averaging scheme that provably converts the iterates of SGD on an arbitrary strongly convex and smooth objective function to its regularized counterpart with an adjustable regularization parameter. Our approaches can be used for accelerated and preconditioned optimization methods as well. We further show that the same methods work empirically on more general optimization objectives including neural networks. In sum, we obtain adjustable regularization for free for a large class of optimization problems and resolve an open question raised by Neu and Rosasco.

preprint, 2020, arXiv

On the Noisy Gradient Descent that Generalizes as SGD

The gradient noise of SGD is considered to play a central role in the observed strong generalization abilities of deep learning. While past studies confirm that the magnitude and the covariance structure of gradient noise are critical for regularization, it remains unclear whether or not the class of noise distributions is important. In this work we provide negative results by showing that noises in classes different from the SGD noise can also effectively regularize gradient descent. Our finding is based on a novel observation on the structure of the SGD noise: it is the multiplication of the gradient matrix and a sampling noise that arises from the mini-batch sampling procedure. Moreover, the sampling noises unify two kinds of gradient regularizing noises that belong to the Gaussian class: the one using (scaled) Fisher as covariance and the one using the gradient covariance of SGD as covariance. Finally, thanks to the flexibility of choosing noise class, an algorithm is proposed to perform noisy gradient descent that generalizes well, the variant of which even benefits large batch SGD training without hurting generalization.

preprint, 2020, arXiv

Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension

Spectral functions of large matrices contain important structural information about the underlying data, and are thus becoming increasingly important. Many times, large matrices representing real-world data are \emph{sparse} or \emph{doubly sparse} (i.e., sparse in both rows and columns), and are accessed as a \emph{stream} of updates, typically organized in \emph{row-order}. In this setting, where space (memory) is the limiting resource, all known algorithms require space that is polynomial in the dimension of the matrix, even for sparse matrices. We address this challenge by providing the first algorithms whose space requirement is \emph{independent of the matrix dimension}, assuming the matrix is doubly-sparse and presented in row-order. Our algorithms approximate the Schatten $p$-norms, which we use in turn to approximate other spectral functions, such as logarithm of the determinant, trace of matrix inverse, and Estrada index. We validate these theoretical performance bounds by numerical experiments on real-world matrices representing social networks. We further prove that multiple passes are unavoidable in this setting, and show extensions of our primary technique, including a trade-off between space requirements and number of passes.

preprint, 2020, arXiv

Universal Streaming of Subset Norms

Most known algorithms in the streaming model of computation aim to approximate a single function such as an $\ell_p$-norm. In 2009, Nelson [\url{https://sublinear.info}, Open Problem 30] asked if it is possible to design \emph{universal algorithms}, that simultaneously approximate multiple functions of the stream. In this paper we answer the question of Nelson for the class of \emph{subset $\ell_0$-norms} in the insertion-only frequency-vector model. Given a family of subsets $\mathcal{S}\subset 2^{[n]}$, we provide a single streaming algorithm that can $(1\pm ε)$-approximate the subset-norm for every $S\in\mathcal{S}$. Here, the subset-$\ell_p$-norm of $v\in \mathbb{R}^n$ with respect to set $S\subseteq [n]$ is the $\ell_p$-norm of vector $v_{|S}$ (which denotes restricting $v$ to $S$, by zeroing all other coordinates). Our main result is a near-tight characterization of the space complexity of every family $\mathcal{S}\subset 2^{[n]}$ of subset-$\ell_0$-norms in insertion-only streams, expressed in terms of the "heavy-hitter dimension" of $\mathcal{S}$, a new combinatorial quantity that is related to the VC-dimension of $\mathcal{S}$. In contrast, we show that the more general turnstile and sliding-window models require a much larger space usage. All these results easily extend to $\ell_1$. In addition, we design algorithms for two other subset-$\ell_p$-norm variants. These can be compared to the Priority Sampling algorithm of Duffield, Lund and Thorup [JACM 2007], which achieves additive approximation $ε\|{v}\|$ for all possible subsets ($\mathcal{S}=2^{[n]}$) in the entry-wise update model. One of our algorithms extends this algorithm to handle turnstile updates, and another one achieves multiplicative approximation given a family $\mathcal{S}$.
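
The subset-$\ell_p$-norm defined above can be computed exactly when the whole vector fits in memory; the short Python sketch below does only that, as a reference point for what a streaming algorithm would approximate. The function name `subset_norm` and the 0-based indexing are assumptions for this example (the abstract indexes coordinates by $[n]$), and $\ell_0$ is taken to mean the number of nonzero coordinates in $S$.

```python
def subset_norm(v, subset, p):
    """Exact l_p norm of v restricted to the coordinates in `subset`.

    `subset` holds 0-based indices into v. For p == 0 the result is the
    number of nonzero coordinates of v inside the subset.
    """
    if p < 0:
        raise ValueError("p must be nonnegative")
    if p == 0:
        return sum(1 for i in subset if v[i] != 0)
    return sum(abs(v[i]) ** p for i in subset) ** (1.0 / p)


v = [3, 0, -4, 1]
S = {0, 2}
l0 = subset_norm(v, S, 0)   # nonzero coordinates among indices 0 and 2
l1 = subset_norm(v, S, 1)   # |3| + |-4|
l2 = subset_norm(v, S, 2)   # sqrt(9 + 16)
```

This baseline stores all of $v$; the point of the paper is to answer such queries for every $S$ in a family using far less space.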

preprint, 2016, arXiv

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space. Given a function $g$, we study the space complexity of approximating $\sum_{i=1}^n g(|f_i|)$, where $f\in\mathbb{Z}^n$ is the frequency vector of a turnstile stream. This is a generalization of the well-known frequency moments problem, and previous results apply only when $g$ is monotonic or has a special functional form. Our contribution is to give a condition such that, except for a narrow class of functions $g$, there is a space-efficient approximation algorithm for the sum if and only if $g$ satisfies the condition. The functions $g$ that we are able to characterize include all convex, concave, monotonic, polynomial, and trigonometric functions, among many others, and this is the first such characterization for non-monotonic functions. Thus, for nearly all functions of one variable, we answer the open question from the celebrated paper of Alon, Matias and Szegedy (1996).

preprint, 2015, arXiv

A Unified Approach for Clustering Problems on Sliding Windows

We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space $O(1)$-approximation to the metric $k$-median and metric $k$-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and O'Callaghan, which has remained unanswered for over a decade. Our algorithm uses $O(k^3 \log^6 n)$ space and $\operatorname{poly}(k, \log n)$ update time. This is an exponential improvement on the space required by the technique due to Babcock, et al. We introduce a data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions. In particular, we show that using only polylogarithmic space we can maintain a summary of the current window from which we can construct an $O(1)$-approximate clustering solution. Merge-and-reduce is a generic method in computational geometry for adapting offline algorithms to the insertion-only streaming model. Several well-known coreset constructions are maintainable in the insertion-only streaming model using this method, including well-known coreset techniques for $k$-median and $k$-means in both low- and high-dimensional Euclidean spaces. Previous work has adapted these techniques to the insertion-deletion model, but translating them to the sliding window model has remained a challenge. We give the first algorithm that, given an insertion-only streaming coreset construction of space $s$, maintains a $(1\pmε)$-approximate coreset in the sliding window model using $O(s^2ε^{-2}\log n)$ space. For clustering problems, our results constitute the first significant step towards resolving problem number 20 from the List of Open Problems in Sublinear Algorithms.

preprint, 2015, arXiv

Approximating Subadditive Hadamard Functions on Implicit Matrices

An important challenge in the streaming model is to maintain small-space approximations of entrywise functions performed on a matrix that is generated by the outer product of two vectors given as a stream. In other works, streams typically define matrices in a standard way via a sequence of updates, as in the work of Woodruff (2014) and others. We describe the matrix formed by the outer product, and other matrices that do not fall into this category, as implicit matrices. As such, we consider the general problem of computing over such implicit matrices with Hadamard functions, which are functions applied entrywise on a matrix. In this paper, we apply this generalization to provide new techniques for identifying independence between two vectors in the streaming model. The previous state of the art algorithm of Braverman and Ostrovsky (2010) gave a $(1 \pm ε)$-approximation for the $L_1$ distance between the product and joint distributions, using space $O(\log^{1024}(nm) ε^{-1024})$, where $m$ is the length of the stream and $n$ denotes the size of the universe from which stream elements are drawn. Our general techniques include the $L_1$ distance as a special case, and we give an improved space bound of $O(\log^{12}(n) \log^{2}({nm \over ε})ε^{-7})$.

preprint, 2015, arXiv

Beating CountSketch for Heavy Hitters in Insertion Streams

Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq ε\sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq εm$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $Θ(\log^2 n)$ bits of space (for constant $ε> 0$). The only known lower bound is $Ω(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n\log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014). (2) A way to estimate the $\ell_{\infty}$ norm of a stream up to additive error $ε\sqrt{F_2}$ with $O(\log n\log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.
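
For orientation, here is a minimal Python sketch of the CountSketch data structure that the abstract takes as its baseline (not the paper's improved algorithm). Each row hashes an item to a bucket and a random sign; the frequency estimate is the median over rows of the signed bucket value. The class name, parameter defaults, and the simple linear hash family modulo a Mersenne prime are assumptions for this example; in particular, the hash family is a simplification that does not provide the independence used in formal analyses, and items are assumed to be nonnegative integers.

```python
import random
import statistics


class CountSketch:
    """Toy CountSketch over integer items (illustrative, not tuned)."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used by the toy hash functions

    def __init__(self, rows=5, cols=256, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]
        # Per row: (a, b) pick the bucket, (c, d) pick the sign.
        self.params = [
            tuple(rng.randrange(1, self._PRIME) for _ in range(4))
            for _ in range(rows)
        ]

    def _bucket_and_sign(self, row, item):
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % self._PRIME) % self.cols
        sign = 1 if ((c * item + d) % self._PRIME) & 1 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process `delta` occurrences of `item`."""
        for r in range(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            self.table[r][bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the signed counter the item maps to."""
        values = []
        for r in range(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            values.append(sign * self.table[r][bucket])
        return statistics.median(values)


sketch = CountSketch(rows=5, cols=64, seed=1)
for _ in range(100):
    sketch.update(7)
estimate = sketch.estimate(7)
```

An item $j$ would be reported as an $\ell_2$-heavy hitter when its estimate is at least $ε\sqrt{F_2}$; how $F_2$ is tracked, and how candidate items are enumerated, is left out of this sketch.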

preprint2015arXiv

Universal sketches for the frequency negative moments and other decreasing streaming sums

Given a stream with frequencies $f_d$, for $d\in[n]$, we characterize the space necessary for approximating the frequency negative moments $F_p=\sum |f_d|^p$, where $p<0$ and the sum is taken over all items $d\in[n]$ with nonzero frequency, in terms of $n$, $ε$, and $m=\sum |f_d|$. To accomplish this, we actually prove a much more general result. Given any nonnegative and nonincreasing function $g$, we characterize the space necessary for any streaming algorithm that outputs a $(1\pmε)$-approximation to $\sum g(|f_d|)$, where again the sum is over items with nonzero frequency. The storage required is expressed in the form of the solution to a relatively simple nonlinear optimization problem, and the algorithm is universal for $(1\pmε)$-approximations to any such sum where the applied function is nonnegative, nonincreasing, and has the same or smaller space complexity as $g$. This partially answers an open question of Nelson (IITK Workshop Kanpur, 2009).

preprint2015arXiv

Weighted Sampling Without Replacement from Data Streams

Weighted sampling without replacement has proved to be a very important tool in designing new algorithms. Efraimidis and Spirakis (IPL 2006) presented an algorithm for weighted sampling without replacement from data streams. Their algorithm works under the assumption of precise computations over the interval [0,1]. Cohen and Kaplan (VLDB 2008) used similar methods for their bottom-k sketches. Efraimidis and Spirakis ask as an open question whether using finite precision arithmetic impacts the accuracy of their algorithm. In this paper we show a method to avoid this problem by providing a precise reduction from k-sampling without replacement to k-sampling with replacement. We call the resulting method Cascade Sampling.
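For context, the reservoir scheme attributed above to Efraimidis and Spirakis can be sketched as follows: each item with weight $w$ draws a uniform $u$ in $[0,1)$, receives the key $u^{1/w}$, and the $k$ items with the largest keys are kept. This is a minimal sketch of that earlier scheme, not of Cascade Sampling, and it relies on floating-point arithmetic, which is exactly the precision concern the abstract raises. The function name and the `(item, weight)` input format are assumptions.

```python
import heapq
import random


def weighted_sample_without_replacement(stream, k, rng=None):
    """Keep the k items with the largest keys u**(1/w) in one pass over (item, weight) pairs."""
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item); the smallest key is at heap[0]
    for idx, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, idx, item))
        elif key > heap[0][0]:
            # The new key beats the smallest kept key, so replace that entry.
            heapq.heapreplace(heap, (key, idx, item))
    return [item for _, _, item in heap]
```

The arrival index is stored only as a tie-breaker so that the heap never has to compare the items themselves.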

preprint2014arXiv

Approximating Large Frequency Moments with $O(n^{1-2/k})$ Bits

In this paper we consider the problem of approximating frequency moments in the streaming model. Given a stream $D = \{p_1,p_2,\dots,p_m\}$ of numbers from $\{1,\dots, n\}$, the frequency of $i$ is defined as $f_i = |\{j: p_j = i\}|$. The $k$-th frequency moment of $D$ is defined as $F_k = \sum_{i=1}^n f_i^k$. We give an upper bound of $O(n^{1-2/k})$ bits on the space required to approximate the $k$-th frequency moment, which matches, up to a constant factor, the lower bound of Woodruff and Zhang (STOC 12) for constant $ε$ and constant $k$. Our algorithm makes a single pass over the stream and works for any constant $k > 3$.
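As a reference point for the definitions of $f_i$ and $F_k$ in this abstract, the moment can be computed exactly when the whole frequency table fits in memory. This is a minimal sketch of the definition only; it uses space linear in the number of distinct items and is not the paper's algorithm. The function name is an assumption.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k = sum over distinct items i of f_i**k, using a full frequency table."""
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())
```

For the stream `[1, 1, 2, 3, 3, 3]` the frequencies are 2, 1 and 3, so $F_3 = 8 + 1 + 27 = 36$, while $F_1 = 6$ is the stream length.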

preprint2014arXiv

Universal Streaming

Given a stream of data, a typical approach in streaming algorithms is to design a sophisticated algorithm with small memory that computes a specific statistic over the streaming data. Usually, if one wants to compute a different statistic after the stream is gone, it is impossible. But what if we want to compute a different statistic after the fact? In this paper, we consider the following fascinating possibility: can we collect some small amount of specific data during the stream that is "universal," i.e., where we do not know anything about the statistics we will want to later compute, other than the guarantee that had we known the statistic ahead of time, it would have been possible to do so with small memory? In other words, is it possible to collect some data in small space during the stream, such that any other statistic that can be computed with comparable space can be computed after the fact? This is indeed what we introduce (and show) in this paper with matching upper and lower bounds: we show that it is possible to collect universal statistics of polylogarithmic size, and prove that these universal statistics allow us after the fact to compute all other statistics that are computable with similar amounts of memory. We show that this is indeed possible, both for the standard unbounded streaming model and the sliding window streaming model.

preprint2013arXiv

How Hard is Counting Triangles in the Streaming Model

The problem of (approximately) counting the number of triangles in a graph is one of the basic problems in graph theory. In this paper we study the problem in the streaming model, focusing on the amount of memory required by a randomized algorithm to solve it. When the algorithm is allowed one pass over the stream, we present a best possible lower bound of $Ω(m)$ for graphs $G$ with $m$ edges on $n$ vertices. If a constant number of passes is allowed, we show a lower bound of $Ω(m/T)$, where $T$ is the number of triangles. We match, in some sense, this lower bound with a 2-pass $O(m/T^{1/3})$-memory algorithm that solves the problem of distinguishing graphs with no triangles from graphs with at least $T$ triangles. We present a new graph parameter $ρ(G)$, the triangle density, and conjecture that the space complexity of the triangles problem is $Ω(m/ρ(G))$. We match this by a second algorithm that solves the distinguishing problem using $O(m/ρ(G))$ memory.

preprint2013arXiv

How to Catch L_2-Heavy-Hitters on Sliding Windows

Finding heavy elements (heavy hitters) in streaming data is one of the central and well-understood tasks. Despite the importance of this problem, in the sliding window model of streaming (where elements eventually expire) the problem of finding L_2-heavy elements has remained completely open, despite multiple papers and considerable success in finding L_1-heavy elements. In this paper, we develop the first poly-logarithmic-memory algorithm for finding L_2-heavy elements in the sliding window model. Since L_2-heavy elements play a central role in many fundamental streaming problems (such as frequency moments), we believe our method will be extremely useful for many sliding window algorithms and applications. For example, our technique allows us not only to find L_2-heavy elements, but also heavy elements with respect to any L_p for 0<p<2 on sliding windows. Thus, our paper completely resolves the question of finding L_p-heavy elements for sliding windows with poly-logarithmic memory for all values of p, since it is well known that for p>2 this task is impossible. Our method may have other applications as well. We demonstrate the broader applicability of our novel yet simple method on two additional examples: we show how to obtain a sliding window approximation of other properties, such as the similarity of two streams or the fraction of elements that appear exactly a specified number of times within the window (the rarity problem). In these two illustrative examples of our method, we replace the current expected memory bounds with worst-case bounds.

preprint2012arXiv

Approximating Large Frequency Moments with Pick-and-Drop Sampling

Given a data stream $D = \{p_1,p_2,...,p_m\}$ of size $m$ of numbers from $\{1,..., n\}$, the frequency of $i$ is defined as $f_i = |\{j: p_j = i\}|$. The $k$-th frequency moment of $D$ is defined as $F_k = \sum_{i=1}^n f_i^k$. We consider the problem of approximating frequency moments in insertion-only streams for $k\ge 3$. For any constant $c$ we show an $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ upper bound on the space complexity of the problem. Here $\log^{(c)}(n)$ is the iterated $\log$ function. To simplify the presentation, we make the following assumptions: $n$ and $m$ are polynomially far; approximation error $ε$ and parameter $k$ are constants. We observe a natural bijection between streams and special matrices. Our main technical contribution is a non-uniform sampling method on matrices. We call our method pick-and-drop sampling; it samples a heavy element (i.e., element $i$ with frequency $Ω(F_k)$) with probability $Ω(1/n^{1-2/k})$ and gives an approximation $\tilde{f_i} \ge (1-ε)f_i$. In addition, the estimates never exceed the real values, that is, $\tilde{f_j} \le f_j$ for all $j$. As a result, we reduce the space complexity of finding a heavy element to $O(n^{1-2/k}\log(n))$ bits. We apply our method of recursive sketches and resolve the problem with $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ bits.

preprint2010arXiv

AMS Without 4-Wise Independence on Product Domains

In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that 4-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrive, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1 \pm ε)$ approximation (with probability $1-δ$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
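The sketching idea described here, a sign for a tuple in $[n]^k$ formed as the product of per-coordinate sign functions, can be illustrated for estimating $F_2$ of a stream of tuples. This is a rough sketch under stated assumptions, not the paper's estimator for the distance between joint and product distributions: each sign function is taken from the low bit of a random degree-3 polynomial over a prime field (an approximation to a 4-wise independent $\pm 1$ function), and the estimate is the plain average of squared counters. All class, function and parameter names are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime used as the field size (assumed choice)


def make_sign_function(rng):
    """Random degree-3 polynomial mod P; its low bit gives a nearly unbiased +/-1 sign."""
    coeffs = [rng.randrange(P) for _ in range(4)]

    def sign(x):
        h = 0
        for c in coeffs:  # Horner evaluation of the polynomial at x
            h = (h * x + c) % P
        return 1 if h & 1 else -1

    return sign


class ProductAMSSketch:
    """AMS-style F_2 sketch for tuples of k integers, using products of coordinate signs."""

    def __init__(self, k, num_counters, rng=None):
        rng = rng or random.Random()
        # One independent sign function per counter and per coordinate.
        self.signs = [[make_sign_function(rng) for _ in range(k)]
                      for _ in range(num_counters)]
        self.counters = [0] * num_counters

    def update(self, item, delta=1):
        """Add delta occurrences of the tuple `item` to every counter."""
        for r, row in enumerate(self.signs):
            s = 1
            for sign, coord in zip(row, item):
                s *= sign(coord)  # sign of the tuple = product of coordinate signs
            self.counters[r] += delta * s

    def estimate_f2(self):
        """Average of squared counters, an estimate of sum over tuples of f^2."""
        return sum(z * z for z in self.counters) / len(self.counters)
```

When the stream contains a single distinct tuple with frequency $f$, every counter equals $\pm f$, so the estimate is exactly $f^2$; for general streams the estimate is randomized and its accuracy depends on the number of counters.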

preprint2010arXiv

Rademacher Chaos, Random Eulerian Graphs and The Sparse Johnson-Lindenstrauss Transform

The celebrated dimension reduction lemma of Johnson and Lindenstrauss has numerous computational and other applications. Due to its application in practice, speeding up the computation of a Johnson-Lindenstrauss style dimension reduction is an important question. Recently, Dasgupta, Kumar, and Sarlos (STOC 2010) constructed such a transform that uses a sparse matrix. This is motivated by the desire to speed up the computation when applied to sparse input vectors, a scenario that comes up in applications. The sparsity of their construction was further improved by Kane and Nelson (arXiv 2010). We improve the previous bound on the number of non-zero entries per column of Kane and Nelson from $O(1/ε\log(1/δ)\log(k/δ))$ (where the target dimension is $k$, the distortion is $1\pm ε$, and the failure probability is $δ$) to $$ O\left(\frac{1}{ε} \left(\frac{\log(1/δ)\log\log\log(1/δ)}{\log\log(1/δ)}\right)^2\right). $$ We also improve the amount of randomness needed to generate the matrix. Our results are obtained by connecting the moments of an order 2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs. Estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos. Our estimates may be of independent interest, and as this part of the argument is decoupled from the analysis of the coefficients of the chaos, we believe that our methods can be useful in the analysis of other chaoses.

preprint2010arXiv

Recursive Sketching For Frequency Moments

In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mathrm{polylog}(n,m)\cdot n^{1-\frac{2}{k}})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $Ω(\log(n)n^{1-\frac{2}{k}})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-\frac{2}{k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-\frac{2}{k}})$ algorithm for constant $ε$ (our bound is, in fact, somewhat stronger: the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $ε$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.

33 published item(s)

AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Online Learning with Limited Information in the Sliding Window Model

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

New Coresets for Projective Clustering and Applications

Pretrained Models for Multilingual Federated Learning

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations

Sublinear Time Spectral Density Estimation

The Benefits of Implicit Regularization from SGD in Least Squares Problems

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

Communication-efficient distributed SGD with Sketching

Coresets for Clustering in Excluded-minor Graphs and Beyond

Data-Independent Neural Pruning via Coresets

Data-Independent Structured Pruning of Neural Networks via Coresets

Obtaining Adjustable Regularization for Free via Iterate Averaging

On the Noisy Gradient Descent that Generalizes as SGD

Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension

Universal Streaming of Subset Norms

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

A Unified Approach for Clustering Problems on Sliding Windows

Approximating Subadditive Hadamard Functions on Implicit Matrices

Beating CountSketch for Heavy Hitters in Insertion Streams

Universal sketches for the frequency negative moments and other decreasing streaming sums

Weighted Sampling Without Replacement from Data Streams

Approximating Large Frequency Moments with $O(n^{1-2/k})$ Bits

Universal Streaming

How Hard is Counting Triangles in the Streaming Model

How to Catch L_2-Heavy-Hitters on Sliding Windows

Approximating Large Frequency Moments with Pick-and-Drop Sampling

AMS Without 4-Wise Independence on Product Domains

Rademacher Chaos, Random Eulerian Graphs and The Sparse Johnson-Lindenstrauss Transform

Recursive Sketching For Frequency Moments