Source author record

David P. Woodruff

David P. Woodruff appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Computational Complexity Information Theory math.IT Computational Geometry Distributed, Parallel, and Cluster Computing math.NA Numerical Analysis Cryptography and Security Databases Discrete Mathematics math.PR quant-ph

Catalog footprint

What is connected

68works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Online Learning with Limited Information in the Sliding Window Model

Motivated by recent work on the experts problem in the streaming model, we consider the experts problem in the sliding window model. The sliding window model is a well-studied model that captures applications such as traffic monitoring, epidemic tracking, and automated trading, where recent information is more valuable than older data. Formally, we have $n$ experts, $T$ days, the ability to query the predictions of $q$ experts on each day, a limited amount of memory, and should achieve the (near-)optimal regret $\sqrt{nW}\text{polylog}(nT)$ regret over any window of the last $W$ days. While it is impossible to achieve such regret with $1$ query, we show that with $2$ queries we can achieve such regret and with only $\text{polylog}(nT)$ bits of memory. Not only are our algorithms optimal for sliding windows, but we also show for every interval $\mathcal{I}$ of days that we achieve $\sqrt{n|\mathcal{I}|}\text{polylog}(nT)$ regret with $2$ queries and only $\text{polylog}(nT)$ bits of memory, providing an exponential improvement on the memory of previous interval regret algorithms. Building upon these techniques, we address the bandit problem in data streams, where $q=1$, achieving $n T^{2/3}\text{polylog}(T)$ regret with $\text{polylog}(nT)$ memory, which is the first sublinear regret in the streaming model in the bandit setting with polylogarithmic memory; this can be further improved to the optimal $\mathcal{O}(\sqrt{nT})$ regret if the best expert's losses are in a random order.

preprint2026arXiv

Progress on the Courtade-Kumar Conjecture: Optimal High-Noise Entropy Bounds and Generalized Coordinate-wise Mutual Information

The Courtade-Kumar conjecture posits that dictatorship functions maximize the mutual information between the function's output and a noisy version of its input over the Boolean hypercube. We present two significant advancements related to this conjecture. First, we resolve an open question posed by Courtade and Kumar, proving that for any Boolean function (regardless of bias), the sum of mutual information between the function's output and the individual noisy input coordinates is bounded by $1-H(α)$, where $α$ is the noise parameter of the Binary Symmetric Channel. This generalizes their previous result which was restricted to balanced Boolean functions. Second, we advance the study of the main conjecture in the high noise regime. We establish an optimal error bound of $O(λ^2)$ for the asymptotic entropy expansion, where $λ= (1-2α)^2$, improving upon the previous best-known bounds. This refined analysis leads to a sharp, linear Fourier concentration bound for highly informative functions and significantly extends the range of the noise parameter $λ$ for which the conjecture is proven to hold.

preprint2026arXiv

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $Ω(k^2)$ queries or many rounds of interactive queries. Technically, we introduce a new object we dub the Scheffé graph, which captures structure of the differences between distributions in $Q$, and may be of more broad interest for hypothesis selection tasks.

preprint2024arXiv

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time and \emph{without sacrificing space} for a large number of data stream algorithms, such as $F_p$ estimation in the parameter regimes $p > 2$ and $0 < p < 2$ and CountSketch with tight estimation guarantees as analyzed by Minton and Price (SODA 2014) which assumed access to a random oracle. We also show a recent analysis of Private CountSketch can be derandomized using our techniques. For a $d$-dimensional vector $x$ being updated in a turnstile stream, we show that $\|x\|_{\infty}$ can be estimated up to an additive error of $\varepsilon\|x\|_{2}$ using $O(\varepsilon^{-2}\log(1/\varepsilon)\log d)$ bits of space. Additionally, the update time of this algorithm is $O(\log 1/\varepsilon)$ in the Word RAM model. We show that the space complexity of this algorithm is optimal up to constant factors. However, for vectors $x$ with $\|x\|_{\infty} = Θ(\|x\|_{2})$, we show that the lower bound can be broken by giving an algorithm that uses $O(\varepsilon^{-2}\log d)$ bits of space which approximates $\|x\|_{\infty}$ up to an additive error of $\varepsilon\|x\|_{2}$. We use our aforementioned derandomization of the CountSketch data structure to obtain this algorithm, and using the time-space trade off of HashPRG, we show that the update time of this algorithm is also $O(\log 1/\varepsilon)$ in the Word RAM model.

preprint2024arXiv

Sharper Bounds for $\ell_p$ Sensitivity Sampling

In large scale machine learning, random sampling is a popular way to approximate datasets by a small representative subset of examples. In particular, sensitivity sampling is an intensely studied technique which provides provable guarantees on the quality of approximation, while reducing the number of examples to the product of the VC dimension $d$ and the total sensitivity $\mathfrak S$ in remarkably general settings. However, guarantees going beyond this general bound of $\mathfrak S d$ are known in perhaps only one setting, for $\ell_2$ subspace embeddings, despite intense study of sensitivity sampling in prior work. In this work, we show the first bounds for sensitivity sampling for $\ell_p$ subspace embeddings for $p > 2$ that improve over the general $\mathfrak S d$ bound, achieving a bound of roughly $\mathfrak S^{2-2/p}$ for $2<p<\infty$. Furthermore, our techniques yield further new results in the study of sampling algorithms, showing that the root leverage score sampling algorithm achieves a bound of roughly $d$ for $1\leq p<2$, and that a combination of leverage score and sensitivity sampling achieves an improved bound of roughly $d^{2/p}\mathfrak S^{2-4/p}$ for $2<p<\infty$. Our sensitivity sampling results yield the best known sample complexity for a wide class of structured matrices that have small $\ell_p$ sensitivity.

preprint2023arXiv

Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

We study fundamental problems in linear algebra, such as finding a maximal linearly independent subset of rows or columns (a basis), solving linear regression, or computing a subspace embedding. For these problems, we consider input matrices $\mathbf{A}\in\mathbb{R}^{n\times d}$ with $n > d$. The input can be read in $\text{nnz}(\mathbf{A})$ time, which denotes the number of nonzero entries of $\mathbf{A}$. In this paper, we show that beyond the time required to read the input matrix, these fundamental linear algebra problems can be solved in $d^ω$ time, i.e., where $ω\approx 2.37$ is the current matrix-multiplication exponent. To do so, we introduce a constant-factor subspace embedding with the optimal $m=\mathcal{O}(d)$ number of rows, and which can be applied in time $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}α\right) + d^{2 + α}\text{poly}(\log d)$ for any trade-off parameter $α>0$, tightening a recent result by Chepurko et. al. [SODA 2022] that achieves an $\exp(\text{poly}(\log\log n))$ distortion with $m=d\cdot\text{poly}(\log\log d)$ rows in $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}α+d^{2+α+o(1)}\right)$ time. Our subspace embedding uses a recently shown property of stacked Subsampled Randomized Hadamard Transforms (SRHT), which actually increase the input dimension, to "spread" the mass of an input vector among a large number of coordinates, followed by random sampling. To control the effects of random sampling, we use fast semidefinite programming to reweight the rows. We then use our constant-factor subspace embedding to give the first optimal runtime algorithms for finding a maximal linearly independent subset of columns, regression, and leverage score sampling. To do so, we also introduce a novel subroutine that iteratively grows a set of independent rows, which may be of independent interest.

preprint2022arXiv

Adaptive Sketches for Robust Regression with Importance Sampling

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large scale machine learning, it is well-known for possibly experiencing slow convergence rates due to the high variance from uniform sampling. On the other hand, importance sampling can significantly decrease the variance but is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples $T$ gradients of dimension $d$ from nearly the optimal importance sampling distribution for a robust regression problem over $n$ rows. Thus our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.

preprint2022arXiv

Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis

A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors. We observe that by instead initializing the weights into independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $γ^{-8}$ [Ji and Telgarsky, ICLR 2020] to $γ^{-2}$, where $γ$ denotes the separation margin with a Neural Tangent Kernel, as well as in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting we also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.

preprint2022arXiv

Fast Regression for Structured Inputs

We study the $\ell_p$ regression problem, which requires finding $\mathbf{x}\in\mathbb R^{d}$ that minimizes $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_p$ for a matrix $\mathbf{A}\in\mathbb R^{n \times d}$ and response vector $\mathbf{b}\in\mathbb R^{n}$. There has been recent interest in developing subsampling methods for this problem that can outperform standard techniques when $n$ is very large. However, all known subsampling approaches have run time that depends exponentially on $p$, typically, $d^{\mathcal{O}(p)}$, which can be prohibitively expensive. We improve on this work by showing that for a large class of common \emph{structured matrices}, such as combinations of low-rank matrices, sparse matrices, and Vandermonde matrices, there are subsampling based methods for $\ell_p$ regression that depend polynomially on $p$. For example, we give an algorithm for $\ell_p$ regression on Vandermonde matrices that runs in time $\mathcal{O}(n\log^3 n+(dp^2)^{0.5+ω}\cdot\text{polylog}\,n)$, where $ω$ is the exponent of matrix multiplication. The polynomial dependence on $p$ crucially allows our algorithms to extend naturally to efficient algorithms for $\ell_\infty$ regression, via approximation of $\ell_\infty$ by $\ell_{\mathcal{O}(\log n)}$. Of practical interest, we also develop a new subsampling algorithm for $\ell_p$ regression for arbitrary matrices, which is simpler than previous approaches for $p \ge 4$.

preprint2022arXiv

Learning Augmented Binary Search Trees

A treap is a classic randomized binary search tree data structure that is easy to implement and supports O(\log n) expected time access. However, classic treaps do not take advantage of the input distribution or patterns in the input. Given recent advances in algorithms with predictions, we propose pairing treaps with machine advice to form a learning-augmented treap. We are the first to propose a learning-augmented data structure that supports binary search tree operations such as range-query and successor functionalities. With the assumption that we have access to advice from a frequency estimation oracle, we assign learned priorities to the nodes to better improve the treap's structure. We theoretically analyze the learning-augmented treap's performance under various input distributions and show that under those circumstances, our learning-augmented treap has stronger guarantees than classic treaps and other classic tree-based data structures. Further, we experimentally evaluate our learned treap on synthetic datasets and demonstrate a performance advantage over other search tree data structures. We also present experiments on real world datasets with known frequency estimation oracles and show improvements as well.

preprint2022arXiv

Learning-Augmented $k$-means Clustering

$k$-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the $k$-means problem on worst-case inputs. To overcome this barrier, we consider a scenario where "advice" is provided to help perform clustering. Specifically, we consider the $k$-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some, possibly adversarial, error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naïvely following the accurate predictor can still lead to a high clustering cost. Thus if the predictor is sufficiently accurate, we can retrieve a close to optimal clustering with nearly optimal runtime, breaking known computational barriers for algorithms that do not have access to such advice. We evaluate our algorithms on real datasets and show significant improvements in the quality of clustering.

preprint2022arXiv

Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time

We propose an input sparsity time sampling algorithm that can spectrally approximate the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special case of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the input dataset and has no dependence on $q$. Previous techniques either incur poly$(q)$ slowdowns in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We also show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.

preprint2022arXiv

Low-Rank Approximation with $1/ε^{1/3}$ Matrix-Vector Products

We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $ε$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+ε)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrtε)$ matrix-vector products, improving on the naïve $\tilde{O}(k/ε)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/ε))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/ε^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/ε^{1/2})$ bound to $\tilde{O}(k/ε^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $(1+ ε)$-factor when $p \geq (\log d)/ε$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $Ω(1/ε^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tildeΘ(1/ε^{1/3})$ is the optimal complexity for constant~$k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.

preprint2022arXiv

Memory Bounds for the Experts Problem

Online learning with expert advice is a fundamental problem of sequential prediction. In this problem, the algorithm has access to a set of $n$ "experts" who make predictions on each day. The goal on each day is to process these predictions, and make a prediction with the minimum cost. After making a prediction, the algorithm sees the actual outcome on that day, updates its state, and then moves on to the next day. An algorithm is judged by how well it does compared to the best expert in the set. The classical algorithm for this problem is the multiplicative weights algorithm. However, every application, to our knowledge, relies on storing weights for every expert, and uses $Ω(n)$ memory. There is little work on understanding the memory required to solve the online learning with expert advice problem, or run standard sequential prediction algorithms, in natural streaming models, which is especially important when the number of experts, as well as the number of days on which the experts make predictions, is large. We initiate the study of the learning with expert advice problem in the streaming setting, and show lower and upper bounds. Our lower bound for i.i.d., random order, and adversarial order streams uses a reduction to a custom-built problem using a novel masking technique, to show a smooth trade-off for regret versus memory. Our upper bounds show novel ways to run standard sequential prediction algorithms in rounds on small "pools" of experts, thus reducing the necessary memory. For random-order streams, we show that our upper bound is tight up to low order terms. We hope that these results and techniques will have broad applications in online learning, and can inspire algorithms based on standard sequential prediction techniques, like multiplicative weights, for a wide range of other problems in the memory-constrained setting.

preprint2022arXiv

Noisy Boolean Hidden Matching with Applications

The Boolean Hidden Matching (BHM) problem, introduced in a seminal paper of Gavinsky et. al. [STOC'07], has played an important role in the streaming lower bounds for graph problems such as triangle and subgraph counting, maximum matching, MAX-CUT, Schatten $p$-norm approximation, maximum acyclic subgraph, testing bipartiteness, $k$-connectivity, and cycle-freeness. The one-way communication complexity of the Boolean Hidden Matching problem on a universe of size $n$ is $Θ(\sqrt{n})$, resulting in $Ω(\sqrt{n})$ lower bounds for constant factor approximations to several of the aforementioned graph problems. The related (and, in fact, more general) Boolean Hidden Hypermatching (BHH) problem introduced by Verbin and Yu [SODA'11] provides an approach to proving higher lower bounds of $Ω(n^{1-1/t})$ for integer $t\geq 2$. Reductions based on Boolean Hidden Hypermatching generate distributions on graphs with connected components of diameter about $t$, and basically show that long range exploration is hard in the streaming model of computation with adversarial arrivals. In this paper we introduce a natural variant of the BHM problem, called noisy BHM (and its natural noisy BHH variant), that we use to obtain higher than $Ω(\sqrt{n})$ lower bounds for approximating several of the aforementioned problems in graph streams when the input graphs consist only of components of diameter bounded by a fixed constant. We also use the noisy BHM problem to show that the problem of classifying whether an underlying graph is isomorphic to a complete binary tree in insertion-only streams requires $Ω(n)$ space, which seems challenging to show using BHM or BHH alone.

preprint2022arXiv

Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra

We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression that are comparable to their quantum analogues. De-quantizing such algorithms has received a flurry of attention in recent years; we obtain sharper bounds for these problems. More significantly, we achieve these improvements by arguing that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise; these are powerful and standard techniques in randomized numerical linear algebra. With this recognition, we are able to employ the large body of work in numerical linear algebra to obtain algorithms for these problems that are simpler or faster (or both) than existing approaches. Our experiments demonstrate that the proposed data structures also work well on real-world datasets.

preprint2022arXiv

Separations for Estimating Large Frequency Moments on Data Streams

We study the classical problem of moment estimation of an underlying vector whose $n$ coordinates are implicitly defined through a series of updates in a data stream. We show that if the updates to the vector arrive in the random-order insertion-only model, then there exist space efficient algorithms with improved dependencies on the approximation parameter $\varepsilon$. In particular, for any real $p > 2$, we first obtain an algorithm for $F_p$ moment estimation using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of memory. Our techniques also give algorithms for $F_p$ moment estimation with $p>2$ on arbitrary order insertion-only and turnstile streams, using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of space and two passes, which is the first optimal multi-pass $F_p$ estimation algorithm up to $\log n$ factors. Finally, we give an improved lower bound of $Ω\left(\frac{1}{\varepsilon^2}\cdot n^{1-2/p}\right)$ for one-pass insertion-only streams. Our results separate the complexity of this problem both between random and non-random orders, as well as one-pass and multi-pass streams.

preprint2022arXiv

Sketching Algorithms and Lower Bounds for Ridge Regression

We give a sketching-based iterative algorithm that computes a $1+\varepsilon$ approximate solution for the ridge regression problem $\min_x \|Ax-b\|_2^2 +λ\|x\|_2^2$ where $A \in R^{n \times d}$ with $d \ge n$. Our algorithm, for a constant number of iterations (requiring a constant number of passes over the input), improves upon earlier work (Chowdhury et al.) by requiring that the sketching matrix only has a weaker Approximate Matrix Multiplication (AMM) guarantee that depends on $\varepsilon$, along with a constant subspace embedding guarantee. The earlier work instead requires that the sketching matrix has a subspace embedding guarantee that depends on $\varepsilon$. For example, to produce a $1+\varepsilon$ approximate solution in $1$ iteration, which requires $2$ passes over the input, our algorithm requires the OSNAP embedding to have $m= O(nσ^2/λ\varepsilon)$ rows with a sparsity parameter $s = O(\log(n))$, whereas the earlier algorithm of Chowdhury et al. with the same number of rows of OSNAP requires a sparsity $s = O(\sqrt{σ^2/λ\varepsilon} \cdot \log(n))$, where $σ= \opnorm{A}$ is the spectral norm of the matrix $A$. We also show that this algorithm can be used to give faster algorithms for kernel ridge regression. Finally, we show that the sketch size required for our algorithm is essentially optimal for a natural framework of algorithms for ridge regression by proving lower bounds on oblivious sketching matrices for AMM. The sketch size lower bounds for AMM may be of independent interest.

preprint2022arXiv

Streaming Algorithms with Large Approximation Factors

We initiate a broad study of classical problems in the streaming model with insertions and deletions in the setting where we allow the approximation factor $α$ to be much larger than $1$. Such algorithms can use significantly less memory than the usual setting for which $α= 1+ε$ for an $ε\in (0,1)$. We study large approximations for a number of problems in sketching and streaming and the following are some of our results. For the $\ell_p$ norm/quasinorm $\|x\|_p$ of an $n$-dimensional vector $x$, $0 < p \le 2$, we show that obtaining a $\poly(n)$-approximation requires the same amount of memory as obtaining an $O(1)$-approximation for any $M = n^{Θ(1)}$. For estimating the $\ell_p$ norm, $p > 2$, we show an upper bound of $O(n^{1-2/p} (\log n \allowbreak \log M)/α^{2})$ bits for an $α$-approximation, and give a matching lower bound, for almost the full range of $α\geq 1$ for linear sketches. For the $\ell_2$-heavy hitters problem, we show that the known lower bound of $Ω(k \log n\log M)$ bits for identifying $(1/k)$-heavy hitters holds even if we are allowed to output items that are $1/(αk)$-heavy, for almost the full range of $α$, provided the algorithm succeeds with probability $1-O(1/n)$. We also obtain a lower bound for linear sketches that is tight even for constant probability algorithms. For estimating the number $\ell_0$ of distinct elements, we give an $n^{1/t}$-approximation algorithm using $O(t\log \log M)$ bits of space, as well as a lower bound of $Ω(t)$ bits, both excluding the storage of random bits.

preprint2022arXiv

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

We obtain the first strong coresets for the $k$-median and subspace approximation problems with sum of distances objective function, on $n$ points in $d$ dimensions, with a number of weighted points that is independent of both $n$ and $d$; namely, our coresets have size $\text{poly}(k/ε)$. A strong coreset $(1+ε)$-approximates the cost function for all possible sets of centers simultaneously. We also give efficient $\text{nnz}(A) + (n+d)\text{poly}(k/ε) + \exp(\text{poly}(k/ε))$ time algorithms for computing these coresets. We obtain the result by introducing a new dimensionality reduction technique for coresets that significantly generalizes an earlier result of Feldman, Sohler and Schmidt \cite{FSS13} for squared Euclidean distances to sums of $p$-th powers of Euclidean distances for constant $p\ge1$.

preprint2022arXiv

Triangle and Four Cycle Counting with Predictions in Graph Streams

We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.

preprint2021arXiv

A Framework for Adversarially Robust Streaming Algorithms

We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. In this work, we show that the answer is positive for various important streaming problems in the insertion-only model, including distinct elements and more generally $F_p$-estimation, $F_p$-heavy hitters, entropy estimation, and others. For all of these problems, we develop adversarially robust $(1+\varepsilon)$-approximation algorithms whose required space matches that of the best known non-robust algorithms up to a $\text{poly}(\log n, 1/\varepsilon)$ multiplicative factor (and in some cases even up to a constant factor). Towards this end, we develop several generic tools allowing one to efficiently transform a non-robust streaming algorithm into a robust one in various scenarios.

preprint2021arXiv

A PTAS for $\ell_p$-Low Rank Approximation

A number of recent works have studied algorithms for entrywise $\ell_p$-low rank approximation, namely, algorithms which given an $n \times d$ matrix $A$ (with $n \geq d$), output a rank-$k$ matrix $B$ minimizing $\|A-B\|_p^p=\sum_{i,j}|A_{i,j}-B_{i,j}|^p$ when $p > 0$; and $\|A-B\|_0=\sum_{i,j}[A_{i,j}\neq B_{i,j}]$ for $p=0$. On the algorithmic side, for $p \in (0,2)$, we give the first $(1+ε)$-approximation algorithm running in time $n^{\text{poly}(k/ε)}$. Further, for $p = 0$, we give the first almost-linear time approximation scheme for what we call the Generalized Binary $\ell_0$-Rank-$k$ problem. Our algorithm computes $(1+ε)$-approximation in time $(1/ε)^{2^{O(k)}/ε^{2}} \cdot nd^{1+o(1)}$. On the hardness of approximation side, for $p \in (1,2)$, assuming the Small Set Expansion Hypothesis and the Exponential Time Hypothesis (ETH), we show that there exists $δ:= δ(α) > 0$ such that the entrywise $\ell_p$-Rank-$k$ problem has no $α$-approximation algorithm running in time $2^{k^δ}$.

preprint2021arXiv

On Coresets for Logistic Regression

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies the hardness of compressing a data set for logistic regression. $μ(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $μ(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

preprint2021arXiv

Subspace exploration: Bounds on Projected Frequency Estimation

Given an $n \times d$ dimensional dataset $A$, a projection query specifies a subset $C \subseteq [d]$ of columns which yields a new $n \times |C|$ array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show $2^{Ω(d)}$ lower bounds. However, we present upper bounds which demonstrate space dependency better than $2^d$. That is, for $c,c' \in (0,1)$ and a parameter $N=2^d$ an $N^c$-approximation can be obtained in space $\min(N^{c'},n)$, showing that it is possible to improve on the naïve approach of keeping information for all $2^d$ subsets of $d$ columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.

preprint2020arXiv

Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

We study the column subset selection problem with respect to the entrywise $\ell_1$-norm loss. It is known that in the worst case, to obtain a good rank-$k$ approximation to a matrix, one needs an arbitrarily large $n^{Ω(1)}$ number of columns to obtain a $(1+ε)$-approximation to the best entrywise $\ell_1$-norm low rank approximation of an $n \times n$ matrix. Nevertheless, we show that under certain minimal and realistic distributional settings, it is possible to obtain a $(1+ε)$-approximation with a nearly linear running time and poly$(k/ε)+O(k\log n)$ columns. Namely, we show that if the input matrix $A$ has the form $A = B + E$, where $B$ is an arbitrary rank-$k$ matrix, and $E$ is a matrix with i.i.d. entries drawn from any distribution $μ$ for which the $(1+γ)$-th moment exists, for an arbitrarily small constant $γ> 0$, then it is possible to obtain a $(1+ε)$-approximate column subset selection to the entrywise $\ell_1$-norm in nearly linear time. Conversely we show that if the first moment does not exist, then it is not possible to obtain a $(1+ε)$-approximate subset selection algorithm even if one chooses any $n^{o(1)}$ columns. This is the first algorithm of any kind for achieving a $(1+ε)$-approximation for entrywise $\ell_1$-norm loss low rank approximation.

preprint2020arXiv

Low Rank Approximation with Entrywise $\ell_1$-Norm Error

We study the $\ell_1$-low rank approximation problem, where for a given $n \times d$ matrix $A$ and approximation factor $α\geq 1$, the goal is to output a rank-$k$ matrix $\widehat{A}$ for which $$\|A-\widehat{A}\|_1 \leq α\cdot \min_{\textrm{rank-}k\textrm{ matrices}~A'}\|A-A'\|_1,$$ where for an $n \times d$ matrix $C$, we let $\|C\|_1 = \sum_{i=1}^n \sum_{j=1}^d |C_{i,j}|$. This error measure is known to be more robust than the Frobenius norm in the presence of outliers and is indicated in models where Gaussian assumptions on the noise may not apply. The problem was shown to be NP-hard by Gillis and Vavasis and a number of heuristics have been proposed. It was asked in multiple places if there are any approximation algorithms. We give the first provable approximation algorithms for $\ell_1$-low rank approximation, showing that it is possible to achieve approximation factor $α= (\log d) \cdot \mathrm{poly}(k)$ in $\mathrm{nnz}(A) + (n+d) \mathrm{poly}(k)$ time, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. If $k$ is constant, we further improve the approximation ratio to $O(1)$ with a $\mathrm{poly}(nd)$-time algorithm. Under the Exponential Time Hypothesis, we show there is no $\mathrm{poly}(nd)$-time algorithm achieving a $(1+\frac{1}{\log^{1+γ}(nd)})$-approximation, for $γ> 0$ an arbitrarily small constant, even when $k = 1$. We give a number of additional results for $\ell_1$-low rank approximation: nearly tight upper and lower bounds for column subset selection, CUR decompositions, extensions to low rank approximation with respect to $\ell_p$-norms for $1 \leq p < 2$ and earthmover distance, low-communication distributed protocols and low-memory streaming algorithms, algorithms with limited randomness, and bicriteria algorithms. We also give a preliminary empirical evaluation.

preprint2020arXiv

LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work highlights the importance of finding pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, some dimensions are often highly skewed because they are very popular. Together these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. Our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.

preprint2020arXiv

Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling

To accelerate kernel methods, we propose a near input sparsity time algorithm for sampling the high-dimensional feature space implicitly defined by a kernel transformation. Our main contribution is an importance sampling method for subsampling the feature space of a degree $q$ tensoring of data points in almost input sparsity time, improving the recent oblivious sketching method of (Ahle et al., 2020) by a factor of $q^{5/2}/ε^2$. This leads to a subspace embedding for the polynomial kernel, as well as the Gaussian kernel, with a target dimension that is only linearly dependent on the statistical dimension of the kernel and in time which is only linearly dependent on the sparsity of the input dataset. We show how our subspace embedding bounds imply new statistical guarantees for kernel ridge regression. Furthermore, we empirically show that in large-scale regression tasks, our algorithm outperforms state-of-the-art kernel approximation methods.

preprint2020arXiv

Non-Adaptive Adaptive Sampling on Turnstile Streams

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$, where $n\gg d$, with probabilities proportional to their distances to the subspace of the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams and using space $\text{poly}(d,k,\log n)$, where $k$ is the number of adaptive sampling rounds to be performed. Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model. We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$. We complement our volume maximization algorithmic results with lower bounds that are tight up to lower order terms, even for multi-pass algorithms. By a similar construction, we also obtain lower bounds for volume maximization in the row-arrival model, which we match with competitive upper bounds. See paper for full abstract.

preprint2020arXiv

Span Recovery for Deep Neural Networks with Applications to Input Obfuscation

The tremendous success of deep neural networks has motivated the need to better understand the fundamental properties of these networks, but many of the theoretical results proposed have only been for shallow networks. In this paper, we study an important primitive for understanding the meaningful input space of a deep network: span recovery. For $k<n$, let $\mathbf{A} \in \mathbb{R}^{k \times n}$ be the innermost weight matrix of an arbitrary feed forward neural network $M:\mathbb{R}^n \to \mathbb{R}$, so $M(x)$ can be written as $M(x) = σ(\mathbf{A} x)$, for some network $σ:\mathbb{R}^k \to \mathbb{R}$. The goal is then to recover the row span of $\mathbf{A}$ given only oracle access to the value of $M(x)$. We show that if $M$ is a multi-layered network with ReLU activation functions, then partial recovery is possible: namely, we can provably recover $k/2$ linearly independent vectors in the row span of $\mathbf{A}$ using poly$(n)$ non-adaptive queries to $M(x)$. Furthermore, if $M$ has differentiable activation functions, we demonstrate that full span recovery is possible even when the output is first passed through a sign or $0/1$ thresholding function; in this case our algorithm is adaptive. Empirically, we confirm that full span recovery is not always possible, but only for unrealistically thin layers. For reasonably wide networks, we obtain full span recovery on both random networks and networks trained on MNIST data. Furthermore, we demonstrate the utility of span recovery as an attack by inducing neural networks to misclassify data obfuscated by controlled random noise as sensical inputs.

preprint2020arXiv

Streaming Complexity of SVMs

We study the space complexity of solving the bias-regularized SVM problem in the streaming model. This is a classic supervised learning problem that has drawn lots of attention, including for developing fast algorithms for solving the problem approximately. One of the most widely used algorithms for approximately optimizing the SVM objective is Stochastic Gradient Descent (SGD), which requires only $O(\frac{1}{λε})$ random samples, and which immediately yields a streaming algorithm that uses $O(\frac{d}{λε})$ space. For related problems, better streaming algorithms are only known for smooth functions, unlike the SVM objective that we focus on in this work. We initiate an investigation of the space complexity for both finding an approximate optimum of this objective, and for the related ``point estimation'' problem of sketching the data set to evaluate the function value $F_λ$ on any query $(θ, b)$. We show that, for both problems, for dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{λε}$, which is the complexity of SGD for strongly convex functions like the bias-regularized SVM, and which is known to be tight in general, even for $d=1$. We also prove polynomial lower bounds for both point estimation and optimization. In particular, for point estimation we obtain a tight bound of $Θ(1/\sqrtε)$ for $d=1$ and a nearly tight lower bound of $\widetildeΩ(d/ε^2)$ for $d = Ω( \log(1/ε))$. Finally, for optimization, we prove a $Ω(1/\sqrtε)$ lower bound for $d = Ω( \log(1/ε))$, and show similar bounds when $d$ is constant.

preprint2020arXiv

Towards a Zero-One Law for Column Subset Selection

There are a number of approximation algorithms for NP-hard versions of low rank approximation, such as finding a rank-$k$ matrix $B$ minimizing the sum of absolute values of differences to a given $n$-by-$n$ matrix $A$, $\min_{\textrm{rank-}k~B}\|A-B\|_1$, or more generally finding a rank-$k$ matrix $B$ which minimizes the sum of $p$-th powers of absolute values of differences, $\min_{\textrm{rank-}k~B}\|A-B\|_p^p$. Many of these algorithms are linear time columns subset selection algorithms, returning a subset of $\mathrm{poly}(k \log n)$ columns whose cost is no more than a $\mathrm{poly}(k)$ factor larger than the cost of the best rank-$k$ matrix. The above error measures are special cases of the following general entrywise low rank approximation problem: given an arbitrary function $g:\mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$, find a rank-$k$ matrix $B$ which minimizes $\|A-B\|_g = \sum_{i,j}g(A_{i,j}-B_{i,j})$. A natural question is which functions $g$ admit efficient approximation algorithms? Indeed, this is a central question of recent work studying generalized low rank models. In this work we give approximation algorithms for $\textit{every}$ function $g$ which is approximately monotone and satisfies an approximate triangle inequality, and we show both of these conditions are necessary. Further, our algorithm is efficient if the function $g$ admits an efficient approximate regression algorithm. Our approximation algorithms handle functions which are not even scale-invariant, such as the Huber loss function, which we show have very different structural properties than $\ell_p$-norms, e.g., one can show the lack of scale-invariance causes any column subset selection algorithm to provably require a $\sqrt{\log n}$ factor larger number of columns than $\ell_p$-norms; nevertheless we design the first efficient column subset selection algorithms for such error measures.

preprint2020arXiv

Vector-Matrix-Vector Queries for Solving Linear Algebra, Statistics, and Graph Problems

We consider the general problem of learning about a matrix through vector-matrix-vector queries. These queries provide the value of $\boldsymbol{u}^{\mathrm{T}}\boldsymbol{M}\boldsymbol{v}$ over a fixed field $\mathbb{F}$ for a specified pair of vectors $\boldsymbol{u},\boldsymbol{v} \in \mathbb{F}^n$. To motivate these queries, we observe that they generalize many previously studied models, such as independent set queries, cut queries, and standard graph queries. They also specialize the recently studied matrix-vector query model. Our work is exploratory and broad, and we provide new upper and lower bounds for a wide variety of problems, spanning linear algebra, statistics, and graphs. Many of our results are nearly tight, and we use diverse techniques from linear algebra, randomized algorithms, and communication complexity.

preprint2020arXiv

Weighted Maximum Independent Set of Geometric Objects in Turnstile Streams

We study the Maximum Independent Set problem for geometric objects given in the data stream model. A set of geometric objects is said to be independent if the objects are pairwise disjoint. We consider geometric objects in one and two dimensions, i.e., intervals and disks. Let $α$ be the cardinality of the largest independent set. Our goal is to estimate $α$ in a small amount of space, given that the input is received as a one-pass stream. We also consider a generalization of this problem by assigning weights to each object and estimating $β$, the largest value of a weighted independent set. We initialize the study of this problem in the turnstile streaming model (insertions and deletions) and provide the first algorithms for estimating $α$ and $β$. For unit-length intervals, we obtain a $(2+ε)$-approximation to $α$ and $β$ in poly$(\frac{\log(n)}ε)$ space. We also show a matching lower bound. Combined with the $3/2$-approximation for insertion-only streams by Cabello and Perez-Lanterno [CP15], our result implies a separation between the insertion-only and turnstile model. For unit-radius disks, we obtain a $\left(\frac{8\sqrt{3}}π\right)$-approximation to $α$ and $β$ in poly$(\log(n), ε^{-1})$ space, which is closely related to the hexagonal circle packing constant. We provide algorithms for estimating $α$ for arbitrary-length intervals under a bounded intersection assumption and study the parameterized space complexity of estimating $α$ and $β$, where the parameter is the ratio of maximum to minimum interval length.

preprint2020arXiv

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

preprint2016arXiv

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

We give the first optimal bounds for returning the $\ell_1$-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of $m$ items in $\{1, 2, \dots, n\}$ and parameters $0 < ε< ϕ\leq 1$, let $f_i$ denote the frequency of item $i$, i.e., the number of times item $i$ occurs in the stream. With arbitrarily large constant probability, our algorithm returns all items $i$ for which $f_i \geq ϕm$, returns no items $j$ for which $f_j \leq (ϕ-ε)m$, and returns approximations $\tilde{f}_i$ with $|\tilde{f}_i - f_i| \leq εm$ for each item $i$ that it returns. Our algorithm uses $O(ε^{-1} \logϕ^{-1} + ϕ^{-1} \log n + \log \log m)$ bits of space, processes each stream update in $O(1)$ worst-case time, and can report its output in time linear in the output size. We also prove a lower bound, which implies that our algorithm is optimal up to a constant factor in its space complexity. A modification of our algorithm can be used to estimate the maximum frequency up to an additive $εm$ error in the above amount of space, resolving Question 3 in the IITK 2006 Workshop on Algorithms for Data Streams for the case of $\ell_1$-heavy hitters. We also introduce several variants of the heavy hitters and maximum frequency problems, inspired by rank aggregation and voting schemes, and show how our techniques can be applied in such settings. Unlike the traditional heavy hitters problem, some of these variants look at comparisons between items rather than numerical values to determine the frequency of an item.

preprint2016arXiv

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the $m$ machines receives $n$ data points from a $d$-dimensional Gaussian distribution with unknown mean $θ$ which is promised to be $k$-sparse. The machines communicate by message passing and aim to estimate the mean $θ$. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed \textit{sparse linear regression} problem: to achieve the statistical minimax error, the total communication is at least $Ω(\min\{n,d\}m)$, where $n$ is the number of observations that each machine receives and $d$ is the ambient dimension. These lower results improve upon [Sha14,SD'14] by allowing multi-round iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a \textit{distributed data processing inequality}, as a generalization of usual data processing inequalities, which might be of independent interest and useful for other problems.

preprint2016arXiv

Distributed Low Rank Approximation of Implicit Functions of a Matrix

We study distributed low rank approximation in which the matrix to be approximated is only implicitly represented across the different servers. For example, each of $s$ servers may have an $n \times d$ matrix $A^t$, and we may be interested in computing a low rank approximation to $A = f(\sum_{t=1}^s A^t)$, where $f$ is a function which is applied entrywise to the matrix $\sum_{t=1}^s A^t$. We show for a wide class of functions $f$ it is possible to efficiently compute a $d \times d$ rank-$k$ projection matrix $P$ for which $\|A - AP\|_F^2 \leq \|A - [A]_k\|_F^2 + \varepsilon \|A\|_F^2$, where $AP$ denotes the projection of $A$ onto the row span of $P$, and $[A]_k$ denotes the best rank-$k$ approximation to $A$ given by the singular value decomposition. The communication cost of our protocols is $d \cdot (sk/\varepsilon)^{O(1)}$, and they succeed with high probability. Our framework allows us to efficiently compute a low rank approximation to an entry-wise softmax, to a Gaussian kernel expansion, and to $M$-Estimators applied entrywise (i.e., forms of robust low rank approximation). We also show that our additive error approximation is best possible, in the sense that any protocol achieving relative error for these problems requires significantly more communication. Finally, we experimentally validate our algorithms on real datasets.

preprint2016arXiv

New Algorithms for Heavy Hitters in Data Streams

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-$k$, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the $\ell_1$-heavy hitters and $\ell_2$-heavy hitters. There are a number of algorithmic solutions for these problems, starting with the work of Misra and Gries, as well as the CountMin and CountSketch data structures, among others. In this survey paper, accompanying an ICDT invited talk, we cover several recent results developed in this area, which improve upon the classical solutions to these problems. In particular, with coauthors we develop new algorithms for finding $\ell_1$-heavy hitters and $\ell_2$-heavy hitters, with significantly less memory required than what was known, and which are optimal in a number of parameter regimes.

preprint2016arXiv

Optimal approximate matrix product in terms of stable rank

We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having $m = O(\tilde{r}/\varepsilon^2)$ rows. Here $\tilde{r}$ is the maximum stable rank, i.e. squared ratio of Frobenius and operator norms, of the two matrices being multiplied. This is a quantitative improvement over previous work of [MZ11, KVZ14], and is also optimal for any oblivious dimensionality-reducing map. Furthermore, due to the black box reliance on the subspace embedding property in our proofs, our theorem can be applied to a much more general class of sketching matrices than what was known before, in addition to achieving better bounds. For example, one can apply our theorem to efficient subspace embeddings such as the Subsampled Randomized Hadamard Transform or sparse subspace embeddings, or even with subspace embedding constructions that may be developed in the future. Our main theorem, via connections with spectral error matrix multiplication shown in prior work, implies quantitative improvements for approximate least squares regression and low rank approximation. Our main result has also already been applied to improve dimensionality reduction guarantees for $k$-means clustering [CEMMP14], and implies new results for nonparametric regression [YPW15]. We also separately point out that the proof of the "BSS" deterministic row-sampling result of [BSS12] can be modified to show that for any matrices $A, B$ of stable rank at most $\tilde{r}$, one can achieve the spectral norm guarantee for approximate matrix multiplication of $A^T B$ by deterministically sampling $O(\tilde{r}/\varepsilon^2)$ rows that can be found in polynomial time. The original result of [BSS12] was for rank instead of stable rank. Our observation leads to a stronger version of a main theorem of [KMST10].

preprint2016arXiv

Optimal Principal Component Analysis in Distributed and Streaming Models

We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n},$ a rank parameter $k < rank(A)$, and an accuracy parameter $0 < ε< 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $$ || A - U U^T A ||_F^2 \le \left(1 + ε\right) \cdot || A - A_k||_F^2, $$ where $A_k \in R^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA.

preprint2016arXiv

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space. Given a function $g$, we study the space complexity of approximating $\sum_{i=1}^n g(|f_i|)$, where $f\in\mathbb{Z}^n$ is the frequency vector of a turnstile stream. This is a generalization of the well-known frequency moments problem, and previous results apply only when $g$ is monotonic or has a special functional form. Our contribution is to give a condition such that, except for a narrow class of functions $g$, there is a space-efficient approximation algorithm for the sum if and only if $g$ satisfies the condition. The functions $g$ that we are able to characterize include all convex, concave, monotonic, polynomial, and trigonometric functions, among many others, and is the first such characterization for non-monotonic functions. Thus, for nearly all functions of one variable, we answer the open question from the celebrated paper of Alon, Matias and Szegedy (1996).

preprint2015arXiv

Applications of Uniform Sampling: Densest Subgraph and Beyond

Recently [Bhattacharya et al., STOC 2015] provide the first non-trivial algorithm for the densest subgraph problem in the streaming model with additions and deletions to its edges, i.e., for dynamic graph streams. They present a $(0.5-ε)$-approximation algorithm using $\tilde{O}(n)$ space, where factors of $ε$ and $\log(n)$ are suppressed in the $\tilde{O}$ notation. However, the update time of this algorithm is large. To remedy this, they also provide a $(0.25-ε)$-approximation algorithm using $\tilde{O}(n)$ space with update time $\tilde{O}(1)$. In this paper we improve the algorithms by Bhattacharya et al. by providing a $(1-ε)$-approximation algorithm using $\tilde{O}(n)$ space. Our algorithm is conceptually simple - it samples $\tilde{O}(n)$ edges uniformly at random, and finds the densest subgraph on the sampled graph. We also show how to perform this sampling with update time $\tilde{O}(1)$. In addition to this, we show that given oracle access to the edge set, we can implement our algorithm in time $\tilde{O}(n)$ on a graph in the standard RAM model. To the best of our knowledge this is the fastest $(0.5-ε)$-approximation algorithm for the densest subgraph problem in the RAM model given such oracle access. Further, we extend our results to a general class of graph optimization problems that we call heavy subgraph problems. This class contains many interesting problems such as densest subgraph, directed densest subgraph, densest bipartite subgraph, $d$-cut and $d$-heavy connected component. Our result, by characterizing heavy subgraph problems, partially addresses open problem 13 at the IITK Workshop on Algorithms for Data Streams in 2006 regarding the effects of subsampling in this context.

preprint2015arXiv

Beating CountSketch for Heavy Hitters in Insertion Streams

Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq ε\sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq εm$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $Θ(\log^2 n)$ bits of space (for constant $ε> 0$). The only known lower bound is $Ω(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n\log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014). (2) A way to estimate the $\ell_{\infty}$ norm of a stream up to additive error $ε\sqrt{F_2}$ with $O(\log n\log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.

preprint2015arXiv

Frequent Directions : Simple and Deterministic Matrix Sketching

We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix $A \in R^{n \times d}$ one row at a time. It performed $O(d \times \ell)$ operations per row and maintains a sketch matrix $B \in R^{\ell \times d}$ such that for any $k < \ell$ $\|A^TA - B^TB \|_2 \leq \|A - A_k\|_F^2 / (\ell-k)$ and $\|A - π_{B_k}(A)\|_F^2 \leq \big(1 + \frac{k}{\ell-k}\big) \|A-A_k\|_F^2 $ . Here, $A_k$ stands for the minimizer of $\|A - A_k\|_F$ over all rank $k$ matrices (similarly $B_k$) and $π_{B_k}(A)$ is the rank $k$ matrix resulting from projecting $A$ on the row span of $B_k$. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.

preprint2015arXiv

Input Sparsity and Hardness for Robust Subspace Approximation

In the subspace approximation problem, we seek a k-dimensional subspace F of R^d that minimizes the sum of p-th powers of Euclidean distances to a given set of n points a_1, ..., a_n in R^d, for p >= 1. More generally than minimizing sum_i dist(a_i,F)^p,we may wish to minimize sum_i M(dist(a_i,F)) for some loss function M(), for example, M-Estimators, which include the Huber and Tukey loss functions. Such subspaces provide alternatives to the singular value decomposition (SVD), which is the p=2 case, finding such an F that minimizes the sum of squares of distances. For p in [1,2), and for typical M-Estimators, the minimizing $F$ gives a solution that is more robust to outliers than that provided by the SVD. We give several algorithmic and hardness results for these robust subspace approximation problems. We think of the n points as forming an n x d matrix A, and letting nnz(A) denote the number of non-zero entries of A. Our results hold for p in [1,2). We use poly(n) to denote n^{O(1)} as n -> infty. We obtain: (1) For minimizing sum_i dist(a_i,F)^p, we give an algorithm running in O(nnz(A) + (n+d)poly(k/eps) + exp(poly(k/eps))), (2) we show that the problem of minimizing sum_i dist(a_i, F)^p is NP-hard, even to output a (1+1/poly(d))-approximation, answering a question of Kannan and Vempala, and complementing prior results which held for p >2, (3) For loss functions for a wide class of M-Estimators, we give a problem-size reduction: for a parameter K=(log n)^{O(log k)}, our reduction takes O(nnz(A) log n + (n+d) poly(K/eps)) time to reduce the problem to a constrained version involving matrices whose dimensions are poly(K eps^{-1} log n). We also give bicriteria solutions, (4) Our techniques lead to the first O(nnz(A) + poly(d/eps)) time algorithms for (1+eps)-approximate regression for a wide class of convex M-Estimators.

preprint2015arXiv

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation factor $K>0$, and probability of failure $P>0$, we ask: what is the minimal value of $m$ so that there is a distribution over $m \times n$ matrices $A$ with the property that for any $x$, given $Ax$, we can recover a $k$-sparse approximation to $x$ in the given norm with probability at least $1-P$? We give a partial answer to this problem, by showing that for norms that admit efficient linear sketches, the optimal number of measurements $m$ is closely related to the doubling dimension of the metric induced by the norm $\|\cdot\|$ on the set of all $k$-sparse vectors. By applying our result to specific norms, we cast known measurement bounds in our general framework (for the $\ell_p$ norms, $p \in [1,2]$) as well as provide new, measurement-efficient schemes (for the Earth-Mover Distance norm). The latter result directly implies more succinct linear sketches for the well-studied planar $k$-median clustering problem. Finally, our lower bound for the doubling dimension of the EMD norm enables us to address the open question of [Frahling-Sohler, STOC'05] about the space complexity of clustering problems in the dynamic streaming model.

preprint2015arXiv

Sketching as a Tool for Numerical Linear Algebra

This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.

preprint2014arXiv

A Sketching Algorithm for Spectral Graph Sparsification

We study the problem of compressing a weighted graph $G$ on $n$ vertices, building a "sketch" $H$ of $G$, so that given any vector $x \in \mathbb{R}^n$, the value $x^T L_G x$ can be approximated up to a multiplicative $1+ε$ factor from only $H$ and $x$, where $L_G$ denotes the Laplacian of $G$. One solution to this problem is to build a spectral sparsifier $H$ of $G$, which, using the result of Batson, Spielman, and Srivastava, consists of $O(n ε^{-2})$ reweighted edges of $G$ and has the property that simultaneously for all $x \in \mathbb{R}^n$, $x^T L_H x = (1 \pm ε) x^T L_G x$. The $O(n ε^{-2})$ bound is optimal for spectral sparsifiers. We show that if one is interested in only preserving the value of $x^T L_G x$ for a {\it fixed} $x \in \mathbb{R}^n$ (specified at query time) with high probability, then there is a sketch $H$ using only $\tilde{O}(n ε^{-1.6})$ bits of space. This is the first data structure achieving a sub-quadratic dependence on $ε$. Our work builds upon recent work of Andoni, Krauthgamer, and Woodruff who showed that $\tilde{O}(n ε^{-1})$ bits of space is possible for preserving a fixed {\it cut query} (i.e., $x\in \{0,1\}^n$) with high probability; here we show that even for a general query vector $x \in \mathbb{R}^n$, a sub-quadratic dependence on $ε$ is possible. Our result for Laplacians is in sharp contrast to sketches for general $n \times n$ positive semidefinite matrices $A$ with $O(\log n)$ bit entries, for which even to preserve the value of $x^T A x$ for a fixed $x \in \mathbb{R}^n$ (specified at query time) up to a $1+ε$ factor with constant probability, we show an $Ω(n ε^{-2})$ lower bound.

preprint2014arXiv

On The Communication Complexity of Linear Algebraic Problems in the Message Passing Model

We study the communication complexity of linear algebraic problems over finite fields in the multi-player message passing model, proving a number of tight lower bounds. Specifically, for a matrix which is distributed among a number of players, we consider the problem of determining its rank, of computing entries in its inverse, and of solving linear equations. We also consider related problems such as computing the generalized inner product of vectors held on different servers. We give a general framework for reducing these multi-player problems to their two-player counterparts, showing that the randomized $s$-player communication complexity of these problems is at least $s$ times the randomized two-player communication complexity. Provided the problem has a certain amount of algebraic symmetry, which we formally define, we can show the hardest input distribution is a symmetric distribution, and therefore apply a recent multi-player lower bound technique of Phillips et al. Further, we give new two-player lower bounds for a number of these problems. In particular, our optimal lower bound for the two-player version of the matrix rank problem resolves an open question of Sun and Wang. A common feature of our lower bounds is that they apply even to the special "threshold promise" versions of these problems, wherein the underlying quantity, e.g., rank, is promised to be one of just two values, one on each side of some critical threshold. These kinds of promise problems are commonplace in the literature on data streaming as sources of hardness for reductions giving space lower bounds.

preprint2014arXiv

Optimal CUR Matrix Decompositions

The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A,$ together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A,$ as well as a $c \times r$ low-rank matrix $U$ such that the matrix $C U R$ approximates the matrix $A,$ that is, $ || A - CUR ||_F^2 \le (1+ε) || A - A_k||_F^2$, where $||.||_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c=O(k/ε)$ and $r=O(k/ε)$ and rank$(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c, r,$ and rank$(U)$.

preprint2014arXiv

Subspace Embeddings and $\ell_p$-Regression Using Exponential Random Variables

Oblivious low-distortion subspace embeddings are a crucial building block for numerical linear algebra problems. We show for any real $p, 1 \leq p < \infty$, given a matrix $M \in \mathbb{R}^{n \times d}$ with $n \gg d$, with constant probability we can choose a matrix $Π$ with $\max(1, n^{1-2/p}) \poly(d)$ rows and $n$ columns so that simultaneously for all $x \in \mathbb{R}^d$, $\|Mx\|_p \leq \|ΠMx\|_{\infty} \leq \poly(d) \|Mx\|_p.$ Importantly, $ΠM$ can be computed in the optimal $O(\nnz(M))$ time, where $\nnz(M)$ is the number of non-zero entries of $M$. This generalizes all previous oblivious subspace embeddings which required $p \in [1,2]$ due to their use of $p$-stable random variables. Using our matrices $Π$, we also improve the best known distortion of oblivious subspace embeddings of $\ell_1$ into $\ell_1$ with $\tilde{O}(d)$ target dimension in $O(\nnz(M))$ time from $\tilde{O}(d^3)$ to $\tilde{O}(d^2)$, which can further be improved to $\tilde{O}(d^{3/2}) \log^{1/2} n$ if $d = Ω(\log n)$, answering a question of Meng and Mahoney (STOC, 2013). We apply our results to $\ell_p$-regression, obtaining a $(1+\eps)$-approximation in $O(\nnz(M)\log n) + \poly(d/\eps)$ time, improving the best known $\poly(d/\eps)$ factors for every $p \in [1, \infty) \setminus \{2\}$. If one is just interested in a $\poly(d)$ rather than a $(1+\eps)$-approximation to $\ell_p$-regression, a corollary of our results is that for all $p \in [1, \infty)$ we can solve the $\ell_p$-regression problem without using general convex programming, that is, since our subspace embeds into $\ell_{\infty}$ it suffices to solve a linear programming problem. Finally, we give the first protocols for the distributed $\ell_p$-regression problem for every $p \geq 1$ which are nearly optimal in communication and computation.

preprint2014arXiv

The Fast Cauchy Transform and Faster Robust Linear Regression

We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\mathbb{R}^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_p$ to the same problem with input matrix $\tilde A$ of dimension $s \times d$ and corresponding $\tilde b$ of dimension $s\times 1$. Here, $\tilde A$ and $\tilde b$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n\gg d$, for all $p\in[1,\infty)$ except $p=2$. We also provide a suite of improved results for finding well-conditioned bases via ellipsoidal rounding, illustrating tradeoffs between running time and conditioning quality, including a one-pass conditioning algorithm for general $\ell_p$ problems. We also provide an empirical evaluation of implementations of our algorithms for $p=1$, comparing them with related algorithms. Our empirical results show that, in the asymptotic regime, the theory is a very good guide to the practical performance of these algorithms. Our algorithms use our faster constructions of well-conditioned bases for $\ell_p$ spaces and, for $p=1$, a fast subspace embedding of independent interest that we call the Fast Cauchy Transform: a distribution over matrices $Π:\mathbb{R}^n\mapsto \mathbb{R}^{O(d\log d)}$, found obliviously to $A$, that approximately preserves the $\ell_1$ norms: that is, with large probability, simultaneously for all $x$, $\|Ax\|_1 \approx \|ΠAx\|_1$, with distortion $O(d^{2+η})$, for an arbitrarily small constant $η>0$; and, moreover, $ΠA$ can be computed in $O(nd\log d)$ time. The techniques underlying our Fast Cauchy Transform include fast Johnson-Lindenstrauss transforms, low-coherence matrices, and rescaling by Cauchy random variables.

preprint2014arXiv

The Sketching Complexity of Graph Cuts

We study the problem of sketching an input graph, so that given the sketch, one can estimate the weight of any cut in the graph within factor $1+ε$. We present lower and upper bounds on the size of a randomized sketch, focusing on the dependence on the accuracy parameter $ε>0$. First, we prove that for every $ε> 1/\sqrt n$, every sketch that succeeds (with constant probability) in estimating the weight of all cuts $(S,\bar S)$ in an $n$-vertex graph (simultaneously), must be of size $Ω(n/ε^2)$ bits. In the special case where the sketch is itself a weighted graph (which may or may not be a subgraph) and the estimator is the sum of edge weights across the cut in the sketch, i.e., a cut sparsifier, we show the sketch must have $Ω(n/ε^2)$ edges, which is optimal. Despite the long sequence of work on graph sparsification, no such lower bound was known on the size of a cut sparsifier. We then design a randomized sketch that, given $ε\in(0,1)$ and an edge-weighted $n$-vertex graph, produces a sketch of size $\tilde O(n/ε)$ bits, from which the weight of any cut $(S,\bar S)$ can be reported, with high probability, within factor $1+ε$. The previous upper bound is $\tilde O(n/ε^2)$ bits, which follows by storing a cut sparsifier (Bencz{ú}r and Karger, 1996). To obtain this improvement, we critically use both that the sketch need only be correct on each fixed cut with high probability (rather than on all cuts), and that the estimation procedure of the data structure can be arbitrary (rather than a weighted subgraph). We also show a lower bound of $Ω(n/ε)$ bits for the space requirement of any data structure achieving this guarantee.

preprint2013arXiv

Low Rank Approximation and Regression in Input Sparsity Time

We design a new distribution over $\poly(r \eps^{-1}) \times n$ matrices $S$ so that for any fixed $n \times d$ matrix $A$ of rank $r$, with probability at least 9/10, $\norm{SAx}_2 = (1 \pm \eps)\norm{Ax}_2$ simultaneously for all $x \in \mathbb{R}^d$. Such a matrix $S$ is called a \emph{subspace embedding}. Furthermore, $SA$ can be computed in $\nnz(A) + \poly(d \eps^{-1})$ time, where $\nnz(A)$ is the number of non-zero entries of $A$. This improves over all previous subspace embeddings, which required at least $Ω(nd \log d)$ time to achieve this property. We call our matrices $S$ \emph{sparse embedding matrices}. Using our sparse embedding matrices, we obtain the fastest known algorithms for $(1+\eps)$-approximation for overconstrained least-squares regression, low-rank approximation, approximating all leverage scores, and $\ell_p$-regression. The leading order term in the time complexity of our algorithms is $O(\nnz(A))$ or $O(\nnz(A)\log n)$. We optimize the low-order $\poly(d/\eps)$ terms in our running times (or for rank-$k$ approximation, the $n*\poly(k/eps)$ term), and show various tradeoffs. For instance, we also use our methods to design new preconditioners that improve the dependence on $\eps$ in least squares regression to $\log 1/\eps$. Finally, we provide preliminary experimental results which suggest that our algorithms are competitive in practice.

preprint2013arXiv

The Round Complexity of Small Set Intersection

The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions from set disjointness are a canonical way of proving lower bounds in data stream algorithms, data structures, and distributed computation. In these applications, often the set sizes $|S|$ and $|T|$ are bounded by a value $k$ which is much smaller than $n$. This is referred to as small set disjointness. A major restriction in the above applications is the number of rounds that the protocol can make, which, e.g., translates to the number of passes in streaming applications. A fundamental question is thus in understanding the round complexity of the small set disjointness problem. For an essentially equivalent problem, called OR-Equality, Brody et. al showed that with $r$ rounds of communication, the randomized communication complexity is $Ω(k \ilog^r k)$, where$\ilog^r k$ denotes the $r$-th iterated logarithm function. Unfortunately their result requires the error probability of the protocol to be $1/k^{Θ(1)}$. Since naïve amplification of the success probability of a protocol from constant to $1-1/k^{Θ(1)}$ blows up the communication by a $Θ(\log k)$ factor, this destroys their improvements over the well-known lower bound of $Ω(k)$ which holds for any number of rounds. They pose it as an open question to achieve the same $Ω(k \ilog^r k)$ lower bound for protocols with constant error probability. We answer this open question by showing that the $r$-round randomized communication complexity of ${\sf OREQ}_{n,k}$, and thus also of small set disjointness, with {\it constant error probability} is $Ω(k \ilog^r k)$, asymptotically matching known upper bounds for ${\sf OREQ}_{n,k}$ and small set disjointness.

preprint2013arXiv

Tight Bounds for Distributed Functional Monitoring

We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008). In this model there are $k$ sites each tracking their input and communicating with a central coordinator that continuously maintain an approximate output to a function $f$ computed over the union of the inputs. The goal is to minimize the communication. We show the randomized communication complexity of estimating the number of distinct elements up to a $1+\eps$ factor is $\tildeΩ(k/\eps^2)$, improving the previous $Ω(k + 1/\eps^2)$ bound and matching known upper bounds up to a logarithmic factor. For the $p$-th frequency moment $F_p$, $p > 1$, we improve the previous $Ω(k + 1/\eps^2)$ communication bound to $\tildeΩ(k^{p-1}/\eps^2)$. We obtain similar improvements for heavy hitters, empirical entropy, and other problems. We also show that we can estimate $F_p$, for any $p > 1$, using $\tilde{O}(k^{p-1}\poly(\eps^{-1}))$ communication. This greatly improves upon the previous $\tilde{O}(k^{2p+1}N^{1-2/p} \poly(\eps^{-1}))$ bound of Cormode, Muthukrishnan, and Yi for general $p$, and their $\tilde{O}(k^2/\eps + k^{1.5}/\eps^3)$ bound for $p = 2$. For $p = 2$, our bound resolves their main open question. Our lower bounds are based on new direct sum theorems for approximate majority, and yield significant improvements to problems in the data stream model, improving the bound for estimating $F_p, p > 2,$ in $t$ passes from $\tildeΩ(n^{1-2/p}/(\eps^{2/p} t))$ to $\tildeΩ(n^{1-2/p}/(\eps^{4/p} t))$, giving the first bound for estimating $F_0$ in $t$ passes of $Ω(1/(\eps^2 t))$ bits of space that does not use the gap-hamming problem.

preprint2013arXiv

When Distributed Computation is Communication Expensive

We consider a number of fundamental statistical and graph problems in the message-passing model, where we have $k$ machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the $k$ data sets. The communication is point-to-point, and the goal is to minimize the total communication among the $k$ machines. This model captures all point-to-point distributed computational models with respect to minimizing communication costs. Our analysis shows that exact computation of many statistical and graph problems in this distributed setting requires a prohibitively large amount of communication, and often one cannot improve upon the communication of the simple protocol in which all machines send their data to a centralized server. Thus, in order to obtain protocols that are communication-efficient, one has to allow approximation, or investigate the distribution or layout of the data sets.

preprint2012arXiv

Fast approximation of matrix coherence and statistical leverage

The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.

preprint2012arXiv

How Robust are Linear Sketches to Adaptive Inputs?

Linear sketches are powerful algorithmic tools that turn an n-dimensional input into a concise lower-dimensional representation via a linear transformation. Such sketches have seen a wide range of applications including norm estimation over data streams, compressed sensing, and distributed computing. In almost any realistic setting, however, a linear sketch faces the possibility that its inputs are correlated with previous evaluations of the sketch. Known techniques no longer guarantee the correctness of the output in the presence of such correlations. We therefore ask: Are linear sketches inherently non-robust to adaptively chosen inputs? We give a strong affirmative answer to this question. Specifically, we show that no linear sketch approximates the Euclidean norm of its input to within an arbitrary multiplicative approximation factor on a polynomial number of adaptively chosen inputs. The result remains true even if the dimension of the sketch is d = n - o(n) and the sketch is given unbounded computation time. Our result is based on an algorithm with running time polynomial in d that adaptively finds a distribution over inputs on which the sketch is incorrect with constant probability. Our result implies several corollaries for related problems including lp-norm estimation and compressed sensing. Notably, we resolve an open problem in compressed sensing regarding the feasibility of l2/l2-recovery guarantees in the presence of computationally bounded adversaries.

preprint2012arXiv

Lower Bounds for Adaptive Sparse Recovery

We give lower bounds for the problem of stable sparse recovery from /adaptive/ linear measurements. In this problem, one would like to estimate a vector $x \in \R^n$ from $m$ linear measurements $A_1x,..., A_mx$. One may choose each vector $A_i$ based on $A_1x,..., A_{i-1}x$, and must output $x*$ satisfying |x* - x|_p \leq (1 + ε) \min_{k\text{-sparse} x'} |x - x'|_p with probability at least $1-δ>2/3$, for some $p \in \{1,2\}$. For $p=2$, it was recently shown that this is possible with $m = O(\frac{1}εk \log \log (n/k))$, while nonadaptively it requires $Θ(\frac{1}εk \log (n/k))$. It is also known that even adaptively, it takes $m = Ω(k/ε)$ for $p = 2$. For $p = 1$, there is a non-adaptive upper bound of $\tilde{O}(\frac{1}{\sqrtε} k\log n)$. We show: * For $p=2$, $m = Ω(\log \log n)$. This is tight for $k = O(1)$ and constant $ε$, and shows that the $\log \log n$ dependence is correct. * If the measurement vectors are chosen in $R$ "rounds", then $m = Ω(R \log^{1/R} n)$. For constant $ε$, this matches the previously known upper bound up to an O(1) factor in $R$. * For $p=1$, $m = Ω(k/(\sqrtε \cdot \log k/ε))$. This shows that adaptivity cannot improve more than logarithmic factors, providing the analog of the $m = Ω(k/ε)$ bound for $p = 2$.

preprint2012arXiv

On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation

We study classic streaming and sparse recovery problems using deterministic linear sketches, including l1/l1 and linf/l1 sparse recovery problems (the latter also being known as l1-heavy hitters), norm estimation, and approximate inner product. We focus on devising a fixed matrix A in R^{m x n} and a deterministic recovery/estimation procedure which work for all possible input vectors simultaneously. Our results improve upon existing work, the following being our main contributions: * A proof that linf/l1 sparse recovery and inner product estimation are equivalent, and that incoherent matrices can be used to solve both problems. Our upper bound for the number of measurements is m=O(eps^{-2}*min{log n, (log n / log(1/eps))^2}). We can also obtain fast sketching and recovery algorithms by making use of the Fast Johnson-Lindenstrauss transform. Both our running times and number of measurements improve upon previous work. We can also obtain better error guarantees than previous work in terms of a smaller tail of the input vector. * A new lower bound for the number of linear measurements required to solve l1/l1 sparse recovery. We show Omega(k/eps^2 + klog(n/k)/eps) measurements are required to recover an x' with |x - x'|_1 <= (1+eps)|x_{tail(k)}|_1, where x_{tail(k)} is x projected onto all but its largest k coordinates in magnitude. * A tight bound of m = Theta(eps^{-2}log(eps^2 n)) on the number of measurements required to solve deterministic norm estimation, i.e., to recover |x|_2 +/- eps|x|_1. For all the problems we study, tight bounds are already known for the randomized complexity from previous work, except in the case of l1/l1 sparse recovery, where a nearly tight bound is known. Our work thus aims to study the deterministic complexities of these problems.

preprint2011arXiv

(1+eps)-approximate Sparse Recovery

The problem central to sparse recovery and compressive sensing is that of stable sparse recovery: we want a distribution of matrices A in R^{m\times n} such that, for any x \in R^n and with probability at least 2/3 over A, there is an algorithm to recover x* from Ax with ||x* - x||_p <= C min_{k-sparse x'} ||x - x'||_p for some constant C > 1 and norm p. The measurement complexity of this problem is well understood for constant C > 1. However, in a variety of applications it is important to obtain C = 1 + eps for a small eps > 0, and this complexity is not well understood. We resolve the dependence on eps in the number of measurements required of a k-sparse recovery algorithm, up to polylogarithmic factors for the central cases of p = 1 and p = 2. Namely, we give new algorithms and lower bounds that show the number of measurements required is (1/eps^{p/2})k polylog(n). For p = 2, our bound of (1/eps) k log(n/k) is tight up to constant factors. We also give matching bounds when the output is required to be k-sparse, in which case we achieve (1/eps^p) k polylog(n). This shows the distinction between the complexity of sparse and non-sparse outputs is fundamental.

preprint2011arXiv

Lower Bounds for Sparse Recovery

We consider the following k-sparse recovery problem: design an m x n matrix A, such that for any signal x, given Ax we can efficiently recover x' satisfying ||x-x'||_1 <= C min_{k-sparse} x"} ||x-x"||_1. It is known that there exist matrices A with this property that have only O(k log (n/k)) rows. In this paper we show that this bound is tight. Our bound holds even for the more general /randomized/ version of the problem, where A is a random variable and the recovery algorithm is required to work for any fixed x with constant probability (over A).

preprint2011arXiv

On the Power of Adaptivity in Sparse Recovery

The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x*$ such that ||x-x*||_p <= C min_{k-sparse x'} ||x-x'||_q for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p=q=1$ or $p=q=2$, this task can be accomplished using $m=O(k \log (n/k))$ non-adaptive measurements [CRT06] and that this bound is tight [DIPW10,FPRU10,PW11]. In this paper we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably reduced. Specifically, for $C=1+eps$ and $p=q=2$ we show - A scheme with $m=O((1/eps)k log log (n eps/k))$ measurements that uses $O(log* k \log \log (n eps/k))$ rounds. This is a significant improvement over the best possible non-adaptive bound. - A scheme with $m=O((1/eps) k log (k/eps) + k \log (n/k))$ measurements that uses /two/ rounds. This improves over the best possible non-adaptive bound. To the best of our knowledge, these are the first results of this type. As an independent application, we show how to solve the problem of finding a duplicate in a data stream of $n$ items drawn from ${1, 2, ..., n-1}$ using $O(log n)$ bits of space and $O(log log n)$ passes, improving over the best possible space complexity achievable using a single pass.

preprint2010arXiv

Fast Moment Estimation in Data Streams in Optimal Space

We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Omega(1/eps^2).

preprint2010arXiv

Sublinear Optimization for Machine Learning

We give sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, such as SVDD, hard margin SVM, and L2-SVM, for which sublinear-time algorithms were not known before. These new algorithms use a combination of a novel sampling techniques and a new multiplicative update algorithm. We give lower bounds which show the running times of many of our algorithms to be nearly best possible in the unit-cost RAM model. We also give implementations of our algorithms in the semi-streaming setting, obtaining the first low pass polylogarithmic space and sublinear time algorithms achieving arbitrary approximation factor.

David P. Woodruff

What is connected

Connect this record

See the researcher in context

Building this map preview

68 published item(s)

Online Learning with Limited Information in the Sliding Window Model

Progress on the Courtade-Kumar Conjecture: Optimal High-Noise Entropy Bounds and Generalized Coordinate-wise Mutual Information

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

Sharper Bounds for $\ell_p$ Sensitivity Sampling

Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

Adaptive Sketches for Robust Regression with Importance Sampling

Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis

Fast Regression for Structured Inputs

Learning Augmented Binary Search Trees

Learning-Augmented $k$-means Clustering

Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time

Low-Rank Approximation with $1/ε^{1/3}$ Matrix-Vector Products

Memory Bounds for the Experts Problem

Noisy Boolean Hidden Matching with Applications

Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra

Separations for Estimating Large Frequency Moments on Data Streams

Sketching Algorithms and Lower Bounds for Ridge Regression

Streaming Algorithms with Large Approximation Factors

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

Triangle and Four Cycle Counting with Predictions in Graph Streams

A Framework for Adversarially Robust Streaming Algorithms

A PTAS for $\ell_p$-Low Rank Approximation

On Coresets for Logistic Regression

Subspace exploration: Bounds on Projected Frequency Estimation

Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

Low Rank Approximation with Entrywise $\ell_1$-Norm Error

LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling

Non-Adaptive Adaptive Sampling on Turnstile Streams

Span Recovery for Deep Neural Networks with Applications to Input Obfuscation

Streaming Complexity of SVMs

Towards a Zero-One Law for Column Subset Selection

Vector-Matrix-Vector Queries for Solving Linear Algebra, Statistics, and Graph Problems

Weighted Maximum Independent Set of Geometric Objects in Turnstile Streams

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

Distributed Low Rank Approximation of Implicit Functions of a Matrix

New Algorithms for Heavy Hitters in Data Streams

Optimal approximate matrix product in terms of stable rank

Optimal Principal Component Analysis in Distributed and Streaming Models

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

Applications of Uniform Sampling: Densest Subgraph and Beyond

Beating CountSketch for Heavy Hitters in Insertion Streams

Frequent Directions : Simple and Deterministic Matrix Sketching

Input Sparsity and Hardness for Robust Subspace Approximation

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

Sketching as a Tool for Numerical Linear Algebra

A Sketching Algorithm for Spectral Graph Sparsification

On The Communication Complexity of Linear Algebraic Problems in the Message Passing Model

Optimal CUR Matrix Decompositions

Subspace Embeddings and $\ell_p$-Regression Using Exponential Random Variables

The Fast Cauchy Transform and Faster Robust Linear Regression

The Sketching Complexity of Graph Cuts

Low Rank Approximation and Regression in Input Sparsity Time

The Round Complexity of Small Set Intersection

Tight Bounds for Distributed Functional Monitoring

When Distributed Computation is Communication Expensive

Fast approximation of matrix coherence and statistical leverage

How Robust are Linear Sketches to Adaptive Inputs?

Lower Bounds for Adaptive Sparse Recovery

On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation

(1+eps)-approximate Sparse Recovery

Lower Bounds for Sparse Recovery

On the Power of Adaptivity in Sparse Recovery

Fast Moment Estimation in Data Streams in Optimal Space

Sublinear Optimization for Machine Learning