Source author record

Piotr Indyk

Piotr Indyk appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Data Structures and Algorithms, Information Theory, math.IT, Computational Geometry, Machine Learning, Computational Complexity, math.FA, Computer Vision, Databases, Discrete Mathematics, Information Retrieval, math.MG, math.NA

Catalog footprint

What is connected

27 works

13 topics

4 close collaborators

preprint 2026 arXiv

Provable Quantization with Randomized Hadamard Transform

Vector quantization via random projection followed by scalar quantization is a fundamental primitive in machine learning, with applications ranging from similarity search to federated learning and KV cache compression. While dense random rotations yield clean theoretical guarantees, they require $Θ(d^2)$ time. The randomized Hadamard transform $HD$ reduces this cost to $O(d \log d)$, but its discrete structure complicates analysis and leads to weaker or purely empirical compression guarantees. In this work, we study a variant of this approach: dithered quantization with a single randomized Hadamard transform. Specifically, the quantizer applies $HD$ to the input vector and subtracts a random scalar offset before quantizing, injecting additional randomness at negligible cost. We prove that this approach is unbiased and provides mean squared error bounds that asymptotically match those achievable with truly random rotation matrices. In particular, we prove that a dithered version of TurboQuant achieves mean squared error $\bigl(π\sqrt{3}/2 + o(1)\bigr) \cdot 4^{-b}$ at $b$ bits per coordinate, where the $o(1)$ term vanishes uniformly over all unit vectors and all dimensions as the number of quantization levels grows.
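The pipeline described in this abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's construction: the function names (`fwht`, `quantize`, `dequantize`), the uniform grid with spacing `step`, and the choice to share the scalar offset with the decoder and add it back are all hypothetical, and no clipping to $2^b$ levels is performed.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform in O(d log d); d must be a power of two."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly step: combine the two halves of every block of size 2h.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in a]


def quantize(x, step, rng):
    """Apply HD, subtract one random scalar offset, round to a uniform grid."""
    signs = [rng.choice((-1.0, 1.0)) for _ in x]   # diagonal of D
    y = fwht([s * v for s, v in zip(signs, x)])    # y = H D x
    offset = rng.uniform(-step / 2, step / 2)      # single scalar dither (assumed shared)
    codes = [round((v - offset) / step) for v in y]
    return codes, signs, offset


def dequantize(codes, signs, offset, step):
    """Undo the grid and the rotation; (HD)^{-1} = D H for the orthonormal H."""
    y_hat = [c * step + offset for c in codes]
    return [s * v for s, v in zip(signs, fwht(y_hat))]
```

Because the orthonormal Hadamard matrix is its own inverse, the reconstruction error in this sketch equals the rounding error in the rotated domain, so its Euclidean norm is at most `step / 2 * sqrt(d)`.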

preprint 2023 arXiv

Dimension-Accuracy Tradeoffs in Contrastive Embeddings for Triplets, Terminals & Top-k Nearest Neighbors

Metric embeddings traditionally study how to map $n$ items to a target metric space such that distance lengths are not heavily distorted; but what if we only care to preserve the relative order of the distances (and not their length)? In this paper, we are motivated by the following basic question: given triplet comparisons of the form "item $i$ is closer to item $j$ than to item $k$," can we find low-dimensional Euclidean representations for the $n$ items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications and have been studied since the 1950s, under the name of ordinal or non-metric embeddings. Our main results are: 1. Nearly-Tight Bounds on Triplet Dimension: We introduce the natural concept of triplet dimension of a dataset, and surprisingly, we show that in order for an ordinal embedding to be triplet-preserving, its dimension needs to grow as $\frac n2$ in the worst case. This is optimal (up to constant) as $n-1$ dimensions always suffice. 2. Tradeoffs for Dimension vs (Ordinal) Relaxation: We then relax the requirement that every triplet should be exactly preserved and present almost tight lower bounds for the maximum ratio between distances whose relative order was inverted by the embedding; this ratio is known as (ordinal) relaxation in the literature and serves as a counterpart to (metric) distortion. 3. New Bounds on Terminal and Top-$k$-NNs Embeddings: Going beyond triplets, we then study two well-motivated scenarios where we care about preserving specific sets of distances (not necessarily triplets). The first scenario is Terminal Ordinal Embeddings and the second scenario is top-$k$-NNs Ordinal Embeddings. To the best of our knowledge, these are some of the first tradeoffs on triplet-preserving ordinal embeddings and the first study of Terminal and Top-$k$-NNs Ordinal Embeddings.
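To make the notion of a triplet-preserving embedding concrete, here is a minimal checker. The function name `triplet_violations` and the convention that a triplet `(i, j, k)` means "item i is closer to item j than to item k" are assumptions made for this illustration; ties are counted as violations.

```python
import math


def triplet_violations(points, triplets):
    """Count triplets (i, j, k) whose constraint d(i, j) < d(i, k) the embedding breaks."""
    violated = 0
    for i, j, k in triplets:
        # Euclidean distances in the embedding space.
        if math.dist(points[i], points[j]) >= math.dist(points[i], points[k]):
            violated += 1
    return violated
```

An embedding is triplet-preserving for a given set of comparisons exactly when this count is zero.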

preprint 2022 arXiv

Generalization Bounds for Data-Driven Numerical Linear Algebra

Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main results are closely matching upper and lower bounds on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al. (NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.

preprint 2022 arXiv

Streaming Algorithms for Support-Aware Histograms

Histograms, i.e., piece-wise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the $L_p$ norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, the drawback of this error measure is that it treats approximation errors of all items in the same way, irrespective of whether the mass of an item is important for the downstream application that uses the approximation. As a result, even relatively simple distributions cannot be approximated by succinct histograms without incurring large error. In this paper, we address this issue by adapting the definition of approximation so that only the errors of the items that belong to the support of the distribution are considered. Under this definition, we develop efficient 1-pass and 2-pass streaming algorithms that compute near-optimal histograms in sub-linear space. We also present lower bounds on the space complexity of this problem. Surprisingly, under this notion of error, there is an exponential gap in the space complexity of 1-pass and 2-pass streaming algorithms. Finally, we demonstrate the utility of our algorithms on a collection of real and synthetic data sets.

preprint 2022 arXiv

Targeted Supervised Contrastive Learning for Long-Tailed Recognition

Real-world data often exhibits long tail distributions with heavy class imbalance, where the majority classes can dominate the training process and alter the decision boundaries of the minority classes. Recently, researchers have investigated the potential of supervised contrastive learning for long-tailed recognition, and demonstrated that it provides a strong performance gain. In this paper, we show that while supervised contrastive learning can help improve performance, past baselines suffer from poor uniformity brought in by imbalanced data distribution. This poor uniformity manifests in samples from the minority class having poor separability in the feature space. To address this problem, we propose targeted supervised contrastive learning (TSC), which improves the uniformity of the feature distribution on the hypersphere. TSC first generates a set of targets uniformly distributed on a hypersphere. It then makes the features of different classes converge to these distinct and uniformly distributed targets during training. This forces all classes, including minority classes, to maintain a uniform distribution in the feature space, improves class boundaries, and provides better generalization even in the presence of long-tail data. Experiments on multiple datasets show that TSC achieves state-of-the-art performance on long-tailed recognition tasks.

preprint 2022 arXiv

Triangle and Four Cycle Counting with Predictions in Graph Streams

We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.

preprint 2020 arXiv

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms for this problem, Count-Min and Count-Sketch, are widely used in practice. In a recent work [Hsu et al., ICLR'19], it was shown empirically that augmenting Count-Min and Count-Sketch with a machine learning algorithm leads to a significant reduction of the estimation error. The experiments were complemented with an analysis of the expected error incurred by Count-Min (both the standard and the augmented version) when the input frequencies follow a Zipfian distribution. Although the authors established that the learned version of Count-Min has lower estimation error than its standard counterpart, their analysis of the standard Count-Min algorithm was not tight. Moreover, they provided no similar analysis for Count-Sketch. In this paper we resolve these problems. First, we provide a simple tight analysis of the expected error incurred by Count-Min. Second, we provide the first error bounds for both the standard and the augmented version of Count-Sketch. These bounds are nearly tight and again demonstrate an improved performance of the learned version of Count-Sketch. In addition to demonstrating tight gaps between the aforementioned algorithms, we believe that our bounds for the standard versions of Count-Min and Count-Sketch are of independent interest. In particular, it is a typical practice to set the number of hash functions in those algorithms to $Θ(\log n)$. In contrast, our results show that to minimize the expected error, the number of hash functions should be a constant, strictly greater than $1$.
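For readers unfamiliar with Count-Min, the standard (non-learned) sketch looks roughly as follows. This is a minimal illustration, not the construction analyzed in the paper: the class name `CountMin`, its parameters, and the use of Python's built-in `hash` on a salted tuple as a stand-in for proper pairwise-independent hash functions are assumptions. The `depth` parameter is the number of hash functions discussed at the end of the abstract.

```python
import random


class CountMin:
    """Count-Min sketch with `depth` rows (hash functions) of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row; hash((salt, item)) stands in for a hash family.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, salt, item):
        return hash((salt, item)) % self.width

    def add(self, item, count=1):
        for salt, row in zip(self.salts, self.rows):
            row[self._bucket(salt, item)] += count

    def estimate(self, item):
        # Collisions only add mass, so the minimum over rows never underestimates.
        return min(row[self._bucket(salt, item)]
                   for salt, row in zip(self.salts, self.rows))
```

With non-negative updates the estimate is always at least the true frequency; varying `depth` at a fixed total number of counters is one way to experiment with the trade-off the abstract describes.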

preprint 2020 arXiv

Online Page Migration with ML Advice

We consider online algorithms for the page migration problem that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$. Specifically, the competitive ratio is equal to $1+O(q)$, where $q$ is the prediction error rate. We also design a "fallback option" that ensures that the competitive ratio of the algorithm for any input sequence is at most $O(1/q)$. Our result adds to the recent body of work that uses machine learning to improve the performance of "classic" algorithms.

preprint 2016 arXiv

Near-Optimal (Euclidean) Metric Compression

The metric sketching problem is defined as follows. Given a metric on $n$ points, and $ε>0$, we wish to produce a small size data structure (sketch) that, given any pair of point indices, recovers the distance between the points up to a $1+ε$ distortion. In this paper we consider metrics induced by $\ell_2$ and $\ell_1$ norms whose spread (the ratio of the diameter to the closest pair distance) is bounded by $Φ>0$. A well-known dimensionality reduction theorem due to Johnson and Lindenstrauss yields a sketch of size $O(ε^{-2} \log (Φn) n\log n)$, i.e., $O(ε^{-2} \log (Φn) \log n)$ bits per point. We show that this bound is not optimal, and can be substantially improved to $O(ε^{-2}\log(1/ε) \cdot \log n + \log\log Φ)$ bits per point. Furthermore, we show that our bound is tight up to a factor of $\log(1/ε)$. We also consider sketching of general metrics and provide a sketch of size $O(n\log(1/ε)+ \log\log Φ)$ bits per point, which we show is optimal.

preprint 2016 arXiv

Simultaneous Nearest Neighbor Search

Motivated by applications in computer vision and databases, we introduce and study the Simultaneous Nearest Neighbor Search (SNN) problem. Given a set of data points, the goal of SNN is to design a data structure that, given a collection of queries, finds a collection of close points that are compatible with each other. Formally, we are given $k$ query points $Q=q_1,\cdots,q_k$, and a compatibility graph $G$ with vertices in $Q$, and the goal is to return data points $p_1,\cdots,p_k$ that minimize (i) the weighted sum of the distances from $q_i$ to $p_i$ and (ii) the weighted sum, over all edges $(i,j)$ in the compatibility graph $G$, of the distances between $p_i$ and $p_j$. The problem has several applications, where one wants to return a set of consistent answers to multiple related queries. This generalizes well-studied computational problems, including NN, Aggregate NN and the 0-extension problem. In this paper we propose and analyze the following general two-step method for designing efficient data structures for SNN. In the first step, for each query point $q_i$ we find its (approximate) nearest neighbor point $\hat{p}_i$; this can be done efficiently using existing approximate nearest neighbor structures. In the second step, we solve an off-line optimization problem over sets $q_1,\cdots,q_k$ and $\hat{p}_1,\cdots,\hat{p}_k$; this can be done efficiently given that $k$ is much smaller than $n$. Even though $\hat{p}_1,\cdots,\hat{p}_k$ might not constitute the optimal answers to queries $q_1,\cdots,q_k$, we show that, for the unweighted case, the resulting algorithm is $O(\log k/\log \log k)$-approximation. Also, we show that the approximation factor can be in fact reduced to a constant for compatibility graphs frequently occurring in practice. Finally, we show that the "empirical approximation factor" provided by the above approach is very close to 1.

preprint 2016 arXiv

Towards Tight Bounds for the Streaming Set Cover Problem

We consider the classic Set Cover problem in the data stream model. For $n$ elements and $m$ sets ($m\geq n$) we give an $O(1/δ)$-pass algorithm with a strongly sub-linear $\tilde{O}(mn^δ)$ space and logarithmic approximation factor. This yields a significant improvement over the earlier algorithm of Demaine et al. [DIMV14] that uses an exponentially larger number of passes. We complement this result by showing that the tradeoff between the number of passes and space exhibited by our algorithm is tight, at least when the approximation factor is equal to $1$. Specifically, we show that any algorithm that computes set cover exactly using $({1 \over 2δ}-1)$ passes must use $\tilde{Ω}(mn^δ)$ space in the regime of $m=O(n)$. Furthermore, we consider the problem in the geometric setting where the elements are points in $\mathbb{R}^2$ and sets are either discs, axis-parallel rectangles, or fat triangles in the plane, and show that our algorithm (with a slight modification) uses the optimal $\tilde{O}(n)$ space to find a logarithmic approximation in $O(1/δ)$ passes. Finally, we show that any randomized one-pass algorithm that distinguishes between covers of size 2 and 3 must use a linear (i.e., $Ω(mn)$) amount of space. This is the first result showing that a randomized, approximate algorithm cannot achieve a space bound that is sublinear in the input size. This indicates that using multiple passes might be necessary in order to achieve sub-linear space bounds for this problem while guaranteeing small approximation factors.

preprint 2016 arXiv

Which Regular Expression Patterns are Hard to Match?

Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, word break problem etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its depth (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., $[a|ab|bc]^*$. This corresponds to the so-called word break problem, for which a dynamic programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We show that the latter bound is not tight and improve the runtime to $O(nm^{0.44\ldots})$.
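The word break problem mentioned at the end of this abstract can be illustrated with the textbook dynamic program below. This is the simple baseline, which tries every dictionary word at every position; it is neither the $O(n\sqrt{m})$ algorithm nor the improved one from the paper, and the function name `word_break` is chosen here for illustration.

```python
def word_break(text, words):
    """Return True if `text` is a concatenation of zero or more strings from `words`."""
    # reachable[i] is True when the prefix text[:i] can be split into dictionary words.
    reachable = [False] * (len(text) + 1)
    reachable[0] = True
    for end in range(1, len(text) + 1):
        for w in words:
            start = end - len(w)
            if start >= 0 and reachable[start] and text.startswith(w, start):
                reachable[end] = True
                break
    return reachable[-1]
```

For the example expression $[a|ab|bc]^*$, calling `word_break("abbc", ["a", "ab", "bc"])` tests membership of the string "abbc", which splits as "ab" followed by "bc".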

preprint 2015 arXiv

Approximation Algorithms for Model-Based Compressive Sensing

Compressive Sensing (CS) stipulates that a sparse signal can be recovered from a small number of linear measurements, and that this recovery can be performed efficiently in polynomial time. The framework of model-based compressive sensing (model-CS) leverages additional structure in the signal and prescribes new recovery schemes that can reduce the number of measurements even further. However, model-CS requires an algorithm that solves the model-projection problem: given a query signal, produce the signal in the model that is also closest to the query signal. Often, this optimization can be computationally very expensive. Moreover, an approximation algorithm is not sufficient for this optimization task. As a result, the model-projection problem poses a fundamental obstacle for extending model-CS to many interesting models. In this paper, we introduce a new framework that we call approximation-tolerant model-based compressive sensing. This framework includes a range of algorithms for sparse recovery that require only approximate solutions for the model-projection problem. In essence, our work removes the aforementioned obstacle to model-based compressive sensing, thereby extending the applicability of model-CS to a much wider class of models. We instantiate this new framework for the Constrained Earth Mover Distance (CEMD) model, which is particularly useful for signal ensembles where the positions of the nonzero coefficients do not change significantly as a function of spatial (or temporal) location. We develop novel approximation algorithms for both the maximization and the minimization versions of the model-projection problem via graph optimization techniques. Leveraging these algorithms into our framework results in a nearly sample-optimal sparse recovery scheme for the CEMD model.

preprint 2015 arXiv

Nearly Optimal Deterministic Algorithm for Sparse Walsh-Hadamard Transform

For every fixed constant $α> 0$, we design an algorithm for computing the $k$-sparse Walsh-Hadamard transform of an $N$-dimensional vector $x \in \mathbb{R}^N$ in time $k^{1+α} (\log N)^{O(1)}$. Specifically, the algorithm is given query access to $x$ and computes a $k$-sparse $\tilde{x} \in \mathbb{R}^N$ satisfying $\|\tilde{x} - \hat{x}\|_1 \leq c \|\hat{x} - H_k(\hat{x})\|_1$, for an absolute constant $c > 0$, where $\hat{x}$ is the transform of $x$ and $H_k(\hat{x})$ is its best $k$-sparse approximation. Our algorithm is fully deterministic and only uses non-adaptive queries to $x$ (i.e., all queries are determined and performed in parallel when the algorithm starts). An important technical tool that we use is a construction of nearly optimal and linear lossless condensers which is a careful instantiation of the GUV condenser (Guruswami, Umans, Vadhan, JACM 2009). Moreover, we design a deterministic and non-adaptive $\ell_1/\ell_1$ compressed sensing scheme based on general lossless condensers that is equipped with a fast reconstruction algorithm running in time $k^{1+α} (\log N)^{O(1)}$ (for the GUV-based condenser) and is of independent interest. Our scheme significantly simplifies and improves an earlier expander-based construction due to Berinde, Gilbert, Indyk, Karloff, Strauss (Allerton 2008). Our methods use linear lossless condensers in a black box fashion; therefore, any future improvement on explicit constructions of such condensers would immediately translate to improved parameters in our framework (potentially leading to $k (\log N)^{O(1)}$ reconstruction time with a reduced exponent in the poly-logarithmic factor, and eliminating the extra parameter $α$). Finally, by allowing the algorithm to use randomness, while still using non-adaptive queries, the running time of the algorithm can be improved to $\tilde{O}(k \log^3 N)$.

preprint 2015 arXiv

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation factor $K>0$, and probability of failure $P>0$, we ask: what is the minimal value of $m$ so that there is a distribution over $m \times n$ matrices $A$ with the property that for any $x$, given $Ax$, we can recover a $k$-sparse approximation to $x$ in the given norm with probability at least $1-P$? We give a partial answer to this problem, by showing that for norms that admit efficient linear sketches, the optimal number of measurements $m$ is closely related to the doubling dimension of the metric induced by the norm $\|\cdot\|$ on the set of all $k$-sparse vectors. By applying our result to specific norms, we cast known measurement bounds in our general framework (for the $\ell_p$ norms, $p \in [1,2]$) as well as provide new, measurement-efficient schemes (for the Earth-Mover Distance norm). The latter result directly implies more succinct linear sketches for the well-studied planar $k$-median clustering problem. Finally, our lower bound for the doubling dimension of the EMD norm enables us to address the open question of [Frahling-Sohler, STOC'05] about the space complexity of clustering problems in the dynamic streaming model.

preprint 2015 arXiv

Practical and Optimal LSH for Angular Distance

We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.
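As a point of reference, the hyperplane LSH baseline [Charikar, 2002] that this abstract compares against can be sketched as follows. This is the baseline only, not the new family introduced in the paper; the factory name `make_hyperplane_hash` and its parameters are hypothetical.

```python
import random


def make_hyperplane_hash(dim, bits, seed=0):
    """Return a hash function mapping a vector to `bits` sign bits of random projections."""
    rng = random.Random(seed)
    # One random Gaussian normal vector per output bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def hash_vector(x):
        # Each bit records which side of a random hyperplane through the origin x lies on.
        return tuple(1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0
                     for plane in planes)

    return hash_vector
```

Each bit depends only on the direction of the input, so vectors separated by a small angle agree on most bits, while a vector and its negation disagree on every bit (barring exact zeros in the projections).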

preprint 2014 arXiv

On Model-Based RIP-1 Matrices

The Restricted Isometry Property (RIP) is a fundamental property of a matrix enabling sparse recovery. Informally, an m x n matrix satisfies RIP of order k in the l_p norm if ||Ax||_p \approx ||x||_p for any vector x that is k-sparse, i.e., that has at most k non-zeros. The minimal number of rows m necessary for the property to hold has been extensively investigated, and tight bounds are known. Motivated by signal processing models, a recent work of Baraniuk et al has generalized this notion to the case where the support of x must belong to a given model, i.e., a given family of supports. This more general notion is much less understood, especially for norms other than l_2. In this paper we present tight bounds for the model-based RIP property in the l_1 norm. Our bounds hold for the two most frequently investigated models: tree-sparsity and block-sparsity. We also show implications of our results to sparse recovery problems.

preprint 2014 arXiv

Rapid Sampling for Visualizations with Ordering Guarantees

Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our techniques will also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.

preprint 2014 arXiv

Sample-Optimal Fourier Sampling in Any Constant Dimension -- Part I

We give an algorithm for $\ell_2/\ell_2$ sparse recovery from Fourier measurements using $O(k\log N)$ samples, matching the lower bound of [DIPW] for non-adaptive algorithms up to constant factors for any $k\leq N^{1-δ}$. The algorithm runs in $\tilde O(N)$ time. Our algorithm extends to higher dimensions, leading to sample complexity of $O_d(k\log N)$, which is optimal up to constant factors for any $d=O(1)$. These are the first sample optimal algorithms for these problems. A preliminary experimental evaluation indicates that our algorithm has empirical sampling complexity comparable to that of other recovery methods known in the literature, while providing strong provable guarantees on the recovery quality.

preprint 2013 arXiv

Beyond Locality-Sensitive Hashing

We present a new data structure for the c-approximate near neighbor problem (ANN) in the Euclidean space. For n points in R^d, our algorithm achieves O(n^ρ + d log n) query time and O(n^{1 + ρ} + d log n) space, where ρ<= 7/(8c^2) + O(1 / c^3) + o(1). This is the first improvement over the result by Andoni and Indyk (FOCS 2006) and the first data structure that bypasses a locality-sensitive hashing lower bound proved by O'Donnell, Wu and Zhou (ICS 2011). By a standard reduction we obtain a data structure for the Hamming space and \ell_1 norm with ρ<= 7/(8c) + O(1/c^{3/2}) + o(1), which is the first improvement over the result of Indyk and Motwani (STOC 1998).
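The size of the improvement can be seen by comparing leading terms of the exponent ρ. The sketch below ignores all lower-order terms, so it is only indicative. The baseline values $1/c^2$ for Andoni and Indyk (FOCS 2006) and $1/c$ for Indyk and Motwani (STOC 1998) are not stated in this abstract and are supplied here as the commonly cited leading terms; the function names are illustrative.

```python
def baseline_exponent_euclidean(c):
    """Leading term 1/c^2, assumed here for the Andoni-Indyk (FOCS 2006) bound."""
    return 1.0 / (c * c)


def new_exponent_euclidean(c):
    """Leading term 7/(8c^2) from the abstract, with O(1/c^3) + o(1) dropped."""
    return 7.0 / (8.0 * c * c)


def baseline_exponent_hamming(c):
    """Leading term 1/c, assumed here for the Indyk-Motwani (STOC 1998) bound."""
    return 1.0 / c


def new_exponent_hamming(c):
    """Leading term 7/(8c) from the abstract, with O(1/c^{3/2}) + o(1) dropped."""
    return 7.0 / (8.0 * c)
```

For approximation factor $c = 2$ in Euclidean space, the leading terms are $7/32 \approx 0.219$ versus $1/4$, so the query time exponent shrinks by a factor of $7/8$ before lower-order terms are taken into account.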

preprint 2013 arXiv

Sample-Optimal Average-Case Sparse Fourier Transform in Two Dimensions

We present the first sample-optimal sublinear time algorithms for the sparse Discrete Fourier Transform over a two-dimensional sqrt{n} x sqrt{n} grid. Our algorithms are analyzed for average-case signals. For signals whose spectrum is exactly sparse, our algorithms use O(k) samples and run in O(k log k) time, where k is the expected sparsity of the signal. For signals whose spectrum is approximately sparse, our algorithm uses O(k log n) samples and runs in O(k log^2 n) time; the latter algorithm works for k=Theta(sqrt{n}). The number of samples used by our algorithms matches the known lower bounds for the respective signal models. By a known reduction, our algorithms give similar results for the one-dimensional sparse Discrete Fourier Transform when n is a power of a small composite number (e.g., n = 6^t).

preprint 2012 arXiv

Compressive Sensing with Local Geometric Features

We propose a framework for compressive sensing of images with local distinguishable objects, such as stars, and apply it to solve a problem in celestial navigation. Specifically, let x be an N-pixel real-valued image, consisting of a small number of local distinguishable objects plus noise. Our goal is to design an m-by-N measurement matrix A with m << N, such that we can recover an approximation to x from the measurements Ax. We construct a matrix A and recovery algorithm with the following properties: (i) if there are k objects, the number of measurements m is O((k log N)/(log k)), undercutting the best known bound of O(k log(N/k)) (ii) the matrix A is very sparse, which is important for hardware implementations of compressive sensing algorithms, and (iii) the recovery algorithm is empirically fast and runs in time polynomial in k and log(N). We also present a comprehensive study of the application of our algorithm to attitude determination, or finding one's orientation in space. Spacecraft typically use cameras to acquire an image of the sky, and then identify stars in the image to compute their orientation. Taking pictures is very expensive for small spacecraft, since camera sensors use a lot of power. Our algorithm optically compresses the image before it reaches the camera's array of pixels, reducing the number of sensors that are required.

preprint 2012 arXiv

K-Median Clustering, Model-Based Compressive Sensing, and Sparse Recovery for Earth Mover Distance

We initiate the study of sparse recovery problems under the Earth-Mover Distance (EMD). Specifically, we design a distribution over m x n matrices A such that for any x, given Ax, we can recover a k-sparse approximation to x under the EMD distance. One construction yields m = O(k log(n/k)) and a 1 + epsilon approximation factor, which matches the best achievable bound for other error measures, such as the L_1 norm. Our algorithms are obtained by exploiting novel connections to other problems and areas, such as streaming algorithms for k-median clustering and model-based compressive sensing. We also provide novel algorithms and results for the latter problems.

preprint2012arXiv

Nearly Optimal Sparse Fourier Transform

We consider the problem of computing the k-sparse approximation to the discrete Fourier transform of an n-dimensional signal. We show:
* An O(k log n)-time randomized algorithm for the case where the input signal has at most k non-zero Fourier coefficients, and
* An O(k log n log(n/k))-time randomized algorithm for general input signals.
Both algorithms achieve o(n log n) time, and thus improve over the Fast Fourier Transform, for any k = o(n). They are the first known algorithms that satisfy this property. Also, if one assumes that the Fast Fourier Transform is optimal, the algorithm for the exactly k-sparse case is optimal for any k = n^{Ω(1)}. We complement our algorithmic results by showing that any algorithm for computing the sparse Fourier transform of a general signal must use at least Ω(k log(n/k)/ log log n) signal samples, even if it is allowed to perform adaptive sampling.
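
To make the target of the problem concrete, here is a minimal baseline sketch. It is not the paper's algorithm: it computes the full transform with a naive O(n^2) DFT and then keeps the k largest coefficients, which is exactly the cost the sublinear algorithms avoid. The function names `dft` and `k_sparse_dft` and the test signal are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (unnormalized)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def k_sparse_dft(x, k):
    """Return {frequency: coefficient} for the k largest-magnitude coefficients."""
    spectrum = dft(x)
    top = sorted(range(len(spectrum)), key=lambda f: abs(spectrum[f]), reverse=True)[:k]
    return {f: spectrum[f] for f in top}


# A signal of length 8 with exactly two non-zero Fourier coefficients,
# at frequencies 3 and 5.
n = 8
signal = [
    cmath.exp(2j * cmath.pi * 3 * t / n) + 0.5 * cmath.exp(2j * cmath.pi * 5 * t / n)
    for t in range(n)
]
approx = k_sparse_dft(signal, 2)
print(sorted(approx))
```

For this exactly 2-sparse input, the two retained coefficients carry all of the signal's energy, which is the situation addressed by the first of the two algorithms in the abstract.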

preprint2011arXiv

Lower Bounds for Sparse Recovery

We consider the following k-sparse recovery problem: design an m x n matrix A, such that for any signal x, given Ax we can efficiently recover x' satisfying ||x-x'||_1 <= C min_{k-sparse x"} ||x-x"||_1. It is known that there exist matrices A with this property that have only O(k log (n/k)) rows. In this paper we show that this bound is tight. Our bound holds even for the more general /randomized/ version of the problem, where A is a random variable and the recovery algorithm is required to work for any fixed x with constant probability (over A).
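
The recovery guarantee above can be checked directly for a candidate output. The sketch below is only an illustration of the inequality, not a recovery algorithm and not part of the paper: it uses the fact that the best k-sparse approximation in the L_1 norm keeps the k largest-magnitude entries of x. The names `best_k_sparse`, `l1_dist` and `satisfies_guarantee` are hypothetical.

```python
def best_k_sparse(x, k):
    """Closest k-sparse vector to x in the l1 norm: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def l1_dist(u, v):
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def satisfies_guarantee(x, x_rec, k, C):
    """Check ||x - x_rec||_1 <= C * min over k-sparse x'' of ||x - x''||_1."""
    tail = l1_dist(x, best_k_sparse(x, k))
    return l1_dist(x, x_rec) <= C * tail


x = [5.0, -3.0, 0.5, 0.25, 0.0]
print(satisfies_guarantee(x, [5.0, -3.0, 0.0, 0.0, 0.0], k=2, C=2))  # head recovered exactly
print(satisfies_guarantee(x, [0.0] * 5, k=2, C=2))                   # all-zero output
```

Here the tail error of x for k = 2 is 0.75, so an output that recovers the two large entries passes, while the all-zero output, with error 8.75, does not.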

preprint2011arXiv

On the Power of Adaptivity in Sparse Recovery

The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x*$ such that ||x-x*||_p <= C min_{k-sparse x'} ||x-x'||_q for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p=q=1$ or $p=q=2$, this task can be accomplished using $m=O(k \log (n/k))$ non-adaptive measurements [CRT06] and that this bound is tight [DIPW10,FPRU10,PW11]. In this paper we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably reduced. Specifically, for $C=1+eps$ and $p=q=2$ we show
- A scheme with $m=O((1/eps)k log log (n eps/k))$ measurements that uses $O(log* k \log \log (n eps/k))$ rounds. This is a significant improvement over the best possible non-adaptive bound.
- A scheme with $m=O((1/eps) k log (k/eps) + k \log (n/k))$ measurements that uses /two/ rounds. This improves over the best possible non-adaptive bound.
To the best of our knowledge, these are the first results of this type. As an independent application, we show how to solve the problem of finding a duplicate in a data stream of $n$ items drawn from $\{1, 2, ..., n-1\}$ using $O(log n)$ bits of space and $O(log log n)$ passes, improving over the best possible space complexity achievable using a single pass.

preprint2010arXiv

Almost-Euclidean subspaces of $\ell_1^N$ via tensor products: a simple approach to randomness reduction

It has been known since the 1970s that the N-dimensional $\ell_1$-space contains nearly Euclidean subspaces whose dimension is $Ω(N)$. However, proofs of existence of such subspaces were probabilistic, hence non-constructive, which made the results not quite suitable for subsequently discovered applications to high-dimensional nearest neighbor search, error-correcting codes over the reals, compressive sensing and other computational problems. In this paper we present a "low-tech" scheme which, for any $a > 0$, allows one to exhibit nearly Euclidean $Ω(N)$-dimensional subspaces of $\ell_1^N$ while using only $N^a$ random bits. Our results extend and complement (particularly) recent work by Guruswami-Lee-Wigderson. Characteristic features of our approach include (1) simplicity (we use only tensor products) and (2) yielding "almost Euclidean" subspaces with arbitrarily small distortions.

27 published items

Provable Quantization with Randomized Hadamard Transform

Dimension-Accuracy Tradeoffs in Contrastive Embeddings for Triplets, Terminals & Top-k Nearest Neighbors

Generalization Bounds for Data-Driven Numerical Linear Algebra

Streaming Algorithms for Support-Aware Histograms

Targeted Supervised Contrastive Learning for Long-Tailed Recognition

Triangle and Four Cycle Counting with Predictions in Graph Streams

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

Online Page Migration with ML Advice

Near-Optimal (Euclidean) Metric Compression

Simultaneous Nearest Neighbor Search

Towards Tight Bounds for the Streaming Set Cover Problem

Which Regular Expression Patterns are Hard to Match?

Approximation Algorithms for Model-Based Compressive Sensing

Nearly Optimal Deterministic Algorithm for Sparse Walsh-Hadamard Transform

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

Practical and Optimal LSH for Angular Distance

On Model-Based RIP-1 Matrices

Rapid Sampling for Visualizations with Ordering Guarantees

Sample-Optimal Fourier Sampling in Any Constant Dimension -- Part I

Beyond Locality-Sensitive Hashing

Sample-Optimal Average-Case Sparse Fourier Transform in Two Dimensions

Compressive Sensing with Local Geometric Features

K-Median Clustering, Model-Based Compressive Sensing, and Sparse Recovery for Earth Mover Distance

Nearly Optimal Sparse Fourier Transform

Lower Bounds for Sparse Recovery

On the Power of Adaptivity in Sparse Recovery

Almost-Euclidean subspaces of $\ell_1^N$ via tensor products: a simple approach to randomness reduction