Source author record

Atri Rudra

Atri Rudra appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computational Complexity; Data Structures and Algorithms; Databases; Information Theory; math.IT; Machine Learning; Discrete Mathematics; Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture; Computational Engineering, Finance, and Science; Genomics; math.CO; Quantitative Methods

Catalog footprint

What is connected

31 works

13 topics

4 close collaborators

preprint 2022 arXiv

Computing expected multiplicities for bag-TIDBs with bounded multiplicities

In this work, we study the problem of computing a tuple's expected multiplicity over probabilistic databases with bag semantics (where each tuple is associated with a multiplicity) exactly and approximately. We consider bag-TIDBs where we have a bound $c$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as c-TIDBs). We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms -- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases. Unfortunately, our results imply that computing expected multiplicities for c-TIDBs based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures). We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($RA^+$) over c-TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop a sampling algorithm that computes a $(1 \pm ε)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any $RA^+$ query.
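
To make the quantity in this abstract concrete, here is a minimal sketch of an expected multiplicity for a tiny, hypothetical c-TIDB with $c = 2$. It assumes the output tuple's multiplicity is already available as a polynomial in sum-of-monomials form over independent tuple variables; the names `POLY` and `DIST` and all probabilities are invented for illustration. It is not the paper's sampling algorithm and says nothing about the fine-grained cost the abstract discusses: it only checks that linearity of expectation plus independence agrees with brute-force enumeration of possible worlds.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Hypothetical lineage polynomial X*Y + X^2*Z, one dict per monomial
# mapping a tuple variable to its exponent.
POLY = [{"X": 1, "Y": 1}, {"X": 2, "Z": 1}]

# Hypothetical c-TIDB with c = 2: DIST[v][m] = Pr[tuple v has multiplicity m].
DIST = {
    "X": [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
    "Y": [Fraction(1, 3), Fraction(2, 3), Fraction(0)],
    "Z": [Fraction(1, 2), Fraction(0), Fraction(1, 2)],
}


def moment(var, d):
    """E[var^d] for a single tuple variable."""
    return sum(p * m**d for m, p in enumerate(DIST[var]))


def expected_by_linearity(poly):
    """Sum over monomials of the product of per-variable moments.

    Uses linearity of expectation across monomials and independence
    of distinct tuple variables within a monomial.
    """
    return sum(
        (prod(moment(v, d) for v, d in mono.items()) for mono in poly),
        Fraction(0),
    )


def expected_by_enumeration(poly):
    """Brute force: weight the polynomial's value in every possible world."""
    names = sorted(DIST)
    total = Fraction(0)
    for world in product(*(range(len(DIST[v])) for v in names)):
        mult = dict(zip(names, world))
        prob = prod(DIST[v][mult[v]] for v in names)
        value = sum(prod(mult[v] ** d for v, d in mono.items()) for mono in poly)
        total += prob * value
    return total
```

Exact rational arithmetic (`Fraction`) is used so the two computations can be compared with `==` rather than a tolerance.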

preprint 2022 arXiv

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
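
The tiling idea can be illustrated without a GPU. The sketch below is pure Python and is not the FlashAttention kernel: it processes keys and values one block at a time and keeps a running maximum, normalizer, and weighted sum per query (an "online softmax"), so the full row of attention scores is never held at once. The function names, the `block` parameter, and the omission of the usual $1/\sqrt{d}$ scaling are assumptions made for brevity. The result is exact attention, matching the standard computation up to floating-point rounding.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def standard_attention(Q, K, V):
    """Reference: softmax(Q K^T) V, materializing every score row."""
    out = []
    for q in Q:
        scores = [dot(q, k) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but K and V are consumed block by block."""
    d = len(V[0])
    out = []
    for q in Q:
        m = -math.inf      # running maximum of the scores seen so far
        z = 0.0            # running softmax normalizer, relative to m
        acc = [0.0] * d    # running weighted sum of value rows, relative to m
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            scores = [dot(q, k) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block exp(-inf) is 0.0, which is harmless.
            scale = math.exp(m - m_new)
            z *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, Vb):
                w = math.exp(s - m_new)
                z += w
                for c in range(d):
                    acc[c] += w * v[c]
            m = m_new
        out.append([a / z for a in acc])
    return out
```

In the setting the abstract describes, each block would be loaded from HBM into on-chip SRAM once; here the blocks are just list slices, so the sketch shows only the arithmetic that makes block-wise processing exact.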

preprint 2022 arXiv

General Strong Polarization

Arikan's exciting discovery of polar codes has provided an altogether new way to efficiently achieve Shannon capacity. Given a (constant-sized) invertible matrix $M$, a family of polar codes can be associated with this matrix and its ability to approach capacity follows from the polarization of an associated $[0,1]$-bounded martingale, namely its convergence in the limit to either $0$ or $1$. Arikan showed polarization of the martingale associated with the matrix $G_2 = \left(\begin{matrix} 1 & 0 \\ 1 & 1 \end{matrix}\right)$ to get capacity achieving codes. His analysis was later extended to all matrices $M$ that satisfy an obvious necessary condition for polarization. While Arikan's theorem does not guarantee that the codes achieve capacity at small blocklengths, it turns out that a "strong" analysis of the polarization of the underlying martingale would lead to such constructions. Indeed for the martingale associated with $G_2$ such a strong polarization was shown in two independent works ([Guruswami and Xia, IEEE IT '15] and [Hassani et al., IEEE IT '14]), resolving a major theoretical challenge of the efficient attainment of Shannon capacity. In this work we extend the result above to cover martingales associated with all matrices that satisfy the necessary condition for (weak) polarization. In addition to being vastly more general, our proofs of strong polarization are also simpler and modular. Specifically, our result shows strong polarization over all prime fields and leads to efficient capacity-achieving codes for arbitrary symmetric memoryless channels. We show how to use our analyses to achieve exponentially small error probabilities at lengths inverse polynomial in the gap to capacity. Indeed we show that we can essentially match any error probability with lengths that are only inverse polynomial in the gap to capacity.
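
As a small illustration of the $[0,1]$-bounded martingale for $G_2$, the sketch below specializes to the binary erasure channel, an assumption made here because in that case the recursion is explicit: a channel with erasure probability $z$ splits into two synthetic channels with erasure probabilities $2z - z^2$ and $z^2$. The function names and the threshold `eps` are hypothetical. The code tracks all synthetic channels after a given number of levels; it does not reproduce the paper's analysis for general matrices or channels.

```python
def polarize(z, levels):
    """Erasure probabilities of the 2**levels synthetic channels of a BEC(z)."""
    params = [z]
    for _ in range(levels):
        # Each channel splits into a worse one (2p - p^2) and a better one (p^2).
        params = [w for p in params for w in (2 * p - p * p, p * p)]
    return params


def mean(params):
    return sum(params) / len(params)


def unpolarized_fraction(params, eps=0.01):
    """Fraction of channels whose erasure probability is still in (eps, 1 - eps)."""
    return sum(1 for p in params if eps < p < 1 - eps) / len(params)
```

Because $(2p - p^2 + p^2)/2 = p$, the average erasure probability is preserved at every level, which is the martingale property; polarization shows up as `unpolarized_fraction` shrinking as `levels` grows.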

preprint 2022 arXiv

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.

preprint 2022 arXiv

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency--quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution. These properties of Monarch matrices unlock new ways to train and fine-tune sparse and dense models. We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and Wikitext-103 language modeling by 2x with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called "reverse sparsification," Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.

preprint 2022 arXiv

Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices. As butterfly matrices are not hardware efficient, we propose simple variants of butterfly (block and flat) to take advantage of modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity pattern based on flat block butterfly and low-rank matrices to sparsify most network layers (e.g., attention, MLP). We empirically validate that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy--efficiency tradeoffs. On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in accuracy.

preprint 2021 arXiv

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (K-matrices) that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that K-matrices can be automatically learned within end-to-end pipelines to replace hand-crafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5%. K-matrices can also simplify hand-engineered pipelines -- we replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition, K-matrices can capture latent structure in models: for a challenging permuted image classification task, a K-matrix based representation of permutations is able to learn the right latent structure and improves accuracy of a downstream convolutional model by over 9%. We provide a practically efficient implementation of our approach, and use K-matrices in a Transformer network to attain 36% faster end-to-end inference speed on a language translation task.

preprint 2020 arXiv

Covering the Relational Join

In this paper, we initiate a theoretical study of what we call the join covering problem. We are given a natural join query instance $Q$ on $n$ attributes and $m$ relations $(R_i)_{i \in [m]}$. Let $J_{Q} = \Join_{i=1}^m R_i$ denote the join output of $Q$. In addition to $Q$, we are given a parameter $Δ: 1\le Δ\le n$ and our goal is to compute the smallest subset $\mathcal{T}_{Q, Δ} \subseteq J_{Q}$ such that every tuple in $J_{Q}$ is within Hamming distance $Δ- 1$ from some tuple in $\mathcal{T}_{Q, Δ}$. The join covering problem captures both computing the natural join from database theory and constructing a covering code with covering radius $Δ- 1$ from coding theory, as special cases. We consider the combinatorial version of the join covering problem, where our goal is to determine the worst-case $|\mathcal{T}_{Q, Δ}|$ in terms of the structure of $Q$ and value of $Δ$. One obvious approach to upper bound $|\mathcal{T}_{Q, Δ}|$ is to exploit a distance property (of Hamming distance) from coding theory and combine it with the worst-case bounds on output size of natural joins (AGM bound hereon) due to Atserias, Grohe and Marx [SIAM J. of Computing'13]. Somewhat surprisingly, this approach is not tight even for the case when the input relations have arity at most two. Instead, we show that using the polymatroid degree-based bound of Abo Khamis, Ngo and Suciu [PODS'17] in place of the AGM bound gives us a tight bound (up to constant factors) on $|\mathcal{T}_{Q, Δ}|$ for the arity two case. We prove lower bounds for $|\mathcal{T}_{Q, Δ}|$ using well-known classes of error-correcting codes, e.g., Reed-Solomon codes. We can extend our results for the arity two case to general arity with a polynomial gap between our upper and lower bounds.

preprint 2020 arXiv

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N \log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions $N$ up to $1024$. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points -- the first time a structured approach has done so -- with 4X faster inference speed and 40X fewer parameters.
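
The "products of sparse matrices" view can be seen in the fixed, hand-written case that the abstract says the learned parameterization recovers. The sketch below is a standard radix-2 Cooley-Tukey FFT written as a bit-reversal permutation followed by $\log_2 N$ butterfly stages, where each stage acts like a sparse matrix with two nonzeros per row. Nothing here is learned and the function names are illustrative only; the paper's method instead treats the butterfly entries as trainable parameters.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder x so that index i moves to the bit-reversal of i."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]


def butterfly_stage(x, half):
    """Apply one butterfly factor: combine entries that are `half` apart."""
    out = list(x)
    for start in range(0, len(x), 2 * half):
        for j in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * j / (2 * half))
            a = x[start + j]
            b = twiddle * x[start + j + half]
            out[start + j] = a + b
            out[start + j + half] = a - b
    return out


def fft_via_butterflies(x):
    """DFT of x (length a power of two, at least 2) using log2(N) sparse stages."""
    n = len(x)
    if n < 2 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two and at least 2")
    y = bit_reverse_permute(x)
    half = 1
    while half < n:
        y = butterfly_stage(y, half)
        half *= 2
    return y


def naive_dft(x):
    """Dense O(N^2) reference: multiply by the full DFT matrix."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

Each call to `butterfly_stage` touches every entry once, so the total work is $O(N \log N)$, compared with $O(N^2)$ for `naive_dft`.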

preprint 2020 arXiv

Topology Dependent Bounds For FAQs

In this paper, we prove topology dependent bounds on the number of rounds needed to compute Functional Aggregate Queries (FAQs) studied by Abo Khamis et al. [PODS 2016] in a synchronous distributed network under the model considered by Chattopadhyay et al. [FOCS 2014, SODA 2017]. Unlike the recent work on computing database queries in the Massively Parallel Computation model, in the model of Chattopadhyay et al., nodes can communicate only via private point-to-point channels and we are interested in bounds that work over an arbitrary communication topology. This is the first work to consider more practically motivated problems in this distributed model. For the sake of exposition, we focus on two special problems in this paper: Boolean Conjunctive Query (BCQ) and computing variable/factor marginals in Probabilistic Graphical Models (PGMs). We obtain tight bounds on the number of rounds needed to compute such queries as long as the underlying hypergraph of the query is $O(1)$-degenerate and has $O(1)$-arity. In particular, the $O(1)$-degeneracy condition covers most well-studied queries that are efficiently computable in the centralized computation model like queries with constant treewidth. These tight bounds depend on a new notion of `width' (namely internal-node-width) for Generalized Hypertree Decompositions (GHDs) of acyclic hypergraphs, which minimizes the number of internal nodes in a sub-class of GHDs. To the best of our knowledge, this width has not been studied explicitly in the theoretical database literature. Finally, we consider the problem of computing the product of a vector with a chain of matrices and prove tight bounds on its round complexity (over the finite field of two elements) using a novel min-entropy based argument.

preprint 2016 arXiv

Joins via Geometric Resolutions: Worst-case and Beyond

We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in B-trees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution; transforming the algorithmic problem of computing joins to a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.

preprint 2016 arXiv

Tight Network Topology Dependent Bounds on Rounds of Communication

We prove tight network topology dependent bounds on the round complexity of computing well studied $k$-party functions such as set disjointness and element distinctness. Unlike the usual case in the CONGEST model in distributed computing, we fix the function and then vary the underlying network topology. This complements the recent such results on total communication that have received some attention. We also present some applications to distributed graph computation problems. Our main contribution is a proof technique that allows us to reduce the problem on a general graph topology to a relevant two-party communication complexity problem. However, unlike many previous works that also used the same high level strategy, we do not reason about a two-party communication problem that is induced by a cut in the graph. To `stitch' back the various lower bounds from the two party communication problems, we use the notion of timed graph that has seen prior use in network coding. Our reductions use some tools from Steiner tree packing and multi-commodity flow problems that have a delay constraint.

preprint 2015 arXiv

Join Processing for Graph Patterns: An Old Dog with New Tricks

Join optimization has been dominated by Selinger-style, pairwise optimizers for decades. But Selinger-style algorithms are asymptotically suboptimal for applications in graph analytics. This suboptimality is one of the reasons that many have advocated supplementing relational engines with specialized graph processing engines. Recently, new join algorithms have been discovered that achieve optimal worst-case run times for any join or even so-called beyond worst-case (or instance optimal) run time guarantees for specialized classes of joins. These new algorithms match or improve on those used in specialized graph-processing systems. This paper asks: can these new join algorithms allow relational engines to close the performance gap with graph engines? We examine this question for graph-pattern queries or join queries. We find that classical relational databases like Postgres and MonetDB, or newer graph databases/stores like Virtuoso and Neo4j, may be orders of magnitude slower than a fully featured RDBMS, LogicBlox, that uses these new ideas. Our results demonstrate that an RDBMS with such new algorithms can perform as well as specialized engines like GraphLab -- while retaining a high-level interface. We hope this adds to the ongoing debate of the role of graph accelerators, new graph systems, and relational systems in modern workloads.

preprint 2015 arXiv

The Range of Topological Effects on Communication

We continue the study of communication cost of computing functions when inputs are distributed among $k$ processors, each of which is located at one vertex of a network/graph called a terminal. Every other node of the network also has a processor, with no input. The communication is point-to-point and the cost is the total number of bits exchanged by the protocol, in the worst case, on all edges. Chattopadhyay, Radhakrishnan and Rudra (FOCS'14) recently initiated a study of the effect of topology of the network on the total communication cost using tools from $L_1$ embeddings. Their techniques provided tight bounds for simple functions like Element-Distinctness (ED), which depend on the 1-median of the graph. This work addresses two other kinds of natural functions. We show that for a large class of natural functions like Set-Disjointness the communication cost is essentially $n$ times the cost of the optimal Steiner tree connecting the terminals. Further, we show for natural composed functions like $\text{ED} \circ \text{XOR}$ and $\text{XOR} \circ \text{ED}$, the naive protocols suggested by their definition are optimal for general networks. Interestingly, the bounds for these functions depend on more involved topological parameters that are a combination of Steiner tree and 1-median costs. To obtain our results, we use some new tools in addition to ones used in Chattopadhyay et al. These include (i) viewing the communication constraints via a linear program; (ii) using tools from the theory of tree embeddings to prove topology sensitive direct sum results that handle the case of composed functions and (iii) representing the communication constraints of certain problems as a family of collection of multiway cuts, where each multiway cut simulates the hardness of computing the function on the star topology.

preprint 2014 arXiv

Beyond Worst-Case Analysis for Joins with Minesweeper

We describe a new algorithm, Minesweeper, that is able to satisfy stronger runtime guarantees than previous join algorithms (colloquially, `beyond worst-case guarantees') for data in indexed search trees. Our first contribution is developing a framework to measure this stronger notion of complexity, which we call certificate complexity, that extends notions of Barbay et al. and Demaine et al.; a certificate is a set of propositional formulae that certifies that the output is correct. This notion captures a natural class of join algorithms. In addition, the certificate allows us to define a strictly stronger notion of runtime complexity than traditional worst-case guarantees. Our second contribution is to develop a dichotomy theorem for the certificate-based notion of complexity. Roughly, we show that Minesweeper evaluates $β$-acyclic queries in time linear in the certificate plus the output size, while for any $β$-cyclic query there is some instance that takes superlinear time in the certificate (and for which the output is no larger than the certificate size). We also extend our certificate-complexity analysis to queries with bounded treewidth and the triangle query.

preprint 2014 arXiv

It'll probably work out: improved list-decoding through random operations

In this work, we introduce a framework to study the effect of random operations on the combinatorial list-decodability of a code. The operations we consider correspond to row and column operations on the matrix obtained from the code by stacking the codewords together as columns. This captures many natural transformations on codes, such as puncturing, folding, and taking subcodes; we show that many such operations can improve the list-decoding properties of a code. There are two main points to this. First, our goal is to advance our (combinatorial) understanding of list-decodability, by understanding what structure (or lack thereof) is necessary to obtain it. Second, we use our more general results to obtain a few interesting corollaries for list decoding: (1) We show the existence of binary codes that are combinatorially list-decodable from a $1/2-ε$ fraction of errors with optimal rate $Ω(ε^2)$ that can be encoded in linear time. (2) We show that any code with $Ω(1)$ relative distance, when randomly folded, is combinatorially list-decodable from a $1-ε$ fraction of errors with high probability. This formalizes the intuition for why the folding operation has been successful in obtaining codes with optimal list decoding parameters; previously, all arguments used algebraic methods and worked only with specific codes. (3) We show that any code which is list-decodable with suboptimal list sizes has many subcodes which have near-optimal list sizes, while retaining the error correcting capabilities of the original code. This generalizes recent results where subspace evasive sets have been used to reduce list sizes of codes that achieve list decoding capacity.

preprint 2014 arXiv

Sparse Approximation, List Decoding, and Uncertainty Principles

We consider list versions of sparse approximation problems, where unlike the existing results in sparse approximation that consider situations with unique solutions, we are interested in multiple solutions. We introduce these problems and present the first combinatorial results on the output list size. These generalize and enhance some of the existing results on threshold phenomenon and uncertainty principles in sparse approximations. Our definitions and results are inspired by similar results in list decoding. We also present lower bound examples that bolster our results and show they are of the appropriate size.

preprint 2013 arXiv

Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing

In order to overcome the limitations imposed by DNA barcoding when multiplexing a large number of samples in the current generation of high-throughput sequencing instruments, we have recently proposed a new protocol that leverages advances in combinatorial pooling design (group testing) [doi:10.1371/journal.pcbi.1003010]. We have also demonstrated how this new protocol would enable de novo selective sequencing and assembly of large, highly-repetitive genomes. Here we address the problem of decoding pooled sequenced data obtained from such a protocol. Our algorithm employs a synergistic combination of ideas from compressed sensing and the decoding of error-correcting codes. Experimental results on synthetic data for the rice genome and real data for the barley genome show that our novel decoding algorithm enables significantly higher quality assemblies than the previous approach.

preprint 2013 arXiv

Every list-decodable code for high noise has abundant near-optimal rate puncturings

We show that any $q$-ary code with sufficiently good distance can be randomly punctured to obtain, with high probability, a code that is list decodable up to radius $1 - 1/q - ε$ with near-optimal rate and list sizes. Our results imply that "most" Reed-Solomon codes are list decodable beyond the Johnson bound, settling the long-standing open question of whether any Reed-Solomon codes meet this criterion. More precisely, we show that a Reed-Solomon code with random evaluation points is, with high probability, list decodable up to radius $1 - ε$ with list sizes $O(1/ε)$ and rate $Ω(ε)$. As a second corollary of our argument, we obtain improved bounds on the list decodability of random linear codes over large fields. Our approach exploits techniques from high dimensional probability. Previous work used similar tools to obtain bounds on the list decodability of random linear codes, but the bounds did not scale with the size of the alphabet. In this paper, we use a chaining argument to deal with large alphabet sizes.

preprint 2013 arXiv

L2/L2-foreach sparse recovery with low risk

In this paper, we consider the "foreach" sparse recovery problem with failure probability $p$. The goal is to design a distribution over $m \times N$ matrices $Φ$ and a decoding algorithm $\mathcal{A}$ such that for every $\mathbf{x}\in\mathbb{R}^N$, we have the following error guarantee with probability at least $1-p$: \[\|\mathbf{x}-\mathcal{A}(Φ\mathbf{x})\|_2\le C\|\mathbf{x}-\mathbf{x}_k\|_2,\] where $C$ is a constant (ideally arbitrarily close to 1) and $\mathbf{x}_k$ is the best $k$-sparse approximation of $\mathbf{x}$. Much of the sparse recovery or compressive sensing literature has focused on the case of either $p = 0$ or $p = Ω(1)$. We initiate the study of this problem for the entire range of failure probability. Our two main results are as follows: (1) We prove a lower bound on $m$, the number of measurements, of $Ω(k\log(N/k)+\log(1/p))$ for $2^{-Θ(N)}\le p <1$. Cohen, Dahmen, and DeVore prove that this bound is tight. (2) We prove nearly matching upper bounds for sub-linear time decoding. Previous such results addressed only $p = Ω(1)$. Our results and techniques lead to the following corollaries: (i) the first ever sub-linear time decoding $\ell_1/\ell_1$ "forall" sparse recovery system that requires a $\log^γ{N}$ extra factor (for some $γ<1$) over the optimal $O(k\log(N/k))$ number of measurements, and (ii) extensions of the results of Gilbert et al. for information-theoretically bounded adversaries.
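
The error guarantee in this abstract compares a recovered vector against the best $k$-sparse approximation of the signal. The sketch below only spells out that comparison for explicit vectors; it does not construct a measurement matrix or a decoder, and the helper names (`best_k_sparse`, `satisfies_guarantee`) are hypothetical. The best $k$-sparse approximation is obtained by keeping the $k$ entries of largest magnitude and zeroing the rest.

```python
import math


def best_k_sparse(x, k):
    """Keep the k largest-magnitude entries of x and zero out the others."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))


def satisfies_guarantee(x, x_hat, k, C):
    """Check ||x - x_hat||_2 <= C * ||x - x_k||_2 for a candidate output x_hat."""
    tail = l2_norm([a - b for a, b in zip(x, best_k_sparse(x, k))])
    error = l2_norm([a - b for a, b in zip(x, x_hat)])
    return error <= C * tail
```

In the "foreach" setting the check would have to hold with probability at least $1-p$ over the random choice of the measurement matrix for each fixed signal; the sketch evaluates a single outcome.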

preprint 2013 arXiv

Skew Strikes Back: New Developments in the Theory of Join Algorithms

Evaluating the relational join is one of the central algorithmic and most well-studied problems in database systems. A staggering number of variants have been considered, including Block-Nested loop join, Hash-Join, Grace, and Sort-merge. Commercial database engines use finely tuned join heuristics that take into account a wide variety of factors including the selectivity of various predicates, memory, IO, etc. In spite of this study of join queries, the textbook description of join processing is suboptimal. This survey describes recent results on join algorithms that have provable worst-case optimality runtime guarantees. We survey recent work and provide a simpler and unified description of these algorithms that we hope is useful for theory-minded readers, algorithm designers, and systems implementors.
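
One way to picture the contrast with pairwise join plans is the triangle query $R(a,b) \Join S(b,c) \Join T(a,c)$. The sketch below is an illustrative attribute-at-a-time evaluation in the spirit of the worst-case optimal algorithms such a survey covers, next to a pairwise plan that first materializes $R \Join S$. It is not taken from the survey; the function names are hypothetical, and it does not enforce the intersection cost (proportional to the smaller side) that the worst-case guarantee relies on.

```python
from collections import defaultdict


def index_by_first(rel):
    """Map each first-column value to the set of second-column values."""
    idx = defaultdict(set)
    for u, v in rel:
        idx[u].add(v)
    return idx


def triangles_attribute_at_a_time(R, S, T):
    """Bind a, then b, then c, intersecting candidate sets at every step."""
    R_ab = index_by_first(R)   # a -> {b}
    S_bc = index_by_first(S)   # b -> {c}
    T_ac = index_by_first(T)   # a -> {c}
    out = set()
    for a in R_ab.keys() & T_ac.keys():        # a must appear in R and T
        for b in S_bc.keys() & R_ab[a]:        # b must follow a in R and appear in S
            for c in S_bc[b] & T_ac[a]:        # c must follow b in S and a in T
                out.add((a, b, c))
    return out


def triangles_pairwise(R, S, T):
    """Pairwise plan: build all of R join S, then filter with T."""
    rs = {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}
    return {(a, b, c) for (a, b, c) in rs if (a, c) in T}
```

The pairwise plan can build an intermediate result far larger than the final output, which is the kind of blow-up the attribute-at-a-time strategy is designed to avoid.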

preprint 2012 arXiv

Almost Universal Hash Families are also Storage Enforcing

We show that every almost universal hash function also has the storage enforcement property. Almost universal hash functions have found numerous applications and we show that this new storage enforcement property allows the application of almost universal hash functions in a wide range of remote verification tasks: (i) Proof of Secure Erasure (where we want to remotely erase and securely update the code of a compromised machine with a memory-bounded adversary), (ii) Proof of Ownership (where a storage server wants to check if a client has the data it claims to have before giving access to deduplicated data) and (iii) Data possession (where the client wants to verify whether the remote storage server is storing its data). Specifically, the storage enforcement guarantee in the classical data possession problem removes any practical incentive for the storage server to cheat the client by saving on storage space. The proof of our result relies on a natural combination of Kolmogorov Complexity and List Decoding. To the best of our knowledge this is the first work that combines these two techniques. We believe the newly introduced storage enforcement property of almost universal hash functions will open promising avenues of exciting research under the memory-bounded (bounded storage) adversary model.

preprint 2012 arXiv

Analyzing Nonblocking Switching Networks using Linear Programming (Duality)

The main task in analyzing a switching network design (including circuit-, multirate-, and photonic-switching) is to determine the minimum number of some switching components so that the design is non-blocking in some sense (e.g., strict- or wide-sense). We show that, in many cases, this task can be accomplished with a simple two-step strategy: (1) formulate a linear program whose optimum value is a bound for the minimum number we are seeking, and (2) specify a solution to the dual program, whose objective value by weak duality immediately yields a sufficient condition for the design to be non-blocking. We illustrate this technique through a variety of examples, ranging from circuit to multirate to photonic switching, from unicast to $f$-cast and multicast, and from strict- to wide-sense non-blocking. The switching architectures in the examples are of Clos-type and Banyan-type, which are the two most popular architectural choices for designing non-blocking switching networks. To prove the result in the multirate Clos network case, we formulate a new problem called Dynamic Weighted Edge Coloring, which generalizes the Dynamic Bin Packing problem. We then design an algorithm with competitive ratio 5.6355 for the problem. The algorithm is analyzed using the linear programming technique. A new upper bound for multirate wide-sense non-blocking Clos networks follows, improving upon a decade-old bound on the same problem.
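
The weak duality step in this two-step strategy can be written out in general form. The block below is the standard textbook statement, not the paper's specific linear programs. The symbols $A$, $b$, $c$, $x$, $y$ are generic placeholders, and the choice of a maximization primal is an assumption of this sketch.

```latex
% Generic primal/dual pair (placeholders, not the paper's LPs).
\[
\text{(P)}\quad \max\; c^{\top}x \;\;\text{s.t.}\;\; Ax \le b,\; x \ge 0
\qquad\qquad
\text{(D)}\quad \min\; b^{\top}y \;\;\text{s.t.}\;\; A^{\top}y \ge c,\; y \ge 0
\]
% Weak duality: for any feasible x of (P) and any feasible y of (D),
\[
c^{\top}x \;\le\; (A^{\top}y)^{\top}x \;=\; y^{\top}(Ax) \;\le\; y^{\top}b .
\]
% Hence a single feasible dual solution y certifies
\[
\mathrm{OPT}(\mathrm{P}) \;\le\; b^{\top}y .
\]
```

Under that assumption, exhibiting one feasible dual solution is enough to bound the primal optimum without solving the primal exactly. This is what lets step (2) yield a sufficient condition directly.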

preprint 2012 arXiv

Simulating Special but Natural Quantum Circuits

We identify a sub-class of BQP that captures certain structural commonalities among many quantum algorithms including Shor's algorithms. This class does not contain all of BQP (e.g. Grover's algorithm does not fall into this class). Our main result is that any algorithm in this class that measures at most O(log n) qubits can be simulated by classical randomized polynomial time algorithms. This does not dequantize Shor's algorithm (as the latter measures n qubits) but our work also highlights a new potentially hard function for cryptographic applications. Our main technical contribution is (to the best of our knowledge) a new exact characterization of certain sums of Fourier-type coefficients (with exponentially many summands).

preprint 2012 arXiv

Worst-case Optimal Join Algorithms

Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer's entropy inequality, which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running time achieves these optimal bounds. An answer to this question may be interesting to database practice, as it is known that any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS is asymptotically slower than the optimal for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer's inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.
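
As an illustration of the gap with pairwise plans, consider the triangle query $R(a,b) \bowtie S(b,c) \bowtie T(a,c)$. When each relation has $N$ tuples, the fractional cover bound gives at most $N^{3/2}$ output tuples, while a pairwise join can produce an intermediate result of size up to $N^2$. The sketch below binds one variable at a time and intersects candidate sets by scanning the smaller side. It is written in the spirit of worst-case optimal joins and is not the algorithm from the paper; the function names and the set-based indexes are assumptions of this example.

```python
from collections import defaultdict


def index(pairs):
    """Build a hash index: first attribute -> set of second attributes."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx


def intersect(s, t):
    """Intersect two collections by scanning the smaller and probing the larger."""
    if len(s) > len(t):
        s, t = t, s
    return [v for v in s if v in t]


def triangle_join(R, S, T):
    """Enumerate all (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T."""
    r_ab, s_bc, t_ac = index(R), index(S), index(T)
    out = []
    # Bind a: it must appear as a first attribute in both R and T.
    for a in intersect(r_ab.keys(), t_ac.keys()):
        # Bind b: it must be paired with a in R and start some tuple of S.
        for b in intersect(r_ab[a], s_bc.keys()):
            # Bind c: it must be paired with b in S and with a in T.
            for c in intersect(s_bc[b], t_ac[a]):
                out.append((a, b, c))
    return out
```

For example, `triangle_join({(1, 2), (1, 3)}, {(2, 3), (3, 4)}, {(1, 3), (1, 4)})` yields the two triangles `(1, 2, 3)` and `(1, 3, 4)`, in an order that depends on set iteration.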

preprint 2011 arXiv

An FPTAS for the Lead-Based Multiple Video Transmission (LMVT) Problem

The Lead-Based Multiple Video Transmission (LMVT) problem is motivated by applications in managing the quality of experience (QoE) of video streaming for mobile clients. In an earlier work, the LMVT problem was shown to be NP-hard for a specific bit-to-lead conversion function $ϕ$. In this work, we show the problem to be NP-hard even if the function $ϕ$ is linear. We then design a fully polynomial time approximation scheme (FPTAS) for the problem. This problem is exactly equivalent to the Santa Claus Problem, on which there has been a lot of work of late.

preprint 2011 arXiv

Storage Enforcement with Kolmogorov Complexity and List Decoding

We consider the following problem that arises in outsourced storage: a user stores her data $x$ on a remote server but wants to audit the server at some later point to make sure it actually did store $x$. The goal is to design a (randomized) verification protocol that has the property that if the server passes the verification with some reasonably high probability then the user can rest assured that the server is storing $x$. In this work we present an optimal solution (in terms of the user's storage and communication) while at the same time ensuring that a server that passes the verification protocol with any reasonable probability will store, to within a small additive factor, $C(x)$ bits of information, where $C(x)$ is the plain Kolmogorov complexity of $x$. (Since we cannot prevent the server from compressing $x$, $C(x)$ is a natural upper bound.) The proof of security of our protocol combines Kolmogorov complexity with list decoding and, unlike previous work that relies upon cryptographic assumptions, we allow the server to have unlimited computational power. To the best of our knowledge, this is the first work that combines Kolmogorov complexity and list decoding. Our framework is general enough to capture extensions where the user splits up $x$ and stores the fragments across multiple servers, and our verification protocol can handle non-responsive servers and colluding servers. As a by-product, we also get a proof of retrievability. Finally, our results also have an application in 'storage enforcement' schemes, which in turn have an application in trying to update a remote server that is potentially infected with a virus.

preprint 2010 arXiv

Data Stream Algorithms for Codeword Testing

Motivated by applications in storage systems and property testing, we study data stream algorithms for local testing and tolerant testing of codes. Ideally, we would like to know whether there exist asymptotically good codes that can be locally/tolerantly tested with one-pass, poly-log space data stream algorithms. We show that for the error detection problem (and hence, the local testing problem), there exists a one-pass, log-space data stream algorithm for a broad class of asymptotically good codes, including the Reed-Solomon (RS) code and expander codes. In our technically more involved result, we give a one-pass, $O(e\log^2{n})$-space algorithm for RS (and related) codes with dimension $k$ and block length $n$ that can distinguish between the cases when the Hamming distance between the received word and the code is at most $e$ and at least $a\cdot e$ for some absolute constant $a>1$. For RS codes with random errors, we can obtain $e\le O(n/k)$. For folded RS codes, we obtain similar results for worst-case errors as long as $e\le (n/k)^{1-\epsilon}$ for any constant $\epsilon>0$. These results follow by reducing the tolerant testing problem to the error detection problem using results from group testing and the list decodability of the code. We also show that using our techniques, the space requirement and the upper bound of $e\le O(n/k)$ cannot be improved by more than logarithmic factors.

preprint 2010 arXiv

Two Theorems in List Decoding

We prove the following results concerning the list decoding of error-correcting codes: (i) We show that for any code with a relative distance of $δ$ (over a large enough alphabet), the following result holds for random errors: With high probability, for a $ρ\le δ-\epsilon$ fraction of random errors (for any $\epsilon>0$), the received word will have only the transmitted codeword in a Hamming ball of radius $ρ$ around it. Thus, for random errors, one can correct twice the number of errors uniquely correctable from worst-case errors for any code. A variant of our result also gives a simple algorithm to decode Reed-Solomon codes from random errors that, to the best of our knowledge, runs faster than known algorithms for certain ranges of parameters. (ii) We show that concatenated codes can achieve the list decoding capacity for erasures. A similar result for worst-case errors was proven by Guruswami and Rudra (SODA 08), although their result does not directly imply our result. Our results show that a subset of the random ensemble of codes considered by Guruswami and Rudra also achieves the list decoding capacity for erasures. Our proofs employ simple counting and probabilistic arguments.
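
The factor of two in (i) can be read off by comparing decoding radii. The block below is a small worked comparison using the standard unique-decoding radius. The value $δ = 1/2$ is chosen here purely for illustration and does not come from the paper.

```latex
\[
\rho_{\text{worst-case}} < \frac{\delta}{2}
\qquad\text{versus}\qquad
\rho_{\text{random}} \le \delta - \epsilon .
\]
% Illustrative numbers: with \delta = 1/2,
\[
\rho_{\text{worst-case}} < \frac{1}{4},
\qquad
\rho_{\text{random}} \le \frac{1}{2} - \epsilon .
\]
```

So for this illustrative $δ$, unique decoding from worst-case errors is guaranteed only below a $1/4$ fraction of errors, while the random-error result allows any fraction up to $1/2 - \epsilon$.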

preprint 2010 arXiv

When LP is the Cure for Your Matching Woes: Approximating Stochastic Matchings

The results in this paper have been merged with the results in arXiv:1002.3763v1. The authors would like to withdraw this version. Please see arXiv:1008.5356v1 for the merged version.

preprint 2010 arXiv

When LP is the Cure for Your Matching Woes: Improved Bounds for Stochastic Matchings

Consider a random graph model where each possible edge $e$ is present independently with some probability $p_e$. Given these probabilities, we want to build a large/heavy matching in the randomly generated graph. However, the only way we can find out whether an edge is present or not is to query it, and if the edge is indeed present in the graph, we are forced to add it to our matching. Further, each vertex $i$ is allowed to be queried at most $t_i$ times. How should we adaptively query the edges to maximize the expected weight of the matching? We consider several matching problems in this general framework (some of which arise in kidney exchanges and online dating, and others arise in modeling online advertisements); we give LP-rounding based constant-factor approximation algorithms for these problems. Our main results are the following: We give a 4-approximation for weighted stochastic matching on general graphs, and a 3-approximation on bipartite graphs. This answers an open question from [Chen et al., ICALP 09]. Combining our LP-rounding algorithm with the natural greedy algorithm, we give an improved 3.46-approximation for unweighted stochastic matching on general graphs. We introduce a generalization of the stochastic online matching problem [Feldman et al., FOCS 09] that also models preference-uncertainty and timeouts of buyers, and give a constant-factor approximation algorithm.
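
The query-and-commit model can be made concrete with a short simulation. The sketch below implements a simple greedy probing rule, not the LP-rounding algorithm from the paper. Probing edges in decreasing order of $p_e$ is an assumption of this example, as are the function name, the `(u, v, p)` edge format, and the `patience` dictionary standing in for the limits $t_i$.

```python
import random


def greedy_probe(edges, patience, rng=None):
    """Simulate one run of greedy probing in the query-and-commit model.

    edges:    list of (u, v, p) with p the probability that edge (u, v) exists
    patience: dict mapping each vertex i to its query limit t_i
    Returns the list of matched edges (u, v).
    """
    rng = rng or random.Random()
    remaining = dict(patience)  # copy so the caller's limits are not mutated
    matched = set()
    matching = []
    # Assumed greedy order: most likely edges first (stable for ties).
    for u, v, p in sorted(edges, key=lambda e: -e[2]):
        if u in matched or v in matched:
            continue  # an endpoint is already committed
        if remaining[u] <= 0 or remaining[v] <= 0:
            continue  # an endpoint has no queries left
        # Querying the edge consumes one unit of patience at both endpoints.
        remaining[u] -= 1
        remaining[v] -= 1
        if rng.random() < p:
            # The edge exists, so we are forced to add it to the matching.
            matched.update((u, v))
            matching.append((u, v))
    return matching
```

Averaging `len(greedy_probe(...))` over many runs gives an empirical estimate of the expected matching size for this rule on a given instance.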

31 published item(s)

Computing expected multiplicities for bag-TIDBs with bounded multiplicities

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

General Strong Polarization

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

Covering the Relational Join

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Topology Dependent Bounds For FAQs

Joins via Geometric Resolutions: Worst-case and Beyond

Tight Network Topology Dependent Bounds on Rounds of Communication

Join Processing for Graph Patterns: An Old Dog with New Tricks

The Range of Topological Effects on Communication

Beyond Worst-Case Analysis for Joins with Minesweeper

It'll probably work out: improved list-decoding through random operations

Sparse Approximation, List Decoding, and Uncertainty Principles

Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing

Every list-decodable code for high noise has abundant near-optimal rate puncturings

L2/L2-foreach sparse recovery with low risk

Skew Strikes Back: New Developments in the Theory of Join Algorithms

Almost Universal Hash Families are also Storage Enforcing

Analyzing Nonblocking Switching Networks using Linear Programming (Duality)

Simulating Special but Natural Quantum Circuits

Worst-case Optimal Join Algorithms

An FPTAS for the Lead-Based Multiple Video Transmission (LMVT) Problem

Storage Enforcement with Kolmogorov Complexity and List Decoding

Data Stream Algorithms for Codeword Testing

Two Theorems in List Decoding

When LP is the Cure for Your Matching Woes: Approximating Stochastic Matchings

When LP is the Cure for Your Matching Woes: Improved Bounds for Stochastic Matchings