Source author record

Rameshwar Pratap

Rameshwar Pratap appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms · Computational Complexity · Computational Geometry · Machine Learning · Databases · math.ST · Statistics Theory

Catalog footprint

What is connected

8 works

7 topics

4 close collaborators

preprint · 2022 · arXiv

Improving \textit{Tug-of-War} sketch using Control-Variates method

Computing space-efficient summaries, \textit{a.k.a. sketches}, of large data is a central problem in streaming algorithms. Such sketches are used to answer \textit{post-hoc} queries in several data analytics tasks. An algorithm for computing sketches is typically required to be fast, accurate, and space-efficient. A fundamental problem in the streaming framework is that of computing the frequency moments of data streams. The frequency moments of a sequence containing $f_i$ elements of type $i$, for $i\in [n]$, are the numbers $\mathbf{F}_k=\sum_{i=1}^n {f_i}^k$. $\mathbf{F}_k$ is the $k$-th power of the $\ell_k$ norm of the frequency vector $(f_1, f_2, \ldots, f_n)$. Another important problem is to compute the similarity between two data streams by computing the inner product of the corresponding frequency vectors. The seminal \textit{Tug-of-war} (or AMS) sketch of Alon, Matias, and Szegedy~\cite{AMS} gives a randomized sublinear-space (and linear-time) algorithm for computing the frequency moments and the inner product between the frequency vectors of two data streams. However, the variance of these estimates tends to be large. In this work, we focus on minimizing the variance of these estimates. We use techniques from the classical control-variate method~\cite{Lavenberg}, which is primarily known for variance reduction in Monte Carlo simulations, and obtain a significant variance reduction at the cost of a small computational overhead. We present a theoretical analysis of our proposal and complement it with supporting experiments on synthetic as well as real-world datasets.
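As a rough illustration of the baseline that this abstract starts from, the following is a minimal sketch of the plain Tug-of-war estimator for $\mathbf{F}_2$ and for the inner product. It does not include the control-variate correction proposed in the paper. The class and function names are hypothetical, and the sign table uses fully independent random signs over a small known universe, which is a simplification of the 4-wise independent hash functions used in the original analysis.

```python
import random


class TugOfWarSketch:
    """Plain AMS / Tug-of-war sketch (illustrative, not the paper's variant)."""

    def __init__(self, universe_size, num_counters, seed=0):
        rng = random.Random(seed)
        # Assumption: one explicit +/-1 sign per (counter, item) pair.
        # A real implementation would use 4-wise independent hashing instead.
        self.signs = [
            [rng.choice((-1, 1)) for _ in range(universe_size)]
            for _ in range(num_counters)
        ]
        self.counters = [0] * num_counters

    def update(self, item, count=1):
        # Each counter maintains Z_j = sum_i f_i * s_j(i) in one pass.
        for j, row in enumerate(self.signs):
            self.counters[j] += count * row[item]


def estimate_f2(sketch):
    # E[Z_j^2] = F_2, so averaging the squared counters estimates F_2.
    return sum(z * z for z in sketch.counters) / len(sketch.counters)


def estimate_inner_product(a, b):
    # Both sketches must be built with the same seed (same sign table).
    return sum(x * y for x, y in zip(a.counters, b.counters)) / len(a.counters)
```

For a stream that contains a single item type the estimate is exact, because every counter equals plus or minus the item's frequency; for general streams the estimate is unbiased but noisy, which is the variance the abstract sets out to reduce.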

preprint · 2022 · arXiv

One-pass additive-error subset selection for $\ell_{p}$ subspace approximation

We consider the problem of subset selection for $\ell_{p}$ subspace approximation, that is, to efficiently find a \emph{small} subset of data points such that solving the problem optimally for this subset gives a good approximation to solving the problem optimally for the original input. Previously known subset selection algorithms based on volume sampling and adaptive sampling \cite{DeshpandeV07}, for the general case of $p \in [1, \infty)$, require multiple passes over the data. In this paper, we give a one-pass subset selection with an additive approximation guarantee for $\ell_{p}$ subspace approximation, for any $p \in [1, \infty)$. Earlier subset selection algorithms that give a one-pass multiplicative $(1+ε)$ approximation work only in special cases. Cohen \textit{et al.} \cite{CohenMM17} give a one-pass subset selection that offers a multiplicative $(1+ε)$ approximation guarantee for the special case of $\ell_{2}$ subspace approximation. Mahabadi \textit{et al.} \cite{MahabadiRWZ20} give a one-pass \emph{noisy} subset selection with a $(1+ε)$ approximation guarantee for $\ell_{p}$ subspace approximation when $p \in \{1, 2\}$. Our subset selection algorithm gives a weaker, additive approximation guarantee, but it works for any $p \in [1, \infty)$.

preprint · 2020 · arXiv

Subspace approximation with outliers

The subspace approximation problem with outliers, for given $n$ points in $d$ dimensions $x_{1},\ldots, x_{n} \in R^{d}$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq α\leq 1$, is to find a $k$-dimensional linear subspace of $R^{d}$ that minimizes the sum of squared distances to its nearest $(1-α)n$ points. More generally, the $\ell_{p}$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case of robust PCA is non-trivial, and previous work requires additional assumptions on the input. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-α)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard. We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared distance error of the optimal $k$-dimensional subspace summed over the optimal $(1-α)n$ inliers is at least $δ$ times its squared-error summed over all $n$ points, for some $0 < δ\leq 1 - α$. With this assumption, we give an efficient algorithm to find a subset of $poly(k/ε) \log(1/δ) \log\log(1/δ)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+ε)$-approximation to the optimal solution. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $α$ is large, as long as the obvious condition $0 < δ\leq 1 - α$ is satisfied.
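To make the objective in this abstract concrete, here is a minimal sketch that evaluates the cost of one candidate subspace: the sum of $p$-th powers of distances over the nearest inliers. It only evaluates the objective and is not the paper's algorithm. The function names are hypothetical, the subspace is assumed to be given by an orthonormal basis, and rounding the number of discarded outliers down to $\lfloor αn \rfloor$ is an assumption made here for illustration.

```python
import math


def distance_to_subspace(point, basis):
    """Euclidean distance from `point` to span(basis).

    Assumption: `basis` is a list of orthonormal vectors.
    """
    norm_sq = sum(x * x for x in point)
    proj_sq = sum(sum(x * b for x, b in zip(point, vec)) ** 2 for vec in basis)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, norm_sq - proj_sq))


def outlier_subspace_cost(points, basis, alpha, p=2):
    """Sum of p-th powers of distances over the nearest inliers."""
    n = len(points)
    # Assumption: discard floor(alpha * n) farthest points as outliers.
    num_inliers = n - math.floor(alpha * n)
    distances = sorted(distance_to_subspace(x, basis) for x in points)
    return sum(d ** p for d in distances[:num_inliers])
```

For example, with the points `(1, 0)`, `(2, 0)`, `(3, 1)`, `(0, 5)` and the x-axis as the subspace, setting `alpha=0.25` drops the farthest point `(0, 5)` and leaves a squared-distance cost of 1, while `alpha=0` gives 26.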

preprint · 2016 · arXiv

Frequent-Itemset Mining using Locality-Sensitive Hashing

The Apriori algorithm is a classical algorithm for the frequent itemset mining problem. A significant bottleneck in Apriori is the number of I/O operations involved and the number of candidates it generates. We investigate the role of LSH techniques in overcoming these problems without adding much computational overhead. We propose randomized variations of Apriori that are based on asymmetric LSH defined over Hamming distance and Jaccard similarity.

preprint · 2016 · arXiv

Similarity preserving compressions of high dimensional sparse data

The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such high dimensional big datasets is becoming increasingly common. Even with the rapid growth in computing power, brute-force search for such a task is impractical and at times impossible. It is therefore natural to investigate techniques that compress the dimension of the dataset while preserving the similarity between data objects. In this work, we propose an efficient compression scheme that maps binary vectors into binary vectors and simultaneously preserves Hamming distance and Inner Product. The length of our compression depends only on the sparsity and is independent of the dimension of the data. Moreover, our schemes provide a one-shot solution for Hamming distance and Inner Product, and work in the streaming setting as well. In contrast with the "local projection" strategies used by most of the previous schemes, our scheme combines (using sparsity) the following two strategies: (1) partitioning the dimensions into several buckets, and (2) obtaining "global linear summaries" in each of these buckets. We generalize our scheme to real-valued data and obtain compressions for Euclidean distance, Inner Product, and $k$-way Inner Product.
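The two-step strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact construction: the abstract does not specify the summary, so the per-bucket "global linear summary" is assumed here to be the parity (XOR) of the bits that fall into the bucket, and the function names and the random bucket assignment are hypothetical.

```python
import random


def make_bucket_map(dimension, num_buckets, seed=0):
    # Step 1: assign every original dimension to one bucket at random.
    rng = random.Random(seed)
    return [rng.randrange(num_buckets) for _ in range(dimension)]


def compress(bits, bucket_map, num_buckets):
    # Step 2: one linear summary per bucket.
    # Assumption: the summary is the parity of the bucket's bits.
    out = [0] * num_buckets
    for i, bit in enumerate(bits):
        if bit:
            out[bucket_map[i]] ^= 1
    return out


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))
```

Because parity is linear over GF(2), the compressed Hamming distance never exceeds the original one and always has the same parity. The two distances coincide whenever the differing coordinates land in distinct buckets, which is likely when the vectors are sparse relative to the number of buckets.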

preprint · 2016 · arXiv

Testing Uniformity of Stationary Distribution

A random walk on a directed graph gives a Markov chain on the vertices of the graph. An important question that often arises in the context of Markov chains is whether the uniform distribution on the vertices of the graph is a stationary distribution of the Markov chain. The stationary distribution of a Markov chain is a global property of the graph. In this paper, we prove that for a regular directed graph, whether the uniform distribution on the vertices of the graph is a stationary distribution depends on a local property of the graph, namely: if (u,v) is a directed edge, then outdegree(u) is equal to indegree(v). This result also has an application to the problem of testing whether a given distribution is uniform or "far" from being uniform. This is a well-studied problem in property testing and statistics. If the distribution is the stationary distribution of the lazy random walk on a directed graph and the graph is given as an input, then how many bits of the input graph does one need to query in order to decide whether the distribution is uniform or "far" from it? This is a problem of graph property testing, and we consider it in the orientation model (introduced by Halevy et al.). We reduce this problem to testing (in the orientation model) whether a directed graph is Eulerian. Using the result of Fischer et al. on the query complexity of testing (in the orientation model) whether a graph is Eulerian, we obtain bounds on the query complexity of testing whether the stationary distribution is uniform.
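The global and local conditions mentioned in this abstract can be compared on small graphs with the sketch below. It assumes the simple random walk that moves along a uniformly random out-edge, and it assumes every vertex has at least one out-edge; the function names are hypothetical. Exact rational arithmetic is used so that the stationarity check has no rounding error.

```python
from fractions import Fraction


def degrees(n, edges):
    outdeg, indeg = [0] * n, [0] * n
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return outdeg, indeg


def uniform_is_stationary(n, edges):
    # Global check: uniform pi satisfies pi P = pi exactly when, for every
    # vertex v, the sum over in-edges (u, v) of 1 / outdeg(u) equals 1.
    outdeg, _ = degrees(n, edges)
    inflow = [Fraction(0)] * n
    for u, v in edges:
        inflow[v] += Fraction(1, outdeg[u])
    return all(x == 1 for x in inflow)


def local_property_holds(n, edges):
    # Local check from the abstract: outdeg(u) == indeg(v) on every edge.
    outdeg, indeg = degrees(n, edges)
    return all(outdeg[u] == indeg[v] for u, v in edges)
```

On the directed 3-cycle `[(0, 1), (1, 2), (2, 0)]` both checks succeed. On `[(0, 1), (0, 2), (1, 2), (2, 0)]` both fail, since the edge (0, 1) has outdegree(0) = 2 but indegree(1) = 1.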

preprint · 2013 · arXiv

Helly-Type Theorems in Property Testing

Helly's theorem is a fundamental result in discrete geometry, describing the ways in which convex sets intersect with each other. If $S$ is a set of $n$ points in $R^d$, we say that $S$ is $(k,G)$-clusterable if it can be partitioned into $k$ clusters (subsets) such that each cluster can be contained in a translated copy of a geometric object $G$. In this paper, as an application of Helly's theorem, by taking a constant size sample from $S$, we present a testing algorithm for $(k,G)$-clustering, i.e., to distinguish between two cases: when $S$ is $(k,G)$-clusterable, and when it is $ε$-far from being $(k,G)$-clusterable. A set $S$ is $ε$-far $(0<ε\leq1)$ from being $(k,G)$-clusterable if at least $εn$ points need to be removed from $S$ to make it $(k,G)$-clusterable. We solve this problem for $k=1$ and when $G$ is a symmetric convex object. For $k>1$, we solve a weaker version of this problem. Finally, as an application of our testing result, in clustering with outliers, we show that one can find the approximate clusters by querying a constant size sample, with high probability.
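The sampling idea for the case $k=1$ can be illustrated with a small sketch. Here $G$ is assumed to be an axis-parallel box with given side lengths (one example of a symmetric convex object), so a point set fits in a translated copy of $G$ exactly when its extent along each axis is at most the corresponding side length. The function names and the `sample_size` parameter are hypothetical; the abstract says only that the sample has constant size and does not state the constant.

```python
import random


def fits_in_translate_of_box(points, widths):
    # A set fits in some translate of the box iff its spread along
    # every axis is at most the box's side length on that axis.
    return all(
        max(p[a] for p in points) - min(p[a] for p in points) <= widths[a]
        for a in range(len(widths))
    )


def accepts_one_clusterable(points, widths, sample_size, seed=0):
    # Tester: look only at a random sample and accept iff the sample
    # itself fits in a translated copy of the box.
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return fits_in_translate_of_box(sample, widths)
```

A tester of this shape has one-sided error: if the whole set fits in a translate of the box, so does every sample, and the set is always accepted. A set that is far from clusterable is rejected only when the sample happens to contain witnesses, which is where the sample-size analysis matters.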

preprint · 2011 · arXiv

Computing Bits of Algebraic Numbers

We initiate the complexity theoretic study of the problem of computing the bits of (real) algebraic numbers. This extends the work of Yap on computing the bits of transcendental numbers like π, in Logspace. Our main result is that computing a bit of a fixed real algebraic number is in $C_{=}NC^1 \subseteq$ Logspace when the bit position has a verbose (unary) representation and in the counting hierarchy when it has a succinct (binary) representation. Our tools are drawn from elementary analysis and numerical analysis, and include the Newton-Raphson method. The proof of our main result is entirely elementary, preferring to use the elementary Liouville's theorem over the much deeper Roth's theorem for algebraic numbers. We leave the possibility of proving non-trivial lower bounds for the problem of computing the bits of an algebraic number given the bit position in binary, as our main open question. In this direction we show very limited progress by proving a lower bound for rationals.
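To make the problem statement concrete, the sketch below computes a chosen bit of the algebraic number $\sqrt{2}$ using Newton-Raphson iteration on integers. It illustrates the problem only and says nothing about the paper's complexity bounds, since it uses arbitrary-precision arithmetic rather than a small-space computation. The function names are hypothetical.

```python
def integer_sqrt(n):
    """floor(sqrt(n)) for n >= 0, by Newton-Raphson on integers."""
    if n < 2:
        return n
    # Start from a power of two that is at least sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) // 2  # Newton step for x^2 - n = 0
        if y >= x:
            return x
        x = y


def sqrt2_bit(position):
    """Bit of sqrt(2) at `position` places after the binary point.

    floor(sqrt(2) * 2^position) equals integer_sqrt(2 * 4^position),
    and its lowest bit is the requested bit. Position 0 is the integer part.
    """
    return integer_sqrt(2 << (2 * position)) & 1
```

Calling `sqrt2_bit(i)` for `i` from 1 to 8 yields 0, 1, 1, 0, 1, 0, 1, 0, matching the binary expansion $\sqrt{2} = 1.01101010\ldots_2$.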

8 published item(s)

Improving \textit{Tug-of-War} sketch using Control-Variates method

One-pass additive-error subset selection for $\ell_{p}$ subspace approximation

Subspace approximation with outliers

Frequent-Itemset Mining using Locality-Sensitive Hashing

Similarity preserving compressions of high dimensional sparse data

Testing Uniformity of Stationary Distribution

Helly-Type Theorems in Property Testing

Computing Bits of Algebraic Numbers