Source author record

Rasmus Pagh

Rasmus Pagh appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Databases Cryptography and Security Information Retrieval Computation Computational Geometry Discrete Mathematics Distributed, Parallel, and Cluster Computing Information Theory math.CO math.IT math.NA Social and Information Networks

Catalog footprint

What is connected

41works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2024arXiv

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time and \emph{without sacrificing space} for a large number of data stream algorithms, such as $F_p$ estimation in the parameter regimes $p > 2$ and $0 < p < 2$ and CountSketch with tight estimation guarantees as analyzed by Minton and Price (SODA 2014) which assumed access to a random oracle. We also show a recent analysis of Private CountSketch can be derandomized using our techniques. For a $d$-dimensional vector $x$ being updated in a turnstile stream, we show that $\|x\|_{\infty}$ can be estimated up to an additive error of $\varepsilon\|x\|_{2}$ using $O(\varepsilon^{-2}\log(1/\varepsilon)\log d)$ bits of space. Additionally, the update time of this algorithm is $O(\log 1/\varepsilon)$ in the Word RAM model. We show that the space complexity of this algorithm is optimal up to constant factors. However, for vectors $x$ with $\|x\|_{\infty} = Θ(\|x\|_{2})$, we show that the lower bound can be broken by giving an algorithm that uses $O(\varepsilon^{-2}\log d)$ bits of space which approximates $\|x\|_{\infty}$ up to an additive error of $\varepsilon\|x\|_{2}$. We use our aforementioned derandomization of the CountSketch data structure to obtain this algorithm, and using the time-space trade off of HashPRG, we show that the update time of this algorithm is also $O(\log 1/\varepsilon)$ in the Word RAM model.

preprint2022arXiv

DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search

Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for kernel density estimation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yield no gains in performance.

preprint2022arXiv

HyperLogLogLog: Cardinality Estimation With One Log More

We present HyperLogLogLog, a practical compression of the HyperLogLog sketch that compresses the sketch from $O(m\log\log n)$ bits down to $m \log_2\log_2\log_2 m + O(m+\log\log n)$ bits for estimating the number of distinct elements~$n$ using $m$~registers. The algorithm works as a drop-in replacement that preserves all estimation properties of the HyperLogLog sketch, it is possible to convert back and forth between the compressed and uncompressed representations, and the compressed sketch maintains mergeability in the compressed domain. The compressed sketch can be updated in amortized constant time, assuming $n$ is sufficiently larger than $m$. We provide a C++ implementation of the sketch, and show by experimental evaluation against well-known implementations by Google and Apache that our implementation provides small sketches while maintaining competitive update and merge times. Concretely, we observed approximately a 40% reduction in the sketch size. Furthermore, we obtain as a corollary a theoretical algorithm that compresses the sketch down to $m\log_2\log_2\log_2\log_2 m+O(m\log\log\log m/\log\log m+\log\log n)$ bits.

preprint2022arXiv

Infinitely Divisible Noise in the Low Privacy Regime

Federated learning, in which training data is distributed among users and never shared, has emerged as a popular approach to privacy-preserving machine learning. Cryptographic techniques such as secure aggregation are used to aggregate contributions, like a model update, from all users. A robust technique for making such aggregates differentially private is to exploit infinite divisibility of the Laplace distribution, namely, that a Laplace distribution can be expressed as a sum of i.i.d. noise shares from a Gamma distribution, one share added by each user. However, Laplace noise is known to have suboptimal error in the low privacy regime for $\varepsilon$-differential privacy, where $\varepsilon > 1$ is a large constant. In this paper we present the first infinitely divisible noise distribution for real-valued data that achieves $\varepsilon$-differential privacy and has expected error that decreases exponentially with $\varepsilon$.

preprint2021arXiv

Advances and Open Problems in Federated Learning

Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

preprint2021arXiv

CountSketches, Feature Hashing and the Median of Three

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that even for $t=1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of Count-Sketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible. We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffice in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest.

preprint2021arXiv

Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets.

preprint2020arXiv

Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r>0$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.

preprint2020arXiv

On the I/O complexity of the k-nearest neighbor problem

We consider static, external memory indexes for exact and approximate versions of the $k$-nearest neighbor ($k$-NN) problem, and show new lower bounds under a standard indivisibility assumption: - Polynomial space indexing schemes for high-dimensional $k$-NN in Hamming space cannot take advantage of block transfers: $Ω(k)$ block reads are needed to to answer a query. - For the $\ell_\infty$ metric the lower bound holds even if we allow $c$-appoximate nearest neighbors to be returned, for $c \in (1, 3)$. - The restriction to $c < 3$ is necessary: For every metric there exists an indexing scheme in the indexability model of Hellerstein et al.~using space $O(kn)$, where $n$ is the number of points, that can retrieve $k$ 3-approximate nearest neighbors using $\lceil k/B\rceil$ I/Os, which is optimal. - For specific metrics, data structures with better approximation factors are possible. For $k$-NN in Hamming space and every approximation factor $c>1$ there exists a polynomial space data structure that returns $k$ $c$-approximate nearest neighbors in $\lceil k/B\rceil$ I/Os. To show these lower bounds we develop two new techniques: First, to handle that approximation algorithms have more freedom in deciding which result set to return we develop a relaxed version of the $λ$-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in $d\geq n$ dimensions. To extend the lower bounds down to $d = O(k \log(n/k))$ dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest.

preprint2020arXiv

On the Power of Multiple Anonymous Messages

An exciting new development in differential privacy is the shuffled model, in which an anonymous channel enables non-interactive, differentially private protocols with error much smaller than what is possible in the local model, while relying on weaker trust assumptions than in the central model. In this paper, we study basic counting problems in the shuffled model and establish separations between the error that can be achieved in the single-message shuffled model and in the shuffled model with multiple messages per user. For the problem of frequency estimation for $n$ users and a domain of size $B$, we obtain: - A nearly tight lower bound of $\tildeΩ( \min(\sqrt[4]{n}, \sqrt{B}))$ on the error in the single-message shuffled model. This implies that the protocols obtained from the amplification via shuffling work of Erlingsson et al. (SODA 2019) and Balle et al. (Crypto 2019) are essentially optimal for single-message protocols. A key ingredient in the proof is a lower bound on the error of locally-private frequency estimation in the low-privacy (aka high $ε$) regime. - Protocols in the multi-message shuffled model with $poly(\log{B}, \log{n})$ bits of communication per user and $poly\log{B}$ error, which provide an exponential improvement on the error compared to what is possible with single-message algorithms. For the related selection problem on a domain of size $B$, we prove: - A nearly tight lower bound of $Ω(B)$ on the number of users in the single-message shuffled model. This significantly improves on the $Ω(B^{1/17})$ lower bound obtained by Cheu et al. (Eurocrypt 2019), and when combined with their $\tilde{O}(\sqrt{B})$-error multi-message protocol, implies the first separation between single-message and multi-message protocols for this problem.

preprint2020arXiv

Pure Differentially Private Summation from Anonymous Messages

The shuffled (aka anonymous) model has recently generated significant interest as a candidate distributed privacy framework with trust assumptions better than the central model but with achievable errors smaller than the local model. We study pure differentially private (DP) protocols in the shuffled model for summation, a basic and widely used primitive: - For binary summation where each of n users holds a bit as an input, we give a pure $ε$-DP protocol for estimating the number of ones held by the users up to an error of $O_ε(1)$, and each user sends $O_ε(\log n)$ messages each of 1 bit. This is the first pure protocol in the shuffled model with error $o(\sqrt{n})$ for constant $ε$. Using this protocol, we give a pure $ε$-DP protocol that performs summation of real numbers in $[0, 1]$ up to an error of $O_ε(1)$, and where each user sends $O_ε(\log^3 n)$ messages each of $O(\log\log n)$ bits. - In contrast, we show that for any pure $ε$-DP protocol for binary summation in the shuffled model having absolute error $n^{0.5-Ω(1)}$, the per user communication has to be at least $Ω_ε(\sqrt{\log n})$ bits. This implies the first separation between the (bounded-communication) multi-message shuffled model and the central model, and the first separation between pure and approximate DP protocols in the shuffled model. To prove our lower bound, we consider (a generalization of) the following question: given $γ$ in $(0, 1)$, what is the smallest m for which there are two random variables $X^0, X^1$ supported on $\{0, \dots ,m\}$ such that (i) the total variation distance between $X^0$ and $X^1$ is at least $1-γ$, and (ii) the moment generating functions of $X^0$ and $X^1$ are within a constant factor of each other everywhere? We show that the answer is $m = Θ(\sqrt{\log(1/γ)})$.

preprint2020arXiv

The space complexity of inner product filters

Motivated by the problem of filtering candidate pairs in inner product similarity joins we study the following inner product estimation problem: Given parameters $d\in {\bf N}$, $α>β\geq 0$ and unit vectors $x,y\in {\bf R}^{d}$ consider the task of distinguishing between the cases $\langle x, y\rangle\leqβ$ and $\langle x, y\rangle\geq α$ where $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$ is the inner product of vectors $x$ and $y$. The goal is to distinguish these cases based on information on each vector encoded independently in a bit string of the shortest length possible. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general via estimating $\langle x, y\rangle$ with an additive error bounded by $\varepsilon = α- β$. We show that $d \log_2 \left(\tfrac{\sqrt{1-β}}{\varepsilon}\right) \pm Θ(d)$ bits of information about each vector is necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $β$ is close to $1$. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed.

preprint2020arXiv

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

preprint2016arXiv

Approximate Furthest Neighbor with Application to Annulus Query

Much recent work has been devoted to approximate nearest neighbor queries. Motivated by applications in recommender systems, we consider approximate furthest neighbor (AFN) queries and present a simple, fast, and highly practical data structure for answering AFN queries in high- dimensional Euclidean space. The method builds on the technique of In- dyk (SODA 2003), storing random projections to provide sublinear query time for AFN. However, we introduce a different query algorithm, improving on Indyk's approximation factor and reducing the running time by a logarithmic factor. We also present a variation based on a query- independent ordering of the database points; while this does not have the provable approximation factor of the query-dependent data structure, it offers significant improvement in time and space complexity. We give a theoretical analysis, and experimental results. As an application, the query-dependent approach is used for deriving a data structure for the approximate annulus query problem, which is defined as follows: given an input set S and two parameters r > 0 and w >= 1, construct a data structure that returns for each query point q a point p in S such that the distance between p and q is at least r/w and at most wr.

preprint2016arXiv

CoveringLSH: Locality-sensitive Hashing without False Negatives

We consider a new construction of locality-sensitive hash functions for Hamming space that is \emph{covering} in the sense that is it guaranteed to produce a collision for every pair of vectors within a given radius $r$. The construction is \emph{efficient} in the sense that the expected number of hash collisions between vectors at distance~$cr$, for a given $c>1$, comes close to that of the best possible data independent LSH without the covering guarantee, namely, the seminal LSH construction of Indyk and Motwani (STOC '98). The efficiency of the new construction essentially \emph{matches} their bound when the search radius is not too large --- e.g., when $cr = o(\log(n)/\log\log n)$, where $n$ is the number of points in the data set, and when $cr = \log(n)/k$ where $k$ is an integer constant. In general, it differs by at most a factor $\ln(4)$ in the exponent of the time bounds. As a consequence, LSH-based similarity search in Hamming space can avoid the problem of false negatives at little or no cost in efficiency.

preprint2016arXiv

Distance Sensitive Bloom Filters Without False Negatives

A Bloom filter is a widely used data-structure for representing a set $S$ and answering queries of the form "Is $x$ in $S$?". By allowing some false positive answers (saying "yes" when the answer is in fact `no') Bloom filters use space significantly below what is required for storing $S$. In the distance sensitive setting we work with a set $S$ of (Hamming) vectors and seek a data structure that offers a similar trade-off, but answers queries of the form "Is $x$ close to an element of $S$?" (in Hamming distance). Previous work on distance sensitive Bloom filters have accepted false positive and false negative answers. Absence of false negatives is of critical importance in many applications of Bloom filters, so it is natural to ask if this can be also achieved in the distance sensitive setting. Our main contributions are upper and lower bounds (that are tight in several cases) for space usage in the distance sensitive setting where false negatives are not allowed.

preprint2016arXiv

Efficiently Correcting Matrix Products

We study the problem of efficiently correcting an erroneous product of two $n\times n$ matrices over a ring. Among other things, we provide a randomized algorithm for correcting a matrix product with at most $k$ erroneous entries running in $\tilde{O}(n^2+kn)$ time and a deterministic $\tilde{O}(kn^2)$-time algorithm for this problem (where the notation $\tilde{O}$ suppresses polylogarithmic terms in $n$ and $k$).

preprint2016arXiv

On the Complexity of Inner Product Similarity Join

A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are: * Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis. * New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements. * A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight. Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing among signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.

preprint2016arXiv

Scalability and Total Recall with Fast CoveringLSH

Locality-sensitive hashing (LSH) has emerged as the dominant algorithmic technique for similarity search with strong performance guarantees in high-dimensional spaces. A drawback of traditional LSH schemes is that they may have \emph{false negatives}, i.e., the recall is less than 100\%. This limits the applicability of LSH in settings requiring precise performance guarantees. Building on the recent theoretical "CoveringLSH" construction that eliminates false negatives, we propose a fast and practical covering LSH scheme for Hamming space called \emph{Fast CoveringLSH (fcLSH)}. Inheriting the design benefits of CoveringLSH our method avoids false negatives and always reports all near neighbors. Compared to CoveringLSH we achieve an asymptotic improvement to the hash function computation time from $\mathcal{O}(dL)$ to $\mathcal{O}(d + L\log{L})$, where $d$ is the dimensionality of data and $L$ is the number of hash tables. Our experiments on synthetic and real-world data sets demonstrate that \emph{fcLSH} is comparable (and often superior) to traditional hashing-based approaches for search radius up to 20 in high-dimensional Hamming space.

preprint2015arXiv

Association Rule Mining using Maximum Entropy

Recommendations based on behavioral data may be faced with ambiguous statistical evidence. We consider the case of association rules, relevant e.g.~for query and product recommendations. For example: Suppose that a customer belongs to categories A and B, each of which is known to have positive correlation with buying product C, how do we estimate the probability that she will buy product C? For rare terms or products there may not be enough data to directly produce such an estimate --- perhaps we never directly observed a connection between A, B, and C. What can we do when there is no support for estimating the probability by simply computing the observed frequency? In particular, what is the right thing to do when A and B give rise to very different estimates of the probability of C? We consider the use of maximum entropy probability estimates, which give a principled way of extrapolating probabilities of events that do not even occur in the data set! Focusing on the basic case of three variables, our main technical contributions are that (under mild assumptions): 1) There exists a simple, explicit formula that gives a good approximation of maximum entropy estimates, and 2) Maximum entropy estimates based on a small number of samples are provably tightly concentrated around the true maximum entropy frequency that arises if we let the number of samples go to infinity. Our empirical work demonstrates the surprising precision of maximum entropy estimates, across a range of real-life transaction data sets. In particular we observe the average absolute error on maximum entropy estimates is a factor $3$--$14$ less compared to using independence or extrapolation estimates, when the data used to make the estimates has low support. We believe that the same principle can be used to synthesize probability estimates in many settings.

preprint2015arXiv

From Independence to Expansion and Back Again

We consider the following fundamental problems: (1) Constructing $k$-independent hash functions with a space-time tradeoff close to Siegel's lower bound. (2) Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one of them leads to a good solution to the other one. In this paper we exploit this connection to present efficient, recursive constructions of $k$-independent hash functions (and hence expanders with a small representation). While the previously most efficient construction (Thorup, FOCS 2013) needed time quasipolynomial in Siegel's lower bound, our time bound is just a logarithmic factor from the lower bound.

preprint2015arXiv

Simple Multi-Party Set Reconciliation

As users migrate information to cloud storage, many distributed cloud-based services use multiple loosely consistent replicas of user information to avoid the high overhead of more tightly coupled synchronization. Periodically, the information must be synchronized, or reconciled. One can place this problem in the theoretical framework of {\em set reconciliation}: two parties $A_1$ and $A_2$ each hold a set of keys, named $S_1$ and $S_2$ respectively, and the goal is for both parties to obtain $S_1 \cup S_2$. Typically, set reconciliation is interesting algorithmically when sets are large but the set difference $|S_1-S_2|+|S_2-S_1|$ is small. In this setting the focus is on accomplishing reconciliation efficiently in terms of communication; ideally, the communication should depend on the size of the set difference, and not on the size of the sets. In this paper, we extend recent approaches using Invertible Bloom Lookup Tables (IBLTs) for set reconciliation to the multi-party setting. In this setting there are three or more parties $A_1,A_2,\ldots,A_n$ holding sets of keys $S_1,S_2,\ldots,S_n$ respectively, and the goal is for all parties to obtain $\cup_i S_i$. This could of course be done by pairwise reconciliations, but we seek more effective methods. Our methodology uses network coding techniques in conjunction with IBLTs, allowing efficiency in network utilization along with efficiency obtained by passing messages of size $O(|\cup_i S_i - \cap_i S_i|)$. Further, our approach can function even if the number of parties is not exactly known in advance, and in many cases can be used to determine which parties contain keys not in the joint union. By connecting reconciliation with network coding, we can allow for substantially more efficient reconciliation methods that apply to a number of natural distributed computing problems.

preprint2015arXiv

Triangle counting in dynamic graph streams

Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied, depending on whether the graph edges are provided in an arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered {\em insert-only} streams. We present a new algorithm estimating the number of triangles in {\em dynamic} graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge. The result is achieved by a novel approach combining sampling of vertex triples and sparsification of the input graph. In the course of the analysis of the algorithm we present a lower bound on the number of pairwise independent 2-paths in general graphs which might be of independent interest. At the end of the paper we discuss lower bounds on the space complexity of triangle counting algorithms that make no assumptions on the structure of the graph.

preprint2014arXiv

Approximate Range Emptiness in Constant Time and Optimal Space

This paper studies the \emph{$\varepsilon$-approximate range emptiness} problem, where the task is to represent a set $S$ of $n$ points from $\{0,\ldots,U-1\}$ and answer emptiness queries of the form "$[a ; b]\cap S \neq \emptyset$ ?" with a probability of \emph{false positives} allowed. This generalizes the functionality of \emph{Bloom filters} from single point queries to any interval length $L$. Setting the false positive rate to $\varepsilon/L$ and performing $L$ queries, Bloom filters yield a solution to this problem with space $O(n \lg(L/\varepsilon))$ bits, false positive probability bounded by $\varepsilon$ for intervals of length up to $L$, using query time $O(L \lg(L/\varepsilon))$. Our first contribution is to show that the space/error trade-off cannot be improved asymptotically: Any data structure for answering approximate range emptiness queries on intervals of length up to $L$ with false positive probability $\varepsilon$, must use space $Ω(n \lg(L/\varepsilon)) - O(n)$ bits. On the positive side we show that the query time can be improved greatly, to constant time, while matching our space lower bound up to a lower order additive term. This result is achieved through a succinct data structure for (non-approximate 1d) range emptiness/reporting queries, which may be of independent interest.

preprint2014arXiv

Consistent Subset Sampling

Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $\mathcal{I}\subseteq U$ it should be possible to quickly compute $\mathcal{I}\cap S$, i.e., the elements in $\mathcal{I}$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $Θ(b^k)$. Using a carefully designed hash function, for a given sampling probability $p \in (0,1]$, we show how to improve the time complexity to $Θ(b^{\lceil k/2\rceil}\log \log b + pb^k)$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $Θ(b^{\lceil k/4\rceil})$. We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent $k$-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.

preprint2014arXiv

Generating k-independent variables in constant time

The generation of pseudorandom elements over finite fields is fundamental to the time, space and randomness complexity of randomized algorithms and data structures. We consider the problem of generating $k$-independent random values over a finite field $\mathbb{F}$ in a word RAM model equipped with constant time addition and multiplication in $\mathbb{F}$, and present the first nontrivial construction of a generator that outputs each value in constant time, not dependent on $k$. Our generator has period length $|\mathbb{F}|\,\mbox{poly} \log k$ and uses $k\,\mbox{poly}(\log k) \log |\mathbb{F}|$ bits of space, which is optimal up to a $\mbox{poly} \log k$ factor. We are able to bypass Siegel's lower bound on the time-space tradeoff for $k$-independent functions by a restriction to sequential evaluation.

preprint2014arXiv

The Input/Output Complexity of Sparse Matrix Multiplication

We consider the problem of multiplying sparse matrices (over a semiring) where the number of non-zero entries is larger than main memory. In the classical paper of Hong and Kung (STOC '81) it was shown that to compute a product of dense $U \times U$ matrices, $Θ\left(U^3 / (B \sqrt{M}) \right)$ I/Os are necessary and sufficient in the I/O model with internal memory size $M$ and memory block size $B$. In this paper we generalize the upper and lower bounds of Hong and Kung to the sparse case. Our bounds depend of the number $N = \mathtt{nnz}(A)+\mathtt{nnz}(C)$ of nonzero entries in $A$ and $C$, as well as the number $Z = \mathtt{nnz}(AC)$ of nonzero entries in $AC$. We show that $AC$ can be computed using $\tilde{O} \left(\tfrac{N}{B} \min\left(\sqrt{\tfrac{Z}{M}},\tfrac{N}{M}\right) \right)$ I/Os, with high probability. This is tight (up to polylogarithmic factors) when only semiring operations are allowed, even for dense rectangular matrices: We show a lower bound of $Ω\left(\tfrac{N}{B} \min\left(\sqrt{\tfrac{Z}{M}},\tfrac{N}{M}\right) \right)$ I/Os. While our lower bound uses fairly standard techniques, the upper bound makes use of ``compressed matrix multiplication'' sketches, which is new in the context of I/O-efficient algorithms, and a new matrix product size estimation technique that avoids the ``no cancellation'' assumption.

preprint2014arXiv

The Input/Output Complexity of Triangle Enumeration

We consider the well-known problem of enumerating all triangles of an undirected graph. Our focus is on determining the input/output (I/O) complexity of this problem. Let $E$ be the number of edges, $M<E$ the size of internal memory, and $B$ the block size. The best results obtained previously are sort$(E^{3/2})$ I/Os (Dementiev, PhD thesis 2006) and $O(E^2/(MB))$ I/Os (Hu et al., SIGMOD 2013), where sort$(n)$ denotes the number of I/Os for sorting $n$ items. We improve the I/O complexity to $O(E^{3/2}/(\sqrt{M} B))$ expected I/Os, which improves the previous bounds by a factor $\min(\sqrt{E/M},\sqrt{M})$. Our algorithm is cache-oblivious and also I/O optimal: We show that any algorithm enumerating $t$ distinct triangles must always use $Ω(t/(\sqrt{M} B))$ I/Os, and there are graphs for which $t=Ω(E^{3/2})$. Finally, we give a deterministic cache-aware algorithm using $O(E^{3/2}/(\sqrt{M} B))$ I/Os assuming $M\geq E^\varepsilon$ for a constant $\varepsilon > 0$. Our results are based on a new color coding technique, which may be of independent interest.

preprint2013arXiv

How to Approximate A Set Without Knowing Its Size In Advance

The dynamic approximate membership problem asks to represent a set S of size n, whose elements are provided in an on-line fashion, supporting membership queries without false negatives and with a false positive rate at most epsilon. That is, the membership algorithm must be correct on each x in S, and may err with probability at most epsilon on each x not in S. We study a well-motivated, yet insufficiently explored, variant of this problem where the size n of the set is not known in advance. Existing optimal approximate membership data structures require that the size is known in advance, but in many practical scenarios this is not a realistic assumption. Moreover, even if the eventual size n of the set is known in advance, it is desirable to have the smallest possible space usage also when the current number of inserted elements is smaller than n. Our contribution consists of the following results: - We show a super-linear gap between the space complexity when the size is known in advance and the space complexity when the size is not known in advance. - We show that our space lower bound is tight, and can even be matched by a highly efficient data structure.

preprint2012arXiv

On Parallelizing Matrix Multiplication by the Column-Row Method

We consider the problem of sparse matrix multiplication by the column row method in a distributed setting where the matrix product is not necessarily sparse. We present a surprisingly simple method for "consistent" parallel processing of sparse outer products (column-row vector products) over several processors, in a communication-avoiding setting where each processor has a copy of the input. The method is consistent in the sense that a given output entry is always assigned to the same processor independently of the specific structure of the outer product. We show guarantees on the work done by each processor, and achieve linear speedup down to the point where the cost is dominated by reading the input. Our method gives a way of distributing (or parallelizing) matrix product computations in settings where the main bottlenecks are storing the result matrix, and inter-processor communication. Motivated by observations on real data that often the absolute values of the entries in the product adhere to a power law, we combine our approach with frequent items mining algorithms and show how to obtain a tight approximation of the weight of the heaviest entries in the product matrix. As a case study we present the application of our approach to frequent pair mining in transactional data streams, a problem that can be phrased in terms of sparse ${0,1}$-integer matrix multiplication by the column-row method. Experimental evaluation of the proposed method on real-life data supports the theoretical findings.

preprint2012arXiv

Thresholds for Extreme Orientability

Multiple-choice load balancing has been a topic of intense study since the seminal paper of Azar, Broder, Karlin, and Upfal. Questions in this area can be phrased in terms of orientations of a graph, or more generally a k-uniform random hypergraph. A (d,b)-orientation is an assignment of each edge to d of its vertices, such that no vertex has more than b edges assigned to it. Conditions for the existence of such orientations have been completely documented except for the "extreme" case of (k-1,1)-orientations. We consider this remaining case, and establish: - The density threshold below which an orientation exists with high probability, and above which it does not exist with high probability. - An algorithm for finding an orientation that runs in linear time with high probability, with explicit polynomial bounds on the failure probability. Previously, the only known algorithms for constructing (k-1,1)-orientations worked for k<=3, and were only shown to have expected linear running time.

preprint2011arXiv

A New Data Layout For Set Intersection on GPUs

Set intersection is the core in a variety of problems, e.g. frequent itemset mining and sparse boolean matrix multiplication. It is well-known that large speed gains can, for some computational problems, be obtained by using a graphics processing unit (GPU) as a massively parallel computing device. However, GPUs require highly regular control flow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, "BatMap", that is particularly well suited for parallel processing, and is compact even for sparse data. Frequent itemset mining is one of the most important applications of set intersection. As a case-study on the potential of BatMaps we focus on frequent pair mining, which is a core special case of frequent itemset mining. The main finding is that our method is able to achieve speedups over both Apriori and FP-growth when the number of distinct items is large, and the density of the problem instance is above 1%. Previous implementations of frequent itemset mining on GPU have not been able to show speedups over the best single-threaded implementations.

preprint2011arXiv

Better size estimation for sparse matrix products

We consider the problem of doing fast and reliable estimation of the number of non-zero entries in a sparse boolean matrix product. This problem has applications in databases and computer algebra. Let n denote the total number of non-zero entries in the input matrices. We show how to compute a 1 +- epsilon approximation (with small probability of error) in expected time O(n) for any epsilon > 4/\sqrt[4]{z}. The previously best estimation algorithm, due to Cohen (JCSS 1997), uses time O(n/epsilon^2). We also present a variant using O(sort(n)) I/Os in expectation in the cache-oblivious model. In contrast to these results, the currently best algorithms for computing a sparse boolean matrix product use time omega(n^{4/3}) (resp. omega(n^{4/3}/B) I/Os), even if the result matrix has only z=O(n) nonzero entries. Our algorithm combines the size estimation technique of Bar-Yossef et al. (RANDOM 2002) with a particular class of pairwise independent hash functions that allows the sketch of a set of the form A x C to be computed in expected time O(|A|+|C|) and O(sort(|A|+|C|)) I/Os. We then describe how sampling can be used to maintain (independent) sketches of matrices that allow estimation to be performed in time o(n) if z is sufficiently large. This gives a simpler alternative to the sketching technique of Ganguly et al. (PODS 2005), and matches a space lower bound shown in that paper. Finally, we present experiments on real-world data sets that show the accuracy of both our methods to be significantly better than the worst-case analysis predicts.

preprint2011arXiv

Colorful Triangle Counting and a MapReduce Implementation

In this note we introduce a new randomized algorithm for counting triangles in graphs. We show that under mild conditions, the estimate of our algorithm is strongly concentrated around the true number of triangles. Specifically, if $p \geq \max{(\frac{Δ\log{n}}{t}, \frac{\log{n}}{\sqrt{t}})}$, where $n$, $t$, $Δ$ denote the number of vertices in $G$, the number of triangles in $G$, the maximum number of triangles an edge of $G$ is contained, then for any constant $ε>0$ our unbiased estimate $T$ is concentrated around its expectation, i.e., $ \Prob{|T - \Mean{T}| \geq ε\Mean{T}} = o(1)$. Finally, we present a \textsc{MapReduce} implementation of our algorithm.

preprint2011arXiv

Compressed Matrix Multiplication

Motivated by the problems of computing sample covariance matrices, and of transforming a collection of vectors to a basis where they are sparse, we present a simple algorithm that computes an approximation of the product of two n-by-n real matrices A and B. Let ||AB||_F denote the Frobenius norm of AB, and b be a parameter determining the time/accuracy trade-off. Given 2-wise independent hash functions $_1,h_2: [n] -> [b], and s_1,s_2: [n] -> {-1,+1} the algorithm works by first "compressing" the matrix product into the polynomial p(x) = sum_{k=1}^n (sum_{i=1}^n A_{ik} s_1(i) x^{h_1(i)}) (sum_{j=1}^n B_{kj} s_2(j) x^{h_2(j)}) Using FFT for polynomial multiplication, we can compute c_0,...,c_{b-1} such that sum_i c_i x^i = (p(x) mod x^b) + (p(x) div x^b) in time Õ(n^2+ n b). An unbiased estimator of (AB)_{ij} with variance at most ||AB||_F^2 / b can then be computed as: C_{ij} = s_1(i) s_2(j) c_{(h_1(i)+h_2(j)) mod b. Our approach also leads to an algorithm for computing AB exactly, whp., in time Õ(N + nb) in the case where A and B have at most N nonzero entries, and AB has at most b nonzero entries. Also, we use error-correcting codes in a novel way to recover significant entries of AB in near-linear time.

preprint2011arXiv

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

Motivated by information retrieval applications, we consider the one-dimensional colored range reporting problem in rank space. The goal is to build a static data structure for sets C_1,...,C_m \subseteq {1,...,sigma} that supports queries of the kind: Given indices a,b, report the set Union_{a <= i <= b} C_i. We study the problem in the I/O model, and show that there exists an optimal linear-space data structure that answers queries in O(1+k/B) I/Os, where k denotes the output size and B the disk block size in words. In fact, we obtain the same bound for the harder problem of three-sided orthogonal range reporting. In this problem, we are to preprocess a set of n two-dimensional points in rank space, such that all points inside a query rectangle of the form [x_1,x_2] x (-infinity,y] can be reported. The best previous bounds for this problem is either O(n lg^2_B n) space and O(1+k/B) query I/Os, or O(n) space and O(lg^(h)_B n +k/B) query I/Os, where lg^(h)_B n is the base B logarithm iterated h times, for any constant integer h. The previous bounds are both achieved under the indivisibility assumption, while our solution exploits the full capabilities of the underlying machine. Breaking the indivisibility assumption thus provides us with cleaner and optimal bounds. Our results also imply an optimal solution to the following colored prefix reporting problem. Given a set S of strings, each O(1) disk blocks in length, and a function c: S -> 2^{1,...,sigma}, support queries of the kind: Given a string p, report the set Union_{x in S intersection p*} c(x), where p* denotes the set of strings with prefix p. Finally, we consider the possibility of top-k extensions of this result, and present a simple solution in a model that allows non-blocked I/O.

preprint2010arXiv

Finding Associations and Computing Similarity via Biased Pair Sampling

This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to find high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for "interesting similarity/confidence" is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed.

preprint2010arXiv

On Finding Frequent Patterns in Directed Acyclic Graphs

Given a directed acyclic graph with labeled vertices, we consider the problem of finding the most common label sequences ("traces") among all paths in the graph (of some maximum length m). Since the number of paths can be huge, we propose novel algorithms whose time complexity depends only on the size of the graph, and on the relative frequency epsilon of the most frequent traces. In addition, we apply techniques from streaming algorithms to achieve space usage that depends only on epsilon, and not on the number of distinct traces. The abstract problem considered models a variety of tasks concerning finding frequent patterns in event sequences. Our motivation comes from working with a data set of 2 million RFID readings from baggage trolleys at Copenhagen Airport. The question of finding frequent passenger movement patterns is mapped to the above problem. We report on experimental findings for this data set.

preprint2010arXiv

On Finding Frequent Patterns in Event Sequences

Given a directed acyclic graph with labeled vertices, we consider the problem of finding the most common label sequences ("traces") among all paths in the graph (of some maximum length m). Since the number of paths can be huge, we propose novel algorithms whose time complexity depends only on the size of the graph, and on the frequency epsilon of the most frequent traces. In addition, we apply techniques from streaming algorithms to achieve space usage that depends only on epsilon, and not on the number of distinct traces. The abstract problem considered models a variety of tasks concerning finding frequent patterns in event sequences. Our motivation comes from working with a data set of 2 million RFID readings from baggage trolleys at Copenhagen Airport. The question of finding frequent passenger movement patterns is mapped to the above problem. We report on experimental findings for this data set.

preprint2010arXiv

On Finding Similar Items in a Stream of Transactions

While there has been a lot of work on finding frequent itemsets in transaction data streams, none of these solve the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: Any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space Omega(min{mb,n^k,(mb/phi)^k}) bits, where mb is the number of items in the stream so far, n is the number of distinct items and phi is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy improves with the length of the stream for any fixed support threshold.

preprint2010arXiv

Tight Thresholds for Cuckoo Hashing via XORSAT

We settle the question of tight thresholds for offline cuckoo hashing. The problem can be stated as follows: we have n keys to be hashed into m buckets each capable of holding a single key. Each key has k >= 3 (distinct) associated buckets chosen uniformly at random and independently of the choices of other keys. A hash table can be constructed successfully if each key can be placed into one of its buckets. We seek thresholds alpha_k such that, as n goes to infinity, if n/m <= alpha for some alpha < alpha_k then a hash table can be constructed successfully with high probability, and if n/m >= alpha for some alpha > alpha_k a hash table cannot be constructed successfully with high probability. Here we are considering the offline version of the problem, where all keys and hash values are given, so the problem is equivalent to previous models of multiple-choice hashing. We find the thresholds for all values of k > 2 by showing that they are in fact the same as the previously known thresholds for the random k-XORSAT problem. We then extend these results to the setting where keys can have differing number of choices, and provide evidence in the form of an algorithm for a conjecture extending this result to cuckoo hash tables that store multiple keys in a bucket.

Rasmus Pagh

What is connected

Connect this record

See the researcher in context

Building this map preview

41 published item(s)

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search

HyperLogLogLog: Cardinality Estimation With One Log More

Infinitely Divisible Noise in the Low Privacy Regime

Advances and Open Problems in Federated Learning

CountSketches, Feature Hashing and the Median of Three

Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?

Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

On the I/O complexity of the k-nearest neighbor problem

On the Power of Multiple Anonymous Messages

Pure Differentially Private Summation from Anonymous Messages

The space complexity of inner product filters

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Approximate Furthest Neighbor with Application to Annulus Query

CoveringLSH: Locality-sensitive Hashing without False Negatives

Distance Sensitive Bloom Filters Without False Negatives

Efficiently Correcting Matrix Products

On the Complexity of Inner Product Similarity Join

Scalability and Total Recall with Fast CoveringLSH

Association Rule Mining using Maximum Entropy

From Independence to Expansion and Back Again

Simple Multi-Party Set Reconciliation

Triangle counting in dynamic graph streams

Approximate Range Emptiness in Constant Time and Optimal Space

Consistent Subset Sampling

Generating k-independent variables in constant time

The Input/Output Complexity of Sparse Matrix Multiplication

The Input/Output Complexity of Triangle Enumeration

How to Approximate A Set Without Knowing Its Size In Advance

On Parallelizing Matrix Multiplication by the Column-Row Method

Thresholds for Extreme Orientability

A New Data Layout For Set Intersection on GPUs

Better size estimation for sparse matrix products

Colorful Triangle Counting and a MapReduce Implementation

Compressed Matrix Multiplication

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

Finding Associations and Computing Similarity via Biased Pair Sampling

On Finding Frequent Patterns in Directed Acyclic Graphs

On Finding Frequent Patterns in Event Sequences

On Finding Similar Items in a Stream of Transactions

Tight Thresholds for Cuckoo Hashing via XORSAT