Source author record

Mikkel Thorup

Mikkel Thorup appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Discrete Mathematics math.CO Computational Complexity Computational Geometry Distributed, Parallel, and Cluster Computing Machine Learning

Catalog footprint

What is connected

33works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

PageRank Centrality in Directed Graphs with Bounded In-Degree

We study the computational complexity of locally estimating a node's PageRank centrality in a directed graph $G$. For any node $t$, its PageRank centrality $π(t)$ is defined as the probability that a random walk in $G$, starting from a uniformly chosen node, terminates at $t$, where each step terminates with a constant probability $α\in(0,1)$. To obtain a multiplicative $\big(1\pm O(1)\big)$-approximation of $π(t)$ with probability $Ω(1)$, the previously best upper bound is $O(n^{1/2}\min\{ Δ_{in}^{1/2},Δ_{out}^{1/2},m^{1/4}\})$ from [Wang, Wei, Wen, Yang, STOC '24], where $n$ and $m$ denote the number of nodes and edges in $G$, and $Δ_{in}$ and $Δ_{out}$ upper bound the in-degrees and out-degrees of $G$, respectively. Using a refinement of the proof in the same paper, we establish a lower bound of $Ω(n^{1/2}\min\{Δ_{in}^{1/2}/n^γ,Δ_{out}^{1/2}/n^γ,m^{1/4}\})$, where $γ=\frac{1}{2}(2\max\{\log_{1/(1-α)}Δ_{in},1\}-1)^{-1}$. As $γ$ only depends on $Δ_{in}$ and $n^γ=O(1)$ for $Δ_{in}=Ω\left(n^{Ω(1)}\right)$, the known upper bound is tight if we only parameterize the complexity by $n$, $m$, and $Δ_{out}$. However, there remains a gap of $Ω(n^γ)$ when considering $Δ_{in}$, and this gap is large when $Δ_{in}$ is small. In the extreme case where $Δ_{in}\le1/(1-α)$, we have $γ=1/2$, leading to a gap of $Ω(n^{1/2})$ between the bounds $O(n^{1/2})$ and $Ω(1)$. In this paper, we present a new algorithm that achieves the above lower bound (up to logarithmic factors). The algorithm assumes that $n$ and the bounds $Δ_{in}$ and $Δ_{out}$ are known in advance. Our key technique is a novel randomized backwards propagation process that only propagates selectively based on Monte Carlo estimated PageRank scores.

preprint2024arXiv

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time and \emph{without sacrificing space} for a large number of data stream algorithms, such as $F_p$ estimation in the parameter regimes $p > 2$ and $0 < p < 2$ and CountSketch with tight estimation guarantees as analyzed by Minton and Price (SODA 2014) which assumed access to a random oracle. We also show a recent analysis of Private CountSketch can be derandomized using our techniques. For a $d$-dimensional vector $x$ being updated in a turnstile stream, we show that $\|x\|_{\infty}$ can be estimated up to an additive error of $\varepsilon\|x\|_{2}$ using $O(\varepsilon^{-2}\log(1/\varepsilon)\log d)$ bits of space. Additionally, the update time of this algorithm is $O(\log 1/\varepsilon)$ in the Word RAM model. We show that the space complexity of this algorithm is optimal up to constant factors. However, for vectors $x$ with $\|x\|_{\infty} = Θ(\|x\|_{2})$, we show that the lower bound can be broken by giving an algorithm that uses $O(\varepsilon^{-2}\log d)$ bits of space which approximates $\|x\|_{\infty}$ up to an additive error of $\varepsilon\|x\|_{2}$. We use our aforementioned derandomization of the CountSketch data structure to obtain this algorithm, and using the time-space trade off of HashPRG, we show that the update time of this algorithm is also $O(\log 1/\varepsilon)$ in the Word RAM model.

preprint2022arXiv

Edge Sampling and Graph Parameter Estimation via Vertex Neighborhood Accesses

In this paper, we consider the problems from the area of sublinear-time algorithms of edge sampling, edge counting, and triangle counting. Part of our contribution is that we consider three different settings, differing in the way in which one may access the neighborhood of a given vertex. In previous work, people have considered indexed neighbor access, with a query returning the $i$-th neighbor of a given vertex. Full neighborhood access model, which has a query that returns the entire neighborhood at a unit cost, has recently been considered in the applied community. Between these, we propose hash-ordered neighbor access, inspired by coordinated sampling, where we have a global fully random hash function, and can access neighbors in order of their hash values, paying a constant for each accessed neighbor. For edge sampling and counting, our new lower bounds are in the most powerful full neighborhood access model. We provide matching upper bounds in the weaker hash-ordered neighbor access model. Our new faster algorithms can be provably implemented efficiently on massive graphs in external memory and with the current APIs for, e.g., Twitter or Wikipedia. For triangle counting, we provide a separation: a better upper bound with full neighborhood access than the known lower bounds with indexed neighbor access. The technical core of our paper is our edge-sampling algorithm on which the other results depend.

preprint2022arXiv

Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor

We consider the numerical taxonomy problem of fitting a positive distance function ${D:{S\choose 2}\rightarrow \mathbb R_{>0}}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match those in $D$. A nice application is in evolutionary biology where the tree $T$ aims to approximate the branching process leading to the observed distances in $D$ [Cavalli-Sforza and Edwards 1967]. We consider the total error, that is the sum of distance errors over all pairs of points. We present a deterministic polynomial time algorithm minimizing the total error within a constant factor. We can do this both for general trees, and for the special case of ultrametrics with a root having the same distance to all vertices in $S$. The problems are APX-hard, so a constant factor is the best we can hope for in polynomial time. The best previous approximation factor was $O((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote "Determining whether an $O(1)$ approximation can be obtained is a fascinating question".

preprint2022arXiv

How to Cut Corners and Get Bounded Convex Curvature

We describe an algorithm for solving an important geometric problem arising in computer-aided manufacturing. When cutting away a region from a solid piece of material -- such as steel, wood, ceramics, or plastic -- using a rough tool in a milling machine, sharp convex corners of the region cannot be done properly, but have to be left for finer tools that are more expensive to use. We want to determine a toolpath that maximizes the use of the rough tool. In order to formulate the problem in mathematical terms, we introduce the notion of bounded convex curvature. A region of points in the plane $Q$ has \emph{bounded convex curvature} if for any point $x\in\partial Q$, there is a unit disk $U$ and $\varepsilon>0$ such that $x\in \partial U$ and all points in $U$ within distance $\varepsilon$ from $x$ are in $Q$. This translates to saying that as we traverse the boundary $\partial Q$ with the interior of $Q$ on the left side, then $\partial Q$ turns to the left with curvature at most $1$. There is no bound on the curvature where $\partial Q$ turns to the right. Given a region of points $P$ in the plane, we are now interested in computing the maximum subset $Q\subseteq P$ of bounded convex curvature. The difference in the requirement to left- and right-curvature is a natural consequence of different conditions when machining convex and concave areas of $Q$. We devise an algorithm to compute the unique maximum such set $Q$, when the boundary of $P$ consists of $n$ line segments and circular arcs of arbitrary radii. In the general case where $P$ may have holes, the algorithm runs in time $O(n^2)$ and uses $O(n)$ space. If $P$ is simply-connected, we describe a faster $O(n\log n)$ time algorithm.

preprint2022arXiv

Understanding the Moments of Tabulation Hashing via Chaoses

Simple tabulation hashing dates back to Zobrist in 1970 and is defined as follows: Each key is viewed as $c$ characters from some alphabet $Σ$, we have $c$ fully random hash functions $h_0, \ldots, h_{c - 1} \colon Σ\to \{0, \ldots, 2^l - 1\}$, and a key $x = (x_0, \ldots, x_{c - 1})$ is hashed to $h(x) = h_0(x_0) \oplus \ldots \oplus h_{c - 1}(x_{c - 1})$ where $\oplus$ is the bitwise XOR operation. The previous results on tabulation hashing by P{\v a}tra{\c s}cu and Thorup~[J.ACM'11] and by Aamand et al.~[STOC'20] focused on proving Chernoff-style tail bounds on hash-based sums, e.g., the number keys hashing to a given value, for simple tabulation hashing, but their bounds do not cover the entire tail. Chaoses are random variables of the form $\sum a_{i_0, \ldots, i_{c - 1}} X_{i_0} \cdot \ldots \cdot X_{i_{c - 1}}$ where $X_i$ are independent random variables. Chaoses are a well-studied concept from probability theory, and tight analysis has been proven in several instances, e.g., when the independent random variables are standard Gaussian variables and when the independent random variables have logarithmically convex tails. We notice that hash-based sums of simple tabulation hashing can be seen as a sum of chaoses that are not independent. This motivates us to use techniques from the theory of chaoses to analyze hash-based sums of simple tabulation hashing. In this paper, we obtain bounds for all the moments of hash-based sums for simple tabulation hashing which are tight up to constants depending only on $c$. In contrast with the previous attempts, our approach will mostly be analytical and does not employ intricate combinatorial arguments. The improved analysis of simple tabulation hashing allows us to obtain bounds for the moments of hash-based sums for the mixed tabulation hashing introduced by Dahlgaard et al.~[FOCS'15].

preprint2020arXiv

Fast hashing with Strong Concentration Bounds

Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8-bits. The character domain $Σ$ should be small enough that character tables of size $|Σ|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|Σ|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |Σ|$. To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |Σ|$ for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call \emph{tabulation-permutation} hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

preprint2020arXiv

High Speed Hashing for Integers and Strings

These notes describe the most efficient hash functions currently known for hashing integers and strings. These modern hash functions are often an order of magnitude faster than those presented in standard text books. They are also simpler to implement, and hence a clear win in practice, but their analysis is harder. Some of the most practical hash functions have only appeared in theory papers, and some of them requires combining results from different theory papers. The goal here is to combine the information in lecture-style notes that can be used by theoreticians and practitioners alike, thus making these practical fruits of theory more widely accessible.

preprint2020arXiv

No Repetition: Fast Streaming with Highly Concentrated Hashing

To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$ independent experiments increases the processing time by a factor $r$. Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler. Fast practical hash functions with strong concentration bounds were recently proposed by Aamand em et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.

preprint2020arXiv

Three-in-a-Tree in Near Linear Time

The three-in-a-tree problem is to determine if a simple undirected graph contains an induced subgraph which is a tree connecting three given vertices. Based on a beautiful characterization that is proved in more than twenty pages, Chudnovsky and Seymour [Combinatorica 2010] gave the previously only known polynomial-time algorithm, running in $O(mn^2)$ time, to solve the three-in-a-tree problem on an $n$-vertex $m$-edge graph. Their three-in-a-tree algorithm has become a critical subroutine in several state-of-the-art graph recognition and detection algorithms. In this paper we solve the three-in-a-tree problem in $\tilde{O}(m)$ time, leading to improved algorithms for recognizing perfect graphs and detecting thetas, pyramids, beetles, and odd and even holes. Our result is based on a new and more constructive characterization than that of Chudnovsky and Seymour. Our new characterization is stronger than the original, and our proof implies a new simpler proof for the original characterization. The improved characterization gains the first factor $n$ in speed. The remaining improvement is based on dynamic graph algorithms.

preprint2017arXiv

Fast and Powerful Hashing using Tabulation

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of $c$ characters and we have precomputed character tables $h_1,...,h_c$ mapping characters to random hash values. A key $x=(x_1,...,x_c)$ is hashed to $h_1[x_1] \oplus h_2[x_2].....\oplus h_c[x_c]$. This schemes is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not.

preprint2016arXiv

Hashing for statistics over k-partitions

In this paper we analyze a hash function for $k$-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin~[FOCS'83] in order to save a factor $Ω(k)$ of time per element over $k$ independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al.~[AOFA'97] and in large-scale machine learning by Li et al.~[NIPS'12] for minwise estimation of set similarity. The main issue of $k$-partition, is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of $k$-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. a simple and efficient construction for invertible bloom filters and uniform hashing on a given set.

preprint2016arXiv

Heavy hitters via cluster-preserving clustering

In turnstile $\ell_p$ $\varepsilon$-heavy hitters, one maintains a high-dimensional $x\in\mathbb{R}^n$ subject to $\texttt{update}(i,Δ)$ causing $x_i\leftarrow x_i + Δ$, where $i\in[n]$, $Δ\in\mathbb{R}$. Upon receiving a query, the goal is to report a small list $L\subset[n]$, $|L| = O(1/\varepsilon^p)$, containing every "heavy hitter" $i\in[n]$ with $|x_i| \ge \varepsilon \|x_{\overline{1/\varepsilon^p}}\|_p$, where $x_{\overline{k}}$ denotes the vector obtained by zeroing out the largest $k$ entries of $x$ in magnitude. For any $p\in(0,2]$ the CountSketch solves $\ell_p$ heavy hitters using $O(\varepsilon^{-p}\log n)$ words of space with $O(\log n)$ update time, $O(n\log n)$ query time to output $L$, and whose output after any query is correct with high probability (whp) $1 - 1/poly(n)$. Unfortunately the query time is very slow. To remedy this, the work [CM05] proposed for $p=1$ in the strict turnstile model, a whp correct algorithm achieving suboptimal space $O(\varepsilon^{-1}\log^2 n)$, worse update time $O(\log^2 n)$, but much better query time $O(\varepsilon^{-1}poly(\log n))$. We show this tradeoff between space and update time versus query time is unnecessary. We provide a new algorithm, ExpanderSketch, which in the most general turnstile model achieves optimal $O(\varepsilon^{-p}\log n)$ space, $O(\log n)$ update time, and fast $O(\varepsilon^{-p}poly(\log n))$ query time, and whp correctness. Our main innovation is an efficient reduction from the heavy hitters to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a much bigger graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We then develop a "cluster-preserving clustering" algorithm, partitioning the graph into clusters without destroying any original cluster.

preprint2016arXiv

Incremental Exact Min-Cut in Poly-logarithmic Amortized Update Time

We present a deterministic incremental algorithm for \textit{exactly} maintaining the size of a minimum cut with $\widetilde{O}(1)$ amortized time per edge insertion and $O(1)$ query time. This result partially answers an open question posed by Thorup [Combinatorica 2007]. It also stays in sharp contrast to a polynomial conditional lower-bound for the fully-dynamic weighted minimum cut problem. Our algorithm is obtained by combining a recent sparsification technique of Kawarabayashi and Thorup [STOC 2015] and an exact incremental algorithm of Henzinger [J. of Algorithm 1997]. We also study space-efficient incremental algorithms for the minimum cut problem. Concretely, we show that there exists an ${O}(n\log n/\varepsilon^2)$ space Monte-Carlo algorithm that can process a stream of edge insertions starting from an empty graph, and with high probability, the algorithm maintains a $(1+\varepsilon)$-approximation to the minimum cut. The algorithm has $\widetilde{O}(1)$ amortized update-time and constant query-time.

preprint2016arXiv

The Power of Two Choices with Simple Tabulation

The power of two choices is a classic paradigm for load balancing when assigning $m$ balls to $n$ bins. When placing a ball, we pick two bins according to two hash functions $h_0$ and $h_1$, and place the ball in the least loaded bin. Assuming fully random hash functions, when $m=O(n)$, Azar et al.~[STOC'94] proved that the maximum load is $\lg \lg n + O(1)$ with high probability. In this paper, we investigate the power of two choices when the hash functions $h_0$ and $h_1$ are implemented with simple tabulation, which is a very efficient hash function evaluated in constant time. Following their analysis of Cuckoo hashing [J.ACM'12], Pǎtraşcu and Thorup claimed that the expected maximum load with simple tabulation is $O(\lg\lg n)$. This did not include any high probability guarantee, so the load balancing was not yet to be trusted. Here, we show that with simple tabulation, the maximum load is $O(\lg\lg n)$ with high probability, giving the first constant time hash function with this guarantee. We also give a concrete example where, unlike with fully random hashing, the maximum load is not bounded by $\lg \lg n + O(1)$, or even $(1+o(1))\lg \lg n$ with high probability. Finally, we show that the expected maximum load is $\lg \lg n + O(1)$, just like with fully random hashing.

preprint2015arXiv

Construction and impromptu repair of an MST in a distributed network with o(m) communication

In the CONGEST model, a communications network is an undirected graph whose $n$ nodes are processors and whose $m$ edges are the communications links between processors. At any given time step, a message of size $O(\log n)$ may be sent by each node to each of its neighbors. We show for the synchronous model: If all nodes start in the same round, and each node knows its ID and the ID's of its neighbors, or in the case of MST, the distinct weights of its incident edges and knows $n$, then there are Monte Carlo algorithms which succeed w.h.p. to determine a minimum spanning forest (MST) and a spanning forest (ST) using $O(n \log^2 n/\log\log n)$ messages for MST and $O(n \log n )$ messages for ST, resp. These results contradict the "folk theorem" noted in Awerbuch, et.al., JACM 1990 that the distributed construction of a broadcast tree requires $Ω(m)$ messages. This lower bound has been shown there and in other papers for some CONGEST models; our protocol demonstrates the limits of these models. A dynamic distributed network is one which undergoes online edge insertions or deletions. We also show how to repair an MST or ST in a dynamic network with asynchronous communication. An edge deletion can be processed in $O(n\log n /\log \log n)$ expected messages in the MST, and $O(n)$ expected messages for the ST problem, while an edge insertion uses $O(n)$ messages in the worst case. We call this "impromptu" updating as we assume that between processing of edge updates there is no preprocessing or storage of additional information. Previous algorithms for this problem that use an amortized $o(m)$ messages per update require substantial preprocessing and additional local storage between updates.

preprint2015arXiv

Faster Worst Case Deterministic Dynamic Connectivity

We present a deterministic dynamic connectivity data structure for undirected graphs with worst case update time $O\left(\sqrt{\frac{n(\log\log n)^2}{\log n}}\right)$ and constant query time. This improves on the previous best deterministic worst case algorithm of Frederickson (STOC 1983) and Eppstein Galil, Italiano, and Nissenzweig (J. ACM 1997), which had update time $O(\sqrt{n})$. All other algorithms for dynamic connectivity are either randomized (Monte Carlo) or have only amortized performance guarantees.

preprint2015arXiv

From Independence to Expansion and Back Again

We consider the following fundamental problems: (1) Constructing $k$-independent hash functions with a space-time tradeoff close to Siegel's lower bound. (2) Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one of them leads to a good solution to the other one. In this paper we exploit this connection to present efficient, recursive constructions of $k$-independent hash functions (and hence expanders with a small representation). While the previously most efficient construction (Thorup, FOCS 2013) needed time quasipolynomial in Siegel's lower bound, our time bound is just a logarithmic factor from the lower bound.

preprint2015arXiv

Planar Reachability in Linear Space and Constant Time

We show how to represent a planar digraph in linear space so that distance queries can be answered in constant time. The data structure can be constructed in linear time. This representation of reachability is thus optimal in both time and space, and has optimal construction time. The previous best solution used $O(n\log n)$ space for constant query time [Thorup FOCS'01].

preprint2015arXiv

Sample(x)=(a*x<=t) is a distinguisher with probability 1/8

A random sampling function Sample:U->{0,1} for a key universe U is a distinguisher with probability p if for any given assignment of values v(x) to the keys x in U, including at least one non-zero v(x)!=0, the sampled sum sum{ v(x) | x in U and Sample(x) } is non-zero with probability at least p. Here the key values may come from any commutative monoid (addition is commutative and associative and zero is neutral). Such distinguishers were introduced by Vazirani [PhD thesis 1986], and Naor and Naor used them for their small bias probability spaces [STOC'90]. Constant probability distinguishers are used for testing in contexts where the key values are not computed directly, yet where the sum is easily computed. A simple example is when we get a stream of key value pairs (x_1,v_1),(x_2,v_2),...,(x_n,v_n) where the same key may appear many times. The accumulated value of key x is v(x)=sum{v_i | x_i=x}. For space reasons, we may not be able to maintain v(x) for every key x, but the sampled sum is easily maintained as the single value sum{v_i | Sample(x_i)}. Here we show that when dealing with w-bit integers, if a is a uniform odd w-bit integer and t is a uniform w-bit integer, then Sample(x)=[ax mod 2^w <= t] is a distinguisher with probability 1/8. Working with standard units, that is, w=8, 16, 32, 64, we exploit that w-bit multiplication works modulo 2^w, discarding overflow automatically, and then the sampling decision is implemented by the C-code a*x<=t. Previous such samplers were much less computer friendly, e.g., the distinguisher of Naor and Naor [STOC'90] was more complicated and involved a 7-independent hash function.

preprint2014arXiv

Adjacency labeling schemes and induced-universal graphs

We describe a way of assigning labels to the vertices of any undirected graph on up to $n$ vertices, each composed of $n/2+O(1)$ bits, such that given the labels of two vertices, and no other information regarding the graph, it is possible to decide whether or not the vertices are adjacent in the graph. This is optimal, up to an additive constant, and constitutes the first improvement in almost 50 years of an $n/2+O(\log n)$ bound of Moon. As a consequence, we obtain an induced-universal graph for $n$-vertex graphs containing only $O(2^{n/2})$ vertices, which is optimal up to a multiplicative constant, solving an open problem of Vizing from 1968. We obtain similar tight results for directed graphs, tournaments and bipartite graphs.

preprint2014arXiv

Approximately Minwise Independence with Twisted Tabulation

A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79] $\tilde O(1/u^{1/c})$-minwise hashing requires $Ω(\log u)$-independence [Indyk SODA'99]. Pǎtraşcu and Thorup [STOC'11] had shown that simple tabulation, using same space and lookups yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner bypassing a complicated induction argument.

preprint2014arXiv

Dynamic Integer Sets with Optimal Rank, Select, and Predecessor Search

We present a data structure representing a dynamic set S of w-bit integers on a w-bit word RAM. With |S|=n and w > log n and space O(n), we support the following standard operations in O(log n / log w) time: - insert(x) sets S = S + {x}. - delete(x) sets S = S - {x}. - predecessor(x) returns max{y in S | y< x}. - successor(x) returns min{y in S | y >= x}. - rank(x) returns #{y in S | y< x}. - select(i) returns y in S with rank(y)=i, if any. Our O(log n/log w) bound is optimal for dynamic rank and select, matching a lower bound of Fredman and Saks [STOC'89]. When the word length is large, our time bound is also optimal for dynamic predecessor, matching a static lower bound of Beame and Fich [STOC'99] whenever log n/log w=O(log w/loglog w). Technically, the most interesting aspect of our data structure is that it supports all the above operations in constant time for sets of size n=w^{O(1)}. This resolves a main open problem of Ajtai, Komlos, and Fredman [FOCS'83]. Ajtai et al. presented such a data structure in Yao's abstract cell-probe model with w-bit cells/words, but pointed out that the functions used could not be implemented. As a partial solution to the problem, Fredman and Willard [STOC'90] introduced a fusion node that could handle queries in constant time, but used polynomial time on the updates. We call our small set data structure a dynamic fusion node as it does both queries and updates in constant time.

preprint2014arXiv

On the k-Independence Required by Linear Probing and Minwise Independence

We show that linear probing requires 5-independent hash functions for expected constant-time performance, matching an upper bound of [Pagh et al. STOC'07]. More precisely, we construct a 4-independent hash functions yielding expected logarithmic search time. For (1+ε)-approximate minwise independence, we show that Ω(log 1/ε)-independent hash functions are required, matching an upper bound of [Indyk, SODA'99]. We also show that the very fast 2-independent multiply-shift scheme of Dietzfelbinger [STACS'96] fails badly in both applications.

preprint2013arXiv

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard similarity f=|A intersect B|/|A union B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of kxmin-wise where we use k hash independent functions h_1,...,h_k, storing the smallest element with each hash function. For kxmin-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.

preprint2013arXiv

RAM-Efficient External Memory Sorting

In recent years a large number of problems have been considered in external memory models of computation, where the complexity measure is the number of blocks of data that are moved between slow external memory and fast internal memory (also called I/Os). In practice, however, internal memory time often dominates the total running time once I/O-efficiency has been obtained. In this paper we study algorithms for fundamental problems that are simultaneously I/O-efficient and internal memory efficient in the RAM model of computation.

preprint2013arXiv

Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence

Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1} mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the character hash tables h_i are in cache. Simple tabulation hashing is not 4-independent, but we show that if we apply it twice, then we get high independence. First we hash to intermediate keys that are 6 times longer than the original keys, and then we hash the intermediate keys to the final hash values. The intermediate keys have d=6c characters from A. We can view the hash function as a degree d bipartite graph with keys on one side, each with edges to d output characters. We show that this graph has nice expansion properties, and from that we get that with another level of simple tabulation on the intermediate keys, the composition is a highly independent hash function. The independence we get is |A|^{Omega(1/c)}. Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel [FOCS'89, SICOMP'04] proved that with this space, if the hash function is evaluated in o(c) time, then the independence can only be o(c), so our evaluation time is best possible for Omega(c) independence---our independence is much higher if c=|A|^{o(1)}. Siegel used O(c)^c evaluation time to get the same independence with similar space. Siegel's main focus was c=O(1), but we are exponentially faster when c=omega(1). Applying our scheme recursively, we can increase our independence to |A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's scheme this is both faster and higher independence. Our scheme is easy to implement, and it does provide realistic implementations of 100-independent hashing for, say, 32 and 64-bit keys.

preprint2012arXiv

Combinatorial coloring of 3-colorable graphs

We consider the problem of coloring a 3-colorable graph in polynomial time using as few colors as possible. We present a combinatorial algorithm getting down to $\tO(n^{4/11})$ colors. This is the first combinatorial improvement of Blum's $\tO(n^{3/8})$ bound from FOCS'90. Like Blum's algorithm, our new algorithm composes nicely with recent semi-definite approaches. The current best bound is $O(n^{0.2072})$ colors by Chlamtac from FOCS'07. We now bring it down to $O(n^{0.2038})$ colors.

preprint2011arXiv

Don't Rush into a Union: Take Time to Find Your Roots

We present a new threshold phenomenon in data structure lower bounds where slightly reduced update times lead to exploding query times. Consider incremental connectivity, letting t_u be the time to insert an edge and t_q be the query time. For t_u = Omega(t_q), the problem is equivalent to the well-understood union-find problem: InsertEdge(s,t) can be implemented by Union(Find(s), Find(t)). This gives worst-case time t_u = t_q = O(lg n / lglg n) and amortized t_u = t_q = O(alpha(n)). By contrast, we show that if t_u = o(lg n / lglg n), the query time explodes to t_q >= n^{1-o(1)}. In other words, if the data structure doesn't have time to find the roots of each disjoint set (tree) during edge insertion, there is no effective way to organize the information! For amortized complexity, we demonstrate a new inverse-Ackermann type trade-off in the regime t_u = o(t_q). A similar lower bound is given for fully dynamic connectivity, where an update time of o(\lg n) forces the query time to be n^{1-o(1)}. This lower bound allows for amortization and Las Vegas randomization, and comes close to the known O(lg n * poly(lglg n)) upper bound.

preprint2011arXiv

Minimum k-way cut of bounded size is fixed-parameter tractable

We consider a the minimum k-way cut problem for unweighted graphs with a size bound s on the number of cut edges allowed. Thus we seek to remove as few edges as possible so as to split a graph into k components, or report that this requires cutting more than s edges. We show that this problem is fixed-parameter tractable (FPT) in s. More precisely, for s=O(1), our algorithm runs in quadratic time while we have a different linear time algorithm for planar graphs and bounded genus graphs. Our tractability result stands in contrast to known W[1] hardness of related problems. Without the size bound, Downey et al.[2003] proved that the minimum k-way cut problem is W[1] hard in k even for simple unweighted graphs. Downey et al. asked about the status for planar graphs. Our result implies tractability in k for the planar graphs since the minimum k-way cut of a planar graph is of size at most 6k (more generally, we get tractability in k for any graph class with k-way cuts of size limited by is a function of k, e.g., bounded degree graphs, or simple graphs with an excluded minor). A simple reduction shows that vertex cuts are at least as hard as edge cuts, so the minimum k-way vertex cut is also W[1] hard in terms of k. Marx [2004] proved that finding a minimum k-way vertex cut of size s is also W[1] hard in s. Marx asked about the FPT status with edge cuts, which we prove tractable here. We are not aware of any other cut problem where the vertex version is W[1] hard but the edge version is FPT.

preprint2011arXiv

The Power of Simple Tabulation Hashing

Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC'77). Keys are viewed as consisting of c characters. We initialize c tables T_1, ..., T_c mapping characters to random hash codes. A key x=(x_1, ..., x_q) is hashed to T_1[x_1] xor ... xor T_c[x_c]. While this scheme is not even 4-independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernoff-type concentration, min-wise hashing for estimating set intersection, and cuckoo hashing.

preprint2010arXiv

Stream sampling for variance-optimal estimation of subset sums

From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\varoptk$, that dominates all previous schemes in terms of estimation quality. $\varoptk$ provides {\em variance optimal unbiased estimation of subset sums}. More precisely, if we have seen $n$ items of the stream, then for {\em any} subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of {\em particular} subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.

preprint2003arXiv

Rounding Algorithms for a Geometric Embedding of Minimum Multiway Cut

The multiway-cut problem is, given a weighted graph and k >= 2 terminal nodes, to find a minimum-weight set of edges whose removal separates all the terminals. The problem is NP-hard, and even NP-hard to approximate within 1+delta for some small delta > 0. Calinescu, Karloff, and Rabani (1998) gave an algorithm with performance guarantee 3/2-1/k, based on a geometric relaxation of the problem. In this paper, we give improved randomized rounding schemes for their relaxation, yielding a 12/11-approximation algorithm for k=3 and a 1.3438-approximation algorithm in general. Our approach hinges on the observation that the problem of designing a randomized rounding scheme for a geometric relaxation is itself a linear programming problem. The paper explores computational solutions to this problem, and gives a proof that for a general class of geometric relaxations, there are always randomized rounding schemes that match the integrality gap.

Mikkel Thorup

What is connected

Connect this record

See the researcher in context

Building this map preview

33 published item(s)

PageRank Centrality in Directed Graphs with Bounded In-Degree

Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

Edge Sampling and Graph Parameter Estimation via Vertex Neighborhood Accesses

Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor

How to Cut Corners and Get Bounded Convex Curvature

Understanding the Moments of Tabulation Hashing via Chaoses

Fast hashing with Strong Concentration Bounds

High Speed Hashing for Integers and Strings

No Repetition: Fast Streaming with Highly Concentrated Hashing

Three-in-a-Tree in Near Linear Time

Fast and Powerful Hashing using Tabulation

Hashing for statistics over k-partitions

Heavy hitters via cluster-preserving clustering

Incremental Exact Min-Cut in Poly-logarithmic Amortized Update Time

The Power of Two Choices with Simple Tabulation

Construction and impromptu repair of an MST in a distributed network with o(m) communication

Faster Worst Case Deterministic Dynamic Connectivity

From Independence to Expansion and Back Again

Planar Reachability in Linear Space and Constant Time

Sample(x)=(a*x<=t) is a distinguisher with probability 1/8

Adjacency labeling schemes and induced-universal graphs

Approximately Minwise Independence with Twisted Tabulation

Dynamic Integer Sets with Optimal Rank, Select, and Predecessor Search

On the k-Independence Required by Linear Probing and Minwise Independence

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence

RAM-Efficient External Memory Sorting

Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence

Combinatorial coloring of 3-colorable graphs

Don't Rush into a Union: Take Time to Find Your Roots

Minimum k-way cut of bounded size is fixed-parameter tractable

The Power of Simple Tabulation Hashing

Stream sampling for variance-optimal estimation of subset sums

Rounding Algorithms for a Geometric Embedding of Minimum Multiway Cut