Source author record

William Kuszmaul

William Kuszmaul appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Data Structures and Algorithms; math.CO; Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Discrete Mathematics; Machine Learning; math.QA; math.RA; Networking and Internet Architecture

Catalog footprint

What is connected

16 works

9 topics

4 close collaborators

preprint, 2026, arXiv

Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time $O(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.
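As a rough illustration of the transform discussed in this abstract, the sketch below composes randomized Hadamard transforms in plain Python. It assumes the common formulation of an RHT as random sign flips followed by a normalized Walsh-Hadamard transform, and it requires the dimension to be a power of two. The function names (`fwht`, `rht`, `compose_rhts`) and the `seed` parameter are illustrative choices, not taken from the paper, and the moment check mentioned in the abstract is not implemented.

```python
import math
import random


def fwht(vec):
    """Return the normalized Walsh-Hadamard transform of vec.

    The length of vec must be a power of two. Runs in O(d log d) time.
    """
    d = len(vec)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < d:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(d)  # makes the transform orthogonal
    return [x * scale for x in out]


def rht(vec, rng):
    """Apply one RHT: random +/-1 signs, then the normalized Hadamard transform."""
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return fwht([s * x for s, x in zip(signs, vec)])


def compose_rhts(vec, rounds, seed=0):
    """Apply `rounds` independent RHTs in sequence (assumed helper, not from the paper)."""
    rng = random.Random(seed)
    out = list(vec)
    for _ in range(rounds):
        out = rht(out, rng)
    return out
```

Because each step is orthogonal, the Euclidean norm of the input is preserved up to floating-point error, which gives a quick sanity check on any number of composed rounds.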

preprint, 2022, arXiv

Approximating Dynamic Time Warping Distance Between Run-Length Encoded Strings

Dynamic Time Warping (DTW) is a widely used similarity measure for comparing strings that encode time series data, with applications to areas including bioinformatics, signature verification, and speech recognition. The standard dynamic-programming algorithm for DTW takes $O(n^2)$ time, and there are conditional lower bounds showing that no algorithm can do substantially better. In many applications, however, the strings $x$ and $y$ may contain long runs of repeated letters, meaning that they can be compressed using run-length encoding. A natural question is whether the DTW-distance between these compressed strings can be computed efficiently in terms of the lengths $k$ and $\ell$ of the compressed strings. Recent work has shown how to achieve $O(k\ell^2 + \ell k^2)$ time, leaving open the question of whether a near-quadratic $\tilde{O}(k\ell)$-time algorithm might exist. We show that, if a small approximation loss is permitted, then a near-quadratic time algorithm is indeed possible: our algorithm computes a $(1 + ε)$-approximation for $\mathrm{DTW}(x, y)$ in $\tilde{O}(k\ell / ε^3)$ time, where $k$ and $\ell$ are the number of runs in $x$ and $y$. Our algorithm allows for DTW to be computed over any metric space $(Σ, δ)$ in which distances are $O(\log n)$-bit integers. Surprisingly, the algorithm also works even if $δ$ does not induce a metric space on $Σ$ (e.g., $δ$ need not satisfy the triangle inequality).
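For orientation, here is a minimal sketch of the standard quadratic-time dynamic program that the abstract uses as its baseline, together with a run-length encoder showing where the run counts $k$ and $\ell$ come from. This is the exact textbook algorithm on uncompressed inputs, not the approximation algorithm of the paper. The default distance `abs(a - b)` and the function names are assumptions made for the example.

```python
from itertools import groupby


def run_length_encode(s):
    """Compress a sequence into (letter, run_length) pairs."""
    return [(letter, len(list(group))) for letter, group in groupby(s)]


def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Exact DTW distance between non-empty sequences x and y.

    Uses the standard O(len(x) * len(y)) dynamic program with two rows.
    """
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]
```

For example, `run_length_encode("aaabcc")` has three runs, so a run-aware algorithm would be measured against $k = 3$ rather than the uncompressed length 6.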

preprint, 2022, arXiv

Balanced Allocations: The Heavily Loaded Case with Deletions

In the 2-choice allocation problem, $m$ balls are placed into $n$ bins, and each ball must choose between two random bins $i, j \in [n]$ that it has been assigned to. It has been known for more than two decades that if each ball follows the Greedy strategy (i.e., always pick the less-full bin), then the maximum load will be $m/n + O(\log \log n)$ with high probability in $n$ (and $m/n + O(\log m)$ with high probability in $m$). It has remained open whether the same bounds hold in the dynamic version of the same game, where balls are inserted/deleted with up to $m$ balls present at a time. We show that these bounds do not hold in the dynamic setting: already on $4$ bins, there exists a sequence of insertions/deletions that causes Greedy to incur a maximum load of $m/4 + Ω(\sqrt{m})$ with probability $Ω(1)$; this is the same bound as if each ball is simply assigned to a random bin! This raises the question of whether any 2-choice allocation strategy can offer a strong bound in the dynamic setting. Our second result answers this question in the affirmative: we present a new strategy, called ModulatedGreedy, that guarantees a maximum load of $m/n + O(\log m)$, at any given moment, with high probability in $m$. Generalizing ModulatedGreedy, we obtain dynamic guarantees for the $(1 + β)$-choice setting, and for the setting of balls-and-bins on a graph. Finally, we consider a setting in which balls can be reinserted after they are deleted, and where the pair $i, j$ that a given ball uses is consistent across insertions. This seemingly small modification renders tight load balancing impossible: on 4 bins, any strategy that is oblivious to the specific identities of balls must allow for a maximum load of $m/4 + \operatorname{poly}(m)$ at some point in the first $\operatorname{poly}(m)$ insertions/deletions, with high probability in $m$.
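The insertion-only Greedy rule described at the start of this abstract can be simulated in a few lines. The sketch below is only that classical baseline: it does not model deletions and it is not ModulatedGreedy. The function name, the tie-breaking rule (ties go to the first sampled bin), and the `seed` parameter are assumptions for illustration.

```python
import random


def greedy_two_choice(m, n, seed=0):
    """Insert m balls into n bins, each ball picking the emptier of two random bins.

    Returns the list of final bin loads. Insertions only; no deletions.
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        i, j = rng.randrange(n), rng.randrange(n)
        # Greedy: place the ball in the less-full of its two candidate bins.
        target = i if loads[i] <= loads[j] else j
        loads[target] += 1
    return loads
```

Comparing `max(loads)` against `m / n` for growing `m` is one way to observe empirically how small the gap stays in the insertion-only setting.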

preprint, 2022, arXiv

Memoryless Worker-Task Assignment with Polylogarithmic Switching Cost

We study the basic problem of assigning memoryless workers to tasks with dynamically changing demands. Given a set of $w$ workers and a multiset $T \subseteq [t]$ of $|T| = w$ tasks, a memoryless worker-task assignment function is any function $ϕ$ that assigns the workers $[w]$ to the tasks $T$ based only on the current value of $T$. The assignment function $ϕ$ is said to have switching cost at most $k$ if, for every task multiset $T$, changing the contents of $T$ by one task changes $ϕ(T)$ by at most $k$ worker assignments. The goal of memoryless worker-task assignment is to construct an assignment function with the smallest possible switching cost. In past work, the problem of determining the optimal switching cost has been posed as an open question. There are no known sub-linear upper bounds, and after considerable effort, the best known lower bound remains 4 (ICALP 2020). We show that it is possible to achieve polylogarithmic switching cost. We give a construction via the probabilistic method that achieves switching cost $O(\log w \log (wt))$ and an explicit construction that achieves switching cost $\operatorname{polylog} (wt)$. We also prove a super-constant lower bound on switching cost: we show that for any value of $w$, there exists a value of $t$ for which the optimal switching cost is $w$. Thus it is not possible to achieve a switching cost that is sublinear strictly as a function of $w$. Finally, we present an application of the worker-task assignment problem to a metric embeddings problem. In particular, we use our results to give the first low-distortion embedding from sparse binary vectors into low-dimensional Hamming space.

preprint, 2022, arXiv

Online List Labeling: Breaking the $\log^2n$ Barrier

The online list labeling problem is an algorithmic primitive with a large literature of upper bounds, lower bounds, and applications. The goal is to store a dynamically changing set of $n$ items in an array of $m$ slots, while maintaining the invariant that the items appear in sorted order, and while minimizing the relabeling cost, defined to be the number of items that are moved per insertion/deletion. For the linear regime, where $m = (1 + Θ(1)) n$, an upper bound of $O(\log^2 n)$ on the relabeling cost has been known since 1981. A lower bound of $Ω(\log^2 n)$ is known for deterministic algorithms and for so-called smooth algorithms, but the best general lower bound remains $Ω(\log n)$. The central open question in the field is whether $O(\log^2 n)$ is optimal for all algorithms. In this paper, we give a randomized data structure that achieves an expected relabeling cost of $O(\log^{3/2} n)$ per operation. More generally, if $m = (1 + \varepsilon) n$ for $\varepsilon = O(1)$, the expected relabeling cost becomes $O(\varepsilon^{-1} \log^{3/2} n)$. Our solution is history independent, meaning that the state of the data structure is independent of the order in which items are inserted/deleted. For history-independent data structures, we also prove a matching lower bound: for all $\varepsilon$ between $1 / n^{1/3}$ and some sufficiently small positive constant, the optimal expected cost for history-independent list-labeling solutions is $Θ(\varepsilon^{-1}\log^{3/2} n)$.

preprint, 2022, arXiv

Tight Bounds for Monotone Minimal Perfect Hashing

The monotone minimal perfect hash function (MMPHF) problem is the following indexing problem. Given a set $S= \{s_1,\ldots,s_n\}$ of $n$ distinct keys from a universe $U$ of size $u$, create a data structure $DS$ that answers the following query: \[ RankOp(q) = \text{rank of } q \text{ in } S \text{ for all } q\in S ~\text{ and arbitrary answer otherwise.} \] Solutions to the MMPHF problem are in widespread use in both theory and practice. The best upper bound known for the problem encodes $DS$ in $O(n\log\log\log u)$ bits and performs queries in $O(\log u)$ time. It has been an open problem to either improve the space upper bound or to show that this somewhat odd looking bound is tight. In this paper, we show the latter: specifically that any data structure (deterministic or randomized) for monotone minimal perfect hashing of any collection of $n$ elements from a universe of size $u$ requires $Ω(n \cdot \log\log\log{u})$ expected bits to answer every query correctly. We achieve our lower bound by defining a graph $\mathbf{G}$ where the nodes are the possible ${u \choose n}$ inputs and where two nodes are adjacent if they cannot share the same $DS$. The size of $DS$ is then lower bounded by the log of the chromatic number of $\mathbf{G}$. Finally, we show that the fractional chromatic number (and hence the chromatic number) of $\mathbf{G}$ is lower bounded by $2^{Ω(n \log\log\log u)}$.

preprint, 2022, arXiv

What Does Dynamic Optimality Mean in External Memory?

In this paper, we revisit the question of how the dynamic optimality of search trees should be defined in external memory. A defining characteristic of external-memory data structures is that there is a stark asymmetry between queries and inserts/updates/deletes: by making the former slightly asymptotically slower, one can make the latter significantly asymptotically faster (even allowing for operations with sub-constant amortized I/Os). This asymmetry makes it so that rotation-based search trees are not optimal (or even close to optimal) in insert/update/delete-heavy external-memory workloads. To study dynamic optimality for such workloads, one must consider a different class of data structures. The natural class of data structures to consider are what we call buffered-propagation trees. Such trees can adapt dynamically to the locality properties of an input sequence in order to optimize the interactions between different inserts/updates/deletes and queries. We also present a new form of beyond-worst-case analysis that allows us to formally study a continuum between static and dynamic optimality. Finally, we give a novel data structure, called the Jellotree, that is statically optimal and that achieves dynamic optimality for a large natural class of inputs defined by our beyond-worst-case analysis.

preprint, 2021, arXiv

The Variable-Processor Cup Game

The problem of scheduling tasks on $p$ processors so that no task ever gets too far behind is often described as a game with cups and water. In the $p$-processor cup game on $n$ cups, there are two players, a filler and an emptier, that take turns adding and removing water from a set of $n$ cups. In each turn, the filler adds $p$ units of water to the cups, placing at most $1$ unit of water in each cup, and then the emptier selects $p$ cups to remove up to $1$ unit of water from. The emptier's goal is to minimize the backlog, which is the height of the fullest cup. The $p$-processor cup game has been studied in many different settings, dating back to the late 1960s. All of the past work shares one common assumption: that $p$ is fixed. This paper initiates the study of what happens when the number of available processors $p$ varies over time, resulting in what we call the variable-processor cup game. Remarkably, the optimal bounds for the variable-processor cup game differ dramatically from its classical counterpart. Whereas the $p$-processor cup game has optimal backlog $Θ(\log n)$, the variable-processor game has optimal backlog $Θ(n)$. Moreover, there is an efficient filling strategy that yields backlog $Ω(n^{1 - ε})$ in quasi-polynomial time against any deterministic emptying strategy. We additionally show that straightforward uses of randomization cannot be used to help the emptier. In particular, for any positive constant $Δ$, and any $Δ$-greedy-like randomized emptying algorithm $\mathcal{A}$, there is a filling strategy that achieves backlog $Ω(n^{1 - ε})$ against $\mathcal{A}$ in quasi-polynomial time.
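The rules of a single turn in the $p$-processor cup game, as stated in this abstract, can be written down directly. The sketch below plays one round against a greedy emptier that always drains the $p$ fullest cups; the greedy emptier, the function names, and the floating-point tolerance are assumptions for illustration and are not strategies or bounds taken from the paper.

```python
def play_round(cups, fill, p):
    """Play one round of the p-processor cup game and return the new cup heights.

    cups: current water heights, one per cup.
    fill: water the filler adds to each cup; each entry in [0, 1], summing to p.
    The emptier here is greedy: it removes up to 1 unit from each of the p fullest cups.
    """
    if len(fill) != len(cups):
        raise ValueError("fill must have one entry per cup")
    if any(a < 0 or a > 1 for a in fill) or abs(sum(fill) - p) > 1e-9:
        raise ValueError("filler must add p units with at most 1 unit per cup")
    heights = [c + a for c, a in zip(cups, fill)]
    # Indices of the p fullest cups after the filler's move.
    fullest = sorted(range(len(heights)), key=lambda i: heights[i], reverse=True)[:p]
    for i in fullest:
        heights[i] = max(0.0, heights[i] - 1.0)
    return heights


def backlog(cups):
    """Backlog is the height of the fullest cup."""
    return max(cups)
```

In the variable-processor version studied in the paper, `p` would be allowed to change from one call of `play_round` to the next.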

preprint, 2020, arXiv

Contention Resolution Without Collision Detection

This paper focuses on the contention resolution problem on a shared communication channel that does not support collision detection. A shared communication channel is a multiple access channel, which consists of a sequence of synchronized time slots. Players on the channel may attempt to broadcast a packet (message) in any time slot. A player's broadcast succeeds if no other player broadcasts during that slot. If two or more players broadcast in the same time slot, then the broadcasts collide and all of them fail. The lack of collision detection means that a player monitoring the channel cannot differentiate between the case of two or more players broadcasting in the same slot (a collision) and zero players broadcasting. In the contention-resolution problem, players arrive on the channel over time, and each player has one packet to transmit. The goal is to coordinate the players so that each player is able to successfully transmit its packet within reasonable time. However, the players can only communicate via the shared channel by choosing to either broadcast or not. A contention-resolution protocol is measured in terms of its throughput (channel utilization). Previous work on contention resolution that achieved constant throughput assumed that either players could detect collisions, or the players' arrival pattern is generated by a memoryless (non-adversarial) process. The foundational question answered by this paper is whether collision detection is a luxury or a necessity when the objective is to achieve constant throughput. We show that even without collision detection, one can solve contention resolution, achieving constant throughput, with high probability.

preprint, 2020, arXiv

In-Place Parallel-Partition Algorithms using Exclusive-Read-and-Write Memory

We present an in-place algorithm for the partition problem that has linear work and polylogarithmic span. The algorithm uses only exclusive read/write shared variables, and can be implemented using parallel-for-loops without any additional concurrency considerations (i.e., the algorithm is EREW). A key feature of the algorithm is that it exhibits provably optimal cache behavior, up to small-order factors. We also present a second in-place EREW algorithm for the partition problem that has linear work and span $O(\log n \cdot \log \log n)$, which is within an $O(\log\log n)$ factor of the optimal span. By using this low-span algorithm as a subroutine within the cache-friendly algorithm, we are able to obtain a single EREW algorithm that combines their theoretical guarantees: the algorithm achieves span $O(\log n \cdot \log \log n)$ and optimal cache behavior. As an immediate consequence, we also get an in-place EREW quicksort algorithm with work $O(n \log n)$ and span $O(\log^2 n \cdot \log \log n)$. Whereas the standard EREW algorithm for parallel partitioning is memory-bandwidth bound on large numbers of cores, our cache-friendly algorithm is able to achieve near-ideal scaling in practice by avoiding the memory-bandwidth bottleneck. The algorithm's performance is comparable to that of the Blocked Strided Algorithm of Francis, Pannan, Frias, and Petit, which is the previous state of the art for parallel EREW sorting algorithms, but which lacks theoretical guarantees on its span and cache behavior.

preprint, 2016, arXiv

Fast Concurrent Cuckoo Kick-Out Eviction Schemes for High-Density Tables

Cuckoo hashing guarantees constant-time lookups regardless of table density, making it a viable candidate for high-density tables. Cuckoo hashing insertions perform poorly at high table densities, however. In this paper, we mitigate this problem through the introduction of novel kick-out eviction algorithms. Experimentally, our algorithms reduce the number of bins viewed per insertion for high-density tables by as much as a factor of ten. We also introduce an optimistic concurrency scheme for transactional multi-writer cuckoo hash tables (not using hardware transactional memory). For delete-light workloads, one of our kick-out algorithms avoids all competition between insertions with high probability, and significantly reduces transaction-abort frequency. This result is extended to arbitrary workloads using a new synchronization mechanism called a claim flag.
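To make the term "kick-out" concrete, the following is a minimal single-threaded sketch of textbook cuckoo hashing with one key per slot. It is not one of the eviction schemes introduced in the paper, and it has no concurrency support. The class name, the two hash functions, and the `max_kicks` cutoff are assumptions chosen for the example; a practical table would rehash or resize when the cutoff is reached.

```python
class CuckooTable:
    """Textbook cuckoo hash table: every key lives in one of two candidate slots."""

    def __init__(self, num_slots, max_kicks=32):
        if num_slots < 2:
            raise ValueError("need at least two slots")
        self.slots = [None] * num_slots
        self.max_kicks = max_kicks

    def _h1(self, key):
        return hash(key) % len(self.slots)

    def _h2(self, key):
        # Offset in 1..num_slots-1 guarantees the second slot differs from the first.
        n = len(self.slots)
        offset = 1 + (hash(key) // n) % (n - 1)
        return (self._h1(key) + offset) % n

    def contains(self, key):
        """Constant-time lookup: only two slots are ever inspected."""
        return self.slots[self._h1(key)] == key or self.slots[self._h2(key)] == key

    def insert(self, key):
        """Insert key; return None on success, or the key left without a slot."""
        if self.contains(key):
            return None
        pos = self._h1(key)
        for _ in range(self.max_kicks):
            if self.slots[pos] is None:
                self.slots[pos] = key
                return None
            # Kick out the resident key and take its slot.
            key, self.slots[pos] = self.slots[pos], key
            # The evicted key moves to its other candidate slot.
            h1, h2 = self._h1(key), self._h2(key)
            pos = h2 if pos == h1 else h1
        return key  # a full implementation would rehash here
```

The length of the eviction chain in `insert` is the quantity that grows at high table density, which is the cost the abstract's kick-out algorithms aim to reduce.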

preprint, 2016, arXiv

Signed Enumeration of Upper-Right Corners in Path Shuffles

We resolve a conjecture of Albert and Bousquet-Mélou enumerating quarter-plane walks with fixed horizontal and vertical projections according to their upper-right-corner count modulo 2. In doing this, we introduce a signed upper-right-corner count statistic. We find its distribution over planar walks with any choice of fixed horizontal and vertical projections. Additionally, we prove that the polynomial counting loops with a fixed horizontal and vertical projection according to the absolute value of their signed upper-right-corner count is $(x+1)$-positive. Finally, we conjecture an equivalence between $(x+1)$-positivity of the generating function for upper-right-corner count and signed upper-right-corner count.

preprint, 2014, arXiv

A New Approach to Enumerating Statistics Modulo $n$

We find a new approach to computing the remainder of a polynomial modulo $x^n-1$; such a computation is called modular enumeration. Given a polynomial with coefficients from a commutative $\mathbb{Q}$-algebra, our first main result constructs the remainder simply from the coefficients of residues of the polynomial modulo $Φ_d(x)$ for each $d\mid n$. Since such residues can often be found to have nice values, this simplifies a number of modular enumeration problems; indeed in some cases, such residues are already known while the related modular enumeration problem has remained unsolved. We list six such cases which our technique makes easy to solve. Our second main result is a formula for the unique polynomial $a$ such that $a \equiv f \pmod{Φ_n(x)}$ and $a \equiv 0 \pmod{x^d-1}$ for each proper divisor $d$ of $n$. We find a formula for remainders of $q$-multinomial coefficients and for remainders of $q$-Catalan numbers modulo $q^n-1$, reducing each problem to a finite number of cases for any fixed $n$. In the former case, we solve an open problem posed by Hartke and Radcliffe. In considering $q$-Catalan numbers modulo $q^n-1$, we discover a cyclic group operation on certain lattice paths which behaves predictably with regard to major index. We also make progress on a problem in modular enumeration on subset sums posed by Kitchloo and Pachter.
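The object being computed here, the remainder of a polynomial modulo $x^n - 1$, has a simple direct description: since $x^n \equiv 1$, the coefficient of $x^i$ is added to position $i \bmod n$. The sketch below implements only that direct folding for an explicitly given coefficient list; it is not the cyclotomic-residue method of the paper. The function name and the coefficient-list representation are assumptions for the example.

```python
def remainder_mod_xn_minus_1(coeffs, n):
    """Return the remainder of a polynomial modulo x^n - 1.

    coeffs[i] is the coefficient of x^i. The result is a list of n coefficients,
    where entry r is the sum of all coeffs[i] with i congruent to r modulo n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    rem = [0] * n
    for i, c in enumerate(coeffs):
        rem[i % n] += c  # x^i reduces to x^(i mod n) because x^n = 1
    return rem
```

As a worked instance, $(1+x)^4 = 1 + 4x + 6x^2 + 4x^3 + x^4$ reduces modulo $x^3 - 1$ to $5 + 5x + 6x^2$, so `remainder_mod_xn_minus_1([1, 4, 6, 4, 1], 3)` returns `[5, 5, 6]`. Read combinatorially, this counts the subsets of a 4-element set by size modulo 3.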

preprint, 2014, arXiv

Counting Permutations Modulo Pattern-Replacement Equivalences for Three-Letter Patterns

We study a family of equivalence relations on $S_n$, the group of permutations on $n$ letters, created in a manner similar to that of the Knuth relation and the forgotten relation. For our purposes, two permutations are in the same equivalence class if one can be reached from the other through a series of pattern-replacements using patterns whose order permutations are in the same part of a predetermined partition of $S_c$. When the partition is of $S_3$ and has one nontrivial part and that part is of size greater than two, we provide formulas for the number of classes created in each previously unsolved case. When the partition is of $S_3$ and has two nontrivial parts, each of size two (as do the Knuth and forgotten relations), we enumerate the classes for $13$ of the $14$ unresolved cases. In two of these cases, enumerations arise which are the same as those yielded by the Knuth and forgotten relations. The reasons for this phenomenon are still largely a mystery.

preprint, 2014, arXiv

New Results on Doubly Adjacent Pattern-Replacement Equivalences

In this paper, we consider the family of pattern-replacement equivalence relations referred to as the "indices and values adjacent" case. Each such equivalence is determined by a partition $P$ of a subset of $S_c$ for some $c$. In 2010, Linton, Propp, Roby, and West posed a number of open problems in the area of pattern-replacement equivalences. Five, in particular, have remained unsolved until now: the enumeration of equivalence classes under the $\{123, 132\}$-equivalence, under the $\{123, 321\}$-equivalence, under the $\{123, 132, 213\}$-equivalence, and under the $\{123, 132, 213, 321\}$-equivalence. We find formulas for three of the five equivalences and systems of representatives for the equivalence classes of the other two. We generalize our results to hold for all replacement partitions of $S_3$, as well as for an infinite family of other replacement partitions. In addition, we characterize the equivalence classes in $S_n$ under the $S_c$-equivalence, finding a generalization of Stanley's results on the $\{12, 21\}$-equivalence. To do this, we introduce a notion of confluence that often allows one to find a representative element in each equivalence class under a given equivalence relation. Using an inclusion-exclusion argument, we are able to use this to count the equivalence classes under equivalence relations satisfying certain conditions.

preprint, 2012, arXiv

Lower central series of a free associative algebra over the integers and finite fields

Consider the free algebra $A_n$ generated over $\mathbb{Q}$ by $n$ generators $x_1, \ldots, x_n$. Interesting objects attached to $A = A_n$ are members of its lower central series, $L_i = L_i(A)$, defined inductively by $L_1 = A$, $L_{i+1} = [A, L_i]$, and their associated graded components $B_i = B_i(A)$, defined as $B_i = L_i/L_{i+1}$. These quotients $B_i$, for $i$ at least 2, as well as the reduced quotient $\bar{B}_1 = A/(L_2 + A L_3)$, exhibit a rich geometric structure, as shown by Feigin and Shoikhet and later authors (Dobrovolska-Kim-Ma, Dobrovolska-Etingof, Arbesfeld-Jordan, Bapat-Jordan). We study the same problem over the integers $\mathbb{Z}$ and finite fields $\mathbb{F}_p$. New phenomena arise, namely, torsion in $B_i$ over $\mathbb{Z}$, and jumps in dimension over $\mathbb{F}_p$. We describe the torsion in the reduced quotient $\bar{B}_1$ and in $B_2$ geometrically in terms of the de Rham cohomology of $\mathbb{Z}^n$. As a corollary we obtain a complete description of $\bar{B}_1(A_n(\mathbb{Z}))$ and $\bar{B}_1(A_n(\mathbb{F}_p))$, as well as of $B_2(A_n(\mathbb{Z}[1/2]))$ and $B_2(A_n(\mathbb{F}_p))$, $p > 2$. We also give theoretical and experimental results for $B_i$ with $i > 2$, formulating a number of conjectures and questions based on them. Finally, we discuss the supercase, when some of the generators are odd (fermionic) and some are even (bosonic), and provide some theoretical results and experimental data in this case.
