Source author record

Guy E. Blelloch

Guy E. Blelloch appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Databases; Numerical Analysis; Programming Languages

Catalog footprint

What is connected

18 works

5 topics

4 close collaborators


preprint, 2023, arXiv

Practically and Theoretically Efficient Garbage Collection for Multiversioning

Multiversioning is widely used in databases, transactional memory, and concurrent data structures. It can be used to support read-only transactions that appear atomic in the presence of concurrent update operations. Any system that maintains multiple versions of each object needs a way of efficiently reclaiming them. We experimentally compare various existing reclamation techniques by applying them to a multiversion tree and a multiversion hash table. Using insights from these experiments, we develop two new multiversion garbage collection (MVGC) techniques. These techniques use two novel concurrent version list data structures. Our experimental evaluation shows that our fastest technique is competitive with the fastest existing MVGC techniques, while using significantly less space on some workloads. Our new techniques provide strong theoretical bounds, especially on space usage. These bounds ensure that the schemes have consistent performance, avoiding the very high worst-case space usage of other techniques.

preprint, 2022, arXiv

Lock-Free Locks Revisited

This paper presents a new and practical approach to lock-free locks based on helping, which allows the user to write code using fine-grained locks, but run it in a lock-free manner. Although lock-free locks have been suggested in the past, they are widely viewed as impractical, have some key limitations, and, as far as we know, have never been implemented. The paper presents some key techniques that make lock-free locks practical and more general. The most important technique is an approach to idempotence -- i.e. making code that runs multiple times appear as if it ran once. The idea is based on using a shared log among processes running the same protected code. Importantly, the approach can be library based, requiring very little if any change to standard code -- code just needs to use the idempotent versions of memory operations (load, store, LL/SC, allocation, free). We have implemented a C++ library called Flock based on the ideas. Flock allows lock-based data structures to run in either lock-free or blocking (traditional locks) mode. We implemented a variety of tree and list-based data structures with Flock and compare the performance of the lock-free and blocking modes under a variety of workloads. The lock-free mode is almost as fast as blocking mode under almost all workloads, and significantly faster when threads are oversubscribed (more threads than processors). We also compare with several existing lock-based and lock-free alternatives.
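The shared-log idea behind idempotence can be illustrated with a small single-threaded sketch. This is not Flock's API: `SharedLog`, `commit`, and `idempotent_increment` are hypothetical names, and a real implementation would commit log entries with an atomic CAS rather than a plain list append.

```python
class SharedLog:
    """Log shared by every run of the same protected code (hypothetical name)."""

    def __init__(self):
        self.entries = []

    def commit(self, position, value):
        """Store value at position if it is still empty; return what is stored there."""
        if position == len(self.entries):
            self.entries.append(value)  # the first run to reach this position wins
        return self.entries[position]


def idempotent_increment(memory, key, log):
    """Increment memory[key]; re-running with the same log has no further effect."""
    # Idempotent load: later runs reuse the value the first run logged.
    old = log.commit(0, memory[key])
    # The store is computed from the logged value, so repeating it is harmless.
    memory[key] = old + 1
    return old


memory = {"x": 0}
log = SharedLog()
idempotent_increment(memory, "x", log)  # original run
idempotent_increment(memory, "x", log)  # a helper re-runs the same code
print(memory["x"])
```

Both runs agree on the logged value, so the counter ends at 1, as if the code had run once.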

preprint, 2022, arXiv

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections

Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1--7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3--2.6x less space than the state-of-the-art graph processing system Aspen.

preprint, 2022, arXiv

Turning Manual Concurrent Memory Reclamation into Automatic Reference Counting

Safe memory reclamation (SMR) schemes are an essential tool for lock-free data structures and concurrent programming. However, manual SMR schemes are notoriously difficult to apply correctly, and automatic schemes, such as reference counting, have been argued for over a decade to be too slow for practical purposes. A recent wave of work has disproved this long-held notion and shown that reference counting can be as scalable as hazard pointers, one of the most common manual techniques. Despite these tremendous improvements, there remains a gap of up to 2x or more in performance between these schemes and faster manual techniques such as epoch-based reclamation (EBR). In this work, we first advance these ideas and show that in many cases, automatic reference counting can in fact be as fast as the fastest manual SMR techniques. We generalize our previous Concurrent Deferred Reference Counting (CDRC) algorithm to obtain a method for converting any standard manual SMR technique into an automatic reference counting technique with a similar performance profile. Our second contribution is extending this framework to support weak pointers, which are reference-counted pointers that automatically break pointer cycles by not contributing to the reference count, thus addressing a common weakness in reference-counted garbage collection. Our experiments with a C++-library implementation show that our automatic techniques perform in line with their manual counterparts, and that our weak pointer implementation outperforms the best known atomic weak pointer library by up to an order of magnitude on high thread counts. All together, we show that the ease of use of automatic memory management can be achieved without significant cost to practical performance or general applicability.

preprint, 2020, arXiv

Concurrent Fixed-Size Allocation and Free in Constant Time

Our goal is to efficiently solve the dynamic memory allocation problem in a concurrent setting where processes run asynchronously. On $p$ processes, we can support allocation and free for fixed-sized blocks with $O(1)$ worst-case time per operation, $Θ(p^2)$ additive space overhead, and using only single-word read, write, and CAS. While many algorithms rely on having constant-time fixed-size allocate and free, we present the first implementation of these two operations that is constant time with reasonable space overhead.

preprint, 2020, arXiv

Concurrent Reference Counting and Resource Management in Wait-free Constant Time

A common problem when implementing concurrent programs is efficiently protecting against unsafe races between processes reading and then using a resource (e.g., memory blocks, file descriptors, or network connections) and other processes that are concurrently overwriting and then destructing the same resource. Such read-destruct races can be protected with locks, or with lock-free solutions such as hazard-pointers or read-copy-update (RCU). In this paper we describe a method for protecting read-destruct races with expected constant time overhead, $O(P^2)$ space and $O(P^2)$ delayed destructs, and with just single word atomic memory operations (reads, writes, and CAS). It is based on an interface with four primitives: an acquire-release pair to protect accesses, and a retire-eject pair to delay the destruct until it is safe. We refer to this as the acquire-retire interface. Using the acquire-retire interface, we develop simple implementations for three common use cases: (1) memory reclamation with applications to stacks and queues, (2) reference counted objects, and (3) objects managed by ownership with moves, copies, and destructs. The first two results significantly improve on previous results, and the third application is original. Importantly, all operations have expected constant time overhead.
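The roles of the four primitives can be illustrated with a toy, single-threaded model. The class name and method signatures below are assumptions made for illustration, not the paper's interface, and the sketch makes no attempt at the constant-time or wait-free guarantees the abstract describes.

```python
class AcquireRetire:
    """Toy model of acquire/release and retire/eject (hypothetical signatures)."""

    def __init__(self):
        self.protected = {}  # process id -> resource that process has acquired
        self.retired = []    # resources whose destruct has been delayed

    def acquire(self, pid, resource):
        """Protect a resource while process pid is using it."""
        self.protected[pid] = resource
        return resource

    def release(self, pid):
        """End the protection started by acquire."""
        self.protected.pop(pid, None)

    def retire(self, resource):
        """Announce that a resource was overwritten and should be destructed later."""
        self.retired.append(resource)

    def eject(self):
        """Return one retired resource that nobody has acquired, or None."""
        for i, candidate in enumerate(self.retired):
            if not any(candidate is held for held in self.protected.values()):
                return self.retired.pop(i)
        return None


ar = AcquireRetire()
block_a, block_b = "block-A", "block-B"
ar.acquire(1, block_a)  # process 1 is reading block A
ar.retire(block_a)      # another process overwrote A and retires it
ar.retire(block_b)
print(ar.eject())       # only block B is safe to destruct
ar.release(1)
print(ar.eject())       # now block A can be destructed too
```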

preprint, 2020, arXiv

Constant-Time Snapshots with Applications to Concurrent Data Structures

We present an approach for efficiently taking snapshots of the state of a collection of CAS objects. Taking a snapshot allows later operations to read the value that each CAS object had at the time the snapshot was taken. Taking a snapshot requires a constant number of steps and returns a handle to the snapshot. Reading a snapshotted value of an individual CAS object using this handle is wait-free, taking time proportional to the number of successful CASes on the object since the snapshot was taken. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on concurrent data structures built from CAS objects. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to two search trees, one balanced and one not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries.
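One way to picture the snapshot handle is as a timestamp checked against a per-object version list. The sequential sketch below rests on that assumption: `Camera`, `VersionedCAS`, and their methods are illustrative names, and the real construction has to handle concurrent timestamping, which this toy ignores.

```python
class Camera:
    """Global timestamp source; a snapshot handle is just a timestamp."""

    def __init__(self):
        self.timestamp = 0

    def take_snapshot(self):
        handle = self.timestamp
        self.timestamp += 1  # later writes are stamped after this snapshot
        return handle


class VersionedCAS:
    """CAS object that keeps a list of (timestamp, value) versions, newest last."""

    def __init__(self, camera, value):
        self.camera = camera
        self.versions = [(camera.timestamp, value)]

    def read(self):
        return self.versions[-1][1]

    def cas(self, expected, new):
        if self.versions[-1][1] != expected:
            return False
        self.versions.append((self.camera.timestamp, new))
        return True

    def read_snapshot(self, handle):
        # Walk back past versions written after the snapshot was taken.
        for stamp, value in reversed(self.versions):
            if stamp <= handle:
                return value
        raise LookupError("object was created after the snapshot")


camera = Camera()
x = VersionedCAS(camera, 10)
handle = camera.take_snapshot()
x.cas(10, 11)
x.cas(11, 12)
print(x.read(), x.read_snapshot(handle))
```

The walk in `read_snapshot` skips one version per successful CAS since the snapshot, which mirrors the read cost stated in the abstract.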

preprint, 2020, arXiv

Delay-Free Concurrency on Faulty Persistent Memory

Non-volatile memory (NVM) promises persistent main memory that remains correct despite loss of power. This has sparked a line of research into algorithms that can recover from a system crash. Since caches are expected to remain volatile, concurrent data structures and algorithms must be redesigned to guarantee that they are left in a consistent state after a system crash, and that the execution can be continued upon recovery. However, the prospect of redesigning every concurrent data structure or algorithm before it can be used in NVM architectures is daunting. In this paper, we present a construction that takes any concurrent program with reads, writes and CASs to shared memory and makes it persistent, i.e., can be continued after one or more processes fault and have to restart. Importantly the converted algorithm has constant computational delay (preserves instruction counts on each process within a constant factor), as well as constant recovery delay (a process can recover from a fault in a constant number of instructions). We show this first for a simple transformation, and then present optimizations to make it more practical, allowing for a tradeoff for better constant factors in computational delay, for sometimes increased recovery delay. We also provide an optimized transformation that works for any normalized lock-free data structure, thus allowing more efficient constructions for a large class of concurrent algorithms. We experimentally evaluate our transformations by applying them to a queue.

preprint, 2020, arXiv

LL/SC and Atomic Copy: Constant Time, Space Efficient Implementations using only pointer-width CAS

When designing concurrent algorithms, Load-Link/Store-Conditional (LL/SC) is often the ideal primitive to have because unlike Compare and Swap (CAS), LL/SC is immune to the ABA problem. However, the full semantics of LL/SC are not supported by any modern machine, so there has been a significant amount of work on simulations of LL/SC using Compare and Swap (CAS), a synchronization primitive that enjoys widespread hardware support. All of the algorithms so far that are constant time either use unbounded sequence numbers (and thus base objects of unbounded size), or require $Ω(MP)$ space for $M$ LL/SC objects (where $P$ is the number of processes). We present a constant time implementation of $M$ LL/SC objects using $Θ(M+kP^2)$ space, where $k$ is the maximum number of overlapping LL/SC operations per process (usually a constant), and requiring only pointer-sized CAS objects. Our implementation can also be used to implement $L$-word $LL/SC$ objects in $Θ(L)$ time (for both $LL$ and $SC$) and $Θ((M+kP^2)L)$ space. To achieve these bounds, we begin by implementing a new primitive called Single-Writer Copy which takes a pointer to a word sized memory location and atomically copies its contents into another object. The restriction is that only one process is allowed to write/copy into the destination object at a time. We believe this primitive will be very useful in designing other concurrent algorithms as well.

preprint, 2020, arXiv

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an $Ω(\log n)$ overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.
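The work and span measures can be made concrete with a sequential simulation of a binary-forking sum. The cost convention here (one unit per leaf and one unit per fork-join node) is an assumption chosen for illustration, and the function name is hypothetical.

```python
def fork_join_sum(values, lo=0, hi=None):
    """Return (total, work, span) for summing values[lo:hi] using binary forks."""
    if hi is None:
        hi = len(values)
    if hi - lo == 0:
        return 0, 0, 0
    if hi - lo == 1:
        return values[lo], 1, 1  # a leaf costs one instruction
    mid = (lo + hi) // 2
    # The two recursive calls stand in for the two forked child tasks.
    left_sum, left_work, left_span = fork_join_sum(values, lo, mid)
    right_sum, right_work, right_span = fork_join_sum(values, mid, hi)
    total = left_sum + right_sum
    work = left_work + right_work + 1        # work adds up across both children
    span = max(left_span, right_span) + 1    # span follows the longer child only
    return total, work, span


print(fork_join_sum(list(range(8))))
```

For 8 elements this reports a sum of 28 with work 15 and span 4, showing work growing linearly in the input size while span grows logarithmically.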

preprint, 2020, arXiv

Parallel Batch-Dynamic Graph Connectivity

In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The most well known sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves $O(\log^2 n)$ amortized time per edge insertion or deletion, and $O(\log n / \log\log n)$ time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where $Δ$ is the average batch size of all deletions, our algorithm achieves $O(\log n \log(1 + n / Δ))$ expected amortized work per edge insertion and deletion and $O(\log^3 n)$ depth w.h.p. Our algorithm answers a batch of $k$ connectivity queries in $O(k \log(1 + n/k))$ expected work and $O(\log n)$ depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.

preprint, 2020, arXiv

Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

Non-volatile main memory (NVRAM) technologies provide an attractive set of features for large-scale graph analytics, including byte-addressability, low idle power, and improved memory-density. NVRAM systems today have an order of magnitude more NVRAM than traditional memory (DRAM). NVRAM systems could therefore potentially allow very large graph problems to be solved on a single machine, at a modest cost. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be much more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics using the Parallel Semi-Asymmetric Model (PSAM), in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the PSAM approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. To experimentally study this new setting, we develop Sage, a parallel semi-asymmetric graph engine with which we implement provably-efficient (and often work-optimal) PSAM algorithms for over a dozen fundamental graph problems. We experimentally study Sage using a 48-core machine on the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) equipped with Optane DC Persistent Memory, and show that Sage outperforms the fastest prior systems designed for NVRAM. Importantly, we also show that Sage nearly matches the fastest prior systems running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.

preprint, 2016, arXiv

Efficient Algorithms with Asymmetric Read and Write Costs

In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but $O(1)$ memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the $(M,ω)$-ARAM, in which there is a small (symmetric) memory of size $M$ and a large unbounded (asymmetric) memory, both random access, and where reading from the large memory has unit cost, but writing has cost $ω\gg 1$. For FFT and sorting networks we show a lower bound cost of $Ω(ωn\log_{ωM} n)$, which indicates that it is not possible to achieve asymptotic improvements with cheaper reads when $ω$ is bounded by a polynomial in $M$. Also, there is an asymptotic gap (of $\min(ω,\log n)/\log(ωM)$) between the cost of sorting networks and comparison sorting in the model. This contrasts with the RAM, and most other models. We also show a lower bound for computations on an $n\times n$ diamond DAG of $Ω(ωn^2/M)$ cost, which indicates no asymptotic improvement is achievable with fast reads. However, we show that for the edit distance problem (and related problems), which would seem to be a diamond DAG, there exists an algorithm with only $O(ωn^2/(M\min(ω^{1/3},M^{1/2})))$ cost. To achieve this we make use of a "path sketch" technique that is forbidden in a strict DAG computation. Finally, we show several interesting upper bounds for shortest path problems, minimum spanning trees, and other problems. A common theme in many of the upper bounds is to have redundant computation to tradeoff between reads and writes.
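The cost accounting in this model can be sketched with a small wrapper that charges 1 per read and $ω$ per write to the large memory. The code below is an illustration only, not an algorithm from the paper: `AsymmetricArray` is a hypothetical helper, and selection sort is used simply because it spends many reads to keep the number of writes linear, with its loop variables playing the role of the $O(1)$ symmetric memory.

```python
class AsymmetricArray:
    """Array in the large memory: reads cost 1, writes cost omega."""

    def __init__(self, data, omega):
        self.data = list(data)
        self.omega = omega
        self.reads = 0
        self.writes = 0

    def read(self, i):
        self.reads += 1
        return self.data[i]

    def write(self, i, value):
        self.writes += 1
        self.data[i] = value

    def cost(self):
        return self.reads + self.omega * self.writes


def selection_sort(arr):
    """Sort in place with at most 2(n-1) writes and a quadratic number of reads."""
    n = len(arr.data)
    for i in range(n - 1):
        first_val = arr.read(i)
        best, best_val = i, first_val
        for j in range(i + 1, n):
            v = arr.read(j)
            if v < best_val:
                best, best_val = j, v
        if best != i:  # write only when a swap is actually needed
            arr.write(best, first_val)
            arr.write(i, best_val)


arr = AsymmetricArray([4, 3, 2, 1], omega=10)
selection_sort(arr)
print(arr.data, arr.reads, arr.writes, arr.cost())
```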

preprint, 2016, arXiv

Parallel Shortest-Paths Using Radius Stepping

The single-source shortest path problem (SSSP) with nonnegative edge weights is a notoriously difficult problem to solve efficiently in parallel; it is one of the graph problems said to suffer from the transitive-closure bottleneck. In practice, the $Δ$-stepping algorithm of Meyer and Sanders (J. Algorithms, 2003) often works efficiently but has no known theoretical bounds on general graphs. The algorithm takes a sequence of steps, each increasing the radius by a user-specified value $Δ$. Each step settles the vertices in its annulus but can take $Θ(n)$ substeps, each requiring $Θ(m)$ work ($n$ vertices and $m$ edges). In this paper, we describe Radius-Stepping, an algorithm with the best-known tradeoff between work and depth bounds for SSSP with nearly-linear ($\tilde{O}(m)$) work. The algorithm is a $Δ$-stepping-like algorithm but uses a variable instead of fixed-size increase in radii, allowing us to prove a bound on the number of steps. In particular, by using what we define as a vertex $k$-radius, each step takes at most $k+2$ substeps. Furthermore, we define a $(k, ρ)$-graph property and show that if an undirected graph has this property, then the number of steps can be bounded by $O(\frac{n}ρ \log ρL)$, for a total of $O(\frac{kn}ρ \log ρL)$ substeps, each parallel. We describe how to preprocess a graph to have this property. Altogether, Radius-Stepping takes $O((m+n\log n)\log \frac{n}ρ)$ work and $O(\frac{n}ρ\log n \log (ρL))$ depth per source after preprocessing. The preprocessing step can be done in $O(m\log n + nρ^2)$ work and $O(ρ^2)$ depth or in $O(m\log n + nρ^2\log n)$ work and $O(ρ\log ρ)$ depth, and adds no more than $O(nρ)$ edges.
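The step and substep structure of a fixed-increment scheme can be sketched as a sequential simulation. This is a simplified rendering of the $Δ$-stepping description in the abstract, not Radius-Stepping itself and not the bucket-based original: the function name and the adjacency-list format are assumptions, and the relaxations that would run in parallel are performed one after another here.

```python
import math


def delta_stepping(adj, source, delta):
    """Return (dist, steps) for a graph given as vertex -> [(neighbor, weight)]."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    settled = set()
    radius = 0.0
    steps = 0
    # Continue while some reached vertex is still unsettled.
    while any(v not in settled and dist[v] < math.inf for v in adj):
        radius += delta  # each step grows the radius by the fixed value delta
        steps += 1
        changed = True
        while changed:   # substeps: relax until the annulus stops changing
            changed = False
            active = [v for v in adj if v not in settled and dist[v] <= radius]
            for u in active:
                for v, w in adj[u]:
                    if dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
                        changed = True
        settled.update(v for v in adj if dist[v] <= radius)
    return dist, steps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
print(delta_stepping(graph, "a", delta=2))
```

Radius-Stepping replaces the fixed `radius += delta` with a per-step increase derived from vertex $k$-radii, which is what allows the number of substeps per step to be bounded.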

preprint, 2016, arXiv

Sorting with Asymmetric Read and Write Costs

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in $O\left(n\right)$ writes, $O\left(n \log n\right)$ reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges $ω> 1$ for writing a block of size $B$ to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly $ω$ block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield parallel cache complexity bounds for machines with private caches or with a shared cache, respectively.

preprint, 2011, arXiv

Near Linear-Work Parallel SDD Solvers, Low-Diameter Decomposition, and Low-Stretch Subgraphs

We present the design and analysis of a near linear-work parallel algorithm for solving symmetric diagonally dominant (SDD) linear systems. On input of an SDD $n$-by-$n$ matrix $A$ with $m$ non-zero entries and a vector $b$, our algorithm computes a vector $\tilde{x}$ such that $\|\tilde{x} - A^+b\|_A \leq ε \cdot \|A^+b\|_A$ in $O(m\log^{O(1)}{n}\log{\frac1ε})$ work and $O(m^{1/3+θ}\log \frac1ε)$ depth for any fixed $θ> 0$. The algorithm relies on a parallel algorithm for generating low-stretch spanning trees or spanning subgraphs. To this end, we first develop a parallel decomposition algorithm that in polylogarithmic depth and $\tilde{O}(|E|)$ work, partitions a graph into components with polylogarithmic diameter such that only a small fraction of the original edges are between the components. This can be used to generate low-stretch spanning trees with average stretch $O(n^α)$ in $O(n^{1+α})$ work and $O(n^α)$ depth. Alternatively, it can be used to generate spanning subgraphs with polylogarithmic average stretch in $\tilde{O}(|E|)$ work and polylogarithmic depth. We apply this subgraph construction to derive a parallel linear system solver. By using this solver in known applications, our results imply improved parallel randomized algorithms for several problems, including single-source shortest paths, maximum flow, minimum-cost flow, and approximate maximum flow.

preprint, 2011, arXiv

Selective Memoization

This paper presents language techniques for applying memoization selectively. The techniques provide programmer control over equality, space usage, and identification of precise dependences so that memoization can be applied according to the needs of an application. Two key properties of the approach are that it accepts an efficient implementation and yields programs whose performance can be analyzed using standard analysis techniques. We describe our approach in the context of a functional language called MFL and an implementation as a Standard ML library. The MFL language employs a modal type system to enable the programmer to express programs that reveal their true data dependences when executed. We prove that the MFL language is sound by showing that MFL programs yield the same result as they would with respect to a standard, non-memoizing semantics. The SML implementation cannot support the modal type system of MFL statically but instead employs run-time checks to ensure correct usage of primitives.
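The idea of memoizing on precise dependences rather than on the full argument list can be approximated by hand. The sketch below is neither MFL nor the SML library: `memoize_on` and `branch_key` are hypothetical names, and the programmer supplies the dependence key manually, with nothing checking that the key is correct.

```python
def memoize_on(key):
    """Cache results under key(*args) instead of under the full arguments."""
    def decorate(fn):
        cache = {}

        def wrapper(*args):
            k = key(*args)
            if k not in cache:
                cache[k] = fn(*args)
            return cache[k]

        wrapper.cache = cache
        return wrapper
    return decorate


calls = []


def branch_key(x, y, z):
    # The result depends only on the sign test of x and on the branch input it selects.
    return ("pos", y) if x > 0 else ("nonpos", z)


@memoize_on(branch_key)
def f(x, y, z):
    calls.append((x, y, z))  # record real evaluations
    return y * y if x > 0 else z + 1


print(f(1, 3, 100))  # evaluated
print(f(2, 3, -5))   # reuses the first result: same branch, same y
print(f(0, 3, 7))    # evaluated: the other branch depends on z
print(len(calls))
```

An ordinary memo table keyed on `(x, y, z)` would miss on the second call, because `x` and `z` differ even though neither value affects the result.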

preprint, 2010, arXiv

Parallel Approximation Algorithms for Facility-Location Problems

This paper presents the design and analysis of parallel approximation algorithms for facility-location problems, including NC and RNC algorithms for (metric) facility location, $k$-center, $k$-median, and $k$-means. These problems have received considerable attention during the past decades from the approximation algorithms community, concentrating primarily on improving the approximation guarantees. In this paper, we ask, is it possible to parallelize some of the beautiful results from the sequential setting? Our starting point is a small, but diverse, subset of results in approximation algorithms for facility-location problems, with a primary goal of developing techniques for devising their efficient parallel counterparts. We focus on giving algorithms with low depth, near work efficiency (compared to the sequential versions), and low cache complexity. Common in algorithms we present is the idea that instead of picking only the most cost-effective element, we make room for parallelism by allowing a small slack (e.g., a $(1+ε)$ factor) in what can be selected; then, we use a clean-up step to ensure that the behavior does not deviate too much from the sequential steps. All the algorithms we developed are "cache efficient" in that the cache complexity is bounded by $O(w/B)$, where $w$ is the work in the EREW model and $B$ is the block size.
