Source author record

Julian Shun

Julian Shun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Databases; Machine Learning; Performance; Programming Languages; Computational Complexity; Computational Geometry; cond-mat.dis-nn; math.PR; Methodology

Catalog footprint

What is connected

20 works

11 topics

4 close collaborators


preprint, 2024, arXiv

Kairos: Efficient Temporal Graph Analytics on a Single Machine

Many important societal problems are naturally modeled as algorithms over temporal graphs. To date, however, most graph processing systems remain inefficient as they rely on distributed processing even for graphs that fit well within a commodity server's available storage. In this paper, we introduce Kairos, a temporal graph analytics system that provides application developers a framework for efficiently implementing and executing algorithms over temporal graphs on a single machine. Specifically, Kairos relies on fork-join parallelism and a highly optimized parallel data structure as core primitives to maximize performance of graph processing tasks needed for temporal graph analytics. Furthermore, we introduce the notion of selective indexing and show how it can be used with an efficient index to speed up temporal queries. Our experiments on a 24-core server show that our algorithms obtain good parallel speedups, and are significantly faster than equivalent algorithms in existing temporal graph processing systems: up to 60x against a shared-memory approach, and several orders of magnitude when compared with distributed processing of graphs that fit within a single server.

preprint, 2022, arXiv

Parallel Batch-Dynamic Minimum Spanning Forest and the Efficiency of Dynamic Agglomerative Graph Clustering

Hierarchical agglomerative clustering (HAC) is a popular algorithm for clustering data, but despite its importance, no dynamic algorithms for HAC with good theoretical guarantees exist. In this paper, we study dynamic HAC on edge-weighted graphs. As single-linkage HAC reduces to computing a minimum spanning forest (MSF), our first result is a parallel batch-dynamic algorithm for maintaining MSFs. On a batch of $k$ edge insertions or deletions, our batch-dynamic MSF algorithm runs in $O(k\log^6 n)$ expected amortized work and $O(\log^4 n)$ span with high probability. It is the first fully dynamic MSF algorithm handling batches of edge updates with polylogarithmic work per update and polylogarithmic span. Using our MSF algorithm, we obtain a parallel batch-dynamic algorithm that can answer queries about single-linkage graph HAC clusters. Our second result is that dynamic graph HAC is significantly harder for other common linkage functions. For example, assuming the strong exponential time hypothesis, dynamic graph HAC requires $Ω(n^{1-o(1)})$ work per update or query on a graph with $n$ vertices for complete linkage, weighted average linkage, and average linkage. For complete linkage and weighted average linkage, the bound still holds even for incremental or decremental algorithms and even if we allow $\operatorname{poly}(n)$-approximation. For average linkage, the bound weakens to $Ω(n^{1/2 - o(1)})$ for incremental and decremental algorithms, and the bounds still hold when allowing $n^{o(1)}$-approximation.
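The reduction from single-linkage HAC to a minimum spanning forest mentioned above can be illustrated with a small static, sequential sketch. This is not the paper's batch-dynamic algorithm: it uses Kruskal's algorithm with a union-find structure, and it assumes edge weights are dissimilarities (smaller means closer). The function names `minimum_spanning_forest` and `single_linkage_clusters` are hypothetical, chosen only for this illustration.

```python
def _find(parent, x):
    """Return the root of x's set, shortening the path as we go (path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm on vertices 0..n-1.

    edges is a list of (weight, u, v) tuples. Returns the forest edges in
    nondecreasing weight order, which is also the single-linkage merge order.
    """
    parent = list(range(n))
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:  # edge joins two different clusters, so keep it
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def single_linkage_clusters(n, edges, threshold):
    """Clusters obtained by cutting the single-linkage dendrogram at `threshold`.

    These are the connected components of the forest edges whose weight is
    at most `threshold`.
    """
    parent = list(range(n))
    for w, u, v in minimum_spanning_forest(n, edges):
        if w <= threshold:
            parent[_find(parent, u)] = _find(parent, v)
    groups = {}
    for x in range(n):
        groups.setdefault(_find(parent, x), []).append(x)
    return sorted(groups.values())
```

Because only forest edges decide the merges, a query about clusters at a given threshold needs the forest rather than the full graph, which is why maintaining the forest under updates is enough for single linkage.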

preprint, 2022, arXiv

ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8--110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75--54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.

preprint, 2022, arXiv

ParGeo: A Library for Parallel Computational Geometry

This paper presents ParGeo, a multicore library for computational geometry. ParGeo contains modules for fundamental tasks including $k$d-tree based spatial search, spatial graph generation, and algorithms in computational geometry. We focus on three new algorithmic contributions provided in the library. First, we present a new parallel convex hull algorithm based on a reservation technique to enable parallel modifications to the hull. We also provide the first parallel implementations of the randomized incremental convex hull algorithm as well as a divide-and-conquer convex hull algorithm in $\mathbb{R}^3$. Second, for the smallest enclosing ball problem, we propose a new sampling-based algorithm to quickly reduce the size of the data set. We also provide the first parallel implementation of Welzl's classic algorithm for smallest enclosing ball. Third, we present the BDL-tree, a parallel batch-dynamic $k$d-tree that allows for efficient parallel updates and $k$-NN queries over dynamically changing point sets. BDL-trees consist of a log-structured set of $k$d-trees which can be used to efficiently insert, delete, and query batches of points in parallel. On 36 cores with two-way hyper-threading, our fastest convex hull algorithm achieves up to 44.7x self-relative parallel speedup and up to 559x speedup against the best existing sequential implementation. Our smallest enclosing ball algorithm using our sampling-based algorithm achieves up to 27.1x self-relative parallel speedup and up to 178x speedup against the best existing sequential implementation. Our implementation of the BDL-tree achieves self-relative parallel speedup of up to 46.1x. Across all of the algorithms in ParGeo, we achieve self-relative parallel speedup of 8.1--46.61x.

preprint, 2022, arXiv

Theoretically and Practically Efficient Parallel Nucleus Decomposition

This paper studies the nucleus decomposition problem, which has been shown to be useful in finding dense substructures in graphs. We present a novel parallel algorithm that is efficient both in theory and in practice. Our algorithm achieves a work complexity matching the best sequential algorithm while also having low depth (parallel running time), which significantly improves upon the only existing parallel nucleus decomposition algorithm (Sariyuce et al., PVLDB 2018). The key to the theoretical efficiency of our algorithm is the use of theoretically-efficient parallel algorithms for clique listing and bucketing. We introduce several new practical optimizations, including a new multi-level hash table structure to store information on cliques space-efficiently and a technique for traversing this structure cache-efficiently. On a 30-core machine with two-way hyper-threading on real-world graphs, we achieve up to a 55x speedup over the state-of-the-art parallel nucleus decomposition algorithm by Sariyuce et al., and up to a 40x self-relative parallel speedup. We are able to efficiently compute larger nucleus decompositions than prior work on several million-scale graphs for the first time.

preprint, 2021, arXiv

Compilation Techniques for Graph Algorithms on GPUs

The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations and hardware platforms to use. The GraphIt programming language makes it easy for the programmer to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high performance code for GPUs. Programmers must resort to re-implementing the entire algorithm from scratch in a low-level language with an entirely different set of abstractions and optimizations in order to achieve high performance on GPUs. We propose GG, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. GG significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining graph optimizations for GPUs. GG also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balancing by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate GG on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11x speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments.

preprint, 2021, arXiv

Parallel In-Place Algorithms: Theory and Practice

Many parallel algorithms use at least linear auxiliary space in the size of the input to enable computations to be done independently without conflicts. Unfortunately, this extra space can be prohibitive for memory-limited machines, preventing large inputs from being processed. Therefore, it is desirable to design parallel in-place algorithms that use sublinear (or even polylogarithmic) auxiliary space. In this paper, we bridge the gap between theory and practice for parallel in-place (PIP) algorithms. We first define two computational models based on fork-join parallelism, which reflect modern parallel programming environments. We then introduce a variety of new parallel in-place algorithms that are simple and efficient, both in theory and in practice. Our algorithmic highlight is the Decomposable Property introduced in this paper, which enables existing non-in-place but highly-optimized parallel algorithms to be converted into parallel in-place algorithms. Using this property, we obtain algorithms for random permutation, list contraction, tree contraction, and merging that take linear work, $O(n^{1-ε})$ auxiliary space, and $O(n^ε\cdot\text{polylog}(n))$ span for $0<ε<1$. We also present new parallel in-place algorithms for scan, filter, merge, connectivity, biconnectivity, and minimum spanning forest using other techniques. In addition to theoretical results, we present experimental results for implementations of many of our parallel in-place algorithms. We show that on a 72-core machine with two-way hyper-threading, the parallel in-place algorithms usually outperform existing parallel algorithms for the same problems that use linear auxiliary space, indicating that the theory developed in this paper indeed leads to practical benefits in terms of both space usage and running time.

preprint, 2021, arXiv

Theoretically-Efficient and Practical Parallel DBSCAN

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms.

preprint, 2020, arXiv

Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks

Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact, recent work has shown that distributed databases can scale even when the majority of transactions are cross-partition. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions. Finally, we evaluate Chiller using various workloads, and show that our partitioning and execution strategy outperforms traditional partitioning techniques which try to avoid distributed transactions, by up to a factor of 2.

preprint, 2020, arXiv

Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs

Connected components and spanning forest are fundamental graph algorithms due to their use in many important applications, such as graph clustering and image segmentation. GPUs are an ideal platform for graph algorithms due to their high peak performance and memory bandwidth. While there exist several GPU connectivity algorithms in the literature, many design choices have not yet been explored. In this paper, we explore various design choices in GPU connectivity algorithms, including sampling, linking, and tree compression, for both the static as well as the incremental setting. Our various design choices lead to over 300 new GPU implementations of connectivity, many of which outperform state-of-the-art. We present an experimental evaluation, and show that we achieve an average speedup of 2.47x over existing static algorithms. In the incremental setting, we achieve a throughput of up to 48.23 billion edges per second. Compared to state-of-the-art CPU implementations on a 72-core machine, we achieve a speedup of 8.26--14.51x for static connectivity and 1.85--13.36x for incremental connectivity using a Tesla V100 GPU.

preprint, 2020, arXiv

Improved Parallel Construction of Wavelet Trees and Rank/Select Structures

Existing parallel algorithms for wavelet tree construction have a work complexity of $O(n\logσ)$. This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based on parallel integer sorting and has either $O(n\log\log n\lceil\logσ/\sqrt{\log n\log\log n}\rceil)$ work and polylogarithmic depth, or $O(n\lceil\logσ/\sqrt{\log n}\rceil)$ work and sub-linear depth. We also describe another algorithm that has $O(n\lceil\logσ/\sqrt{\log n} \rceil)$ work and $O(σ+\log n)$ depth. We then show how to use similar ideas to construct variants of wavelet trees (arbitrary-shaped binary trees and multiary trees) as well as wavelet matrices in parallel with lower work complexity than prior algorithms. Finally, we show that the rank and select structures on binary sequences and multiary sequences, which are stored on wavelet tree nodes, can be constructed in parallel with improved work bounds, matching those of the best existing sequential algorithms for constructing rank and select structures.
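For readers unfamiliar with the structure, the following is a minimal sequential sketch of a wavelet tree over an integer alphabet with a `rank` query. It is the straightforward level-by-level construction with $O(n\logσ)$ work, not one of the improved parallel algorithms from the paper. The class name `WaveletTree` and its layout (a Python list of prefix counts per node instead of a succinct bitvector) are assumptions made for illustration.

```python
class WaveletTree:
    """Wavelet tree over a non-empty sequence of integers.

    Each internal node splits its alphabet range [lo, hi] at mid and stores,
    for every prefix of its subsequence, how many symbols went to the left
    child (a plain-list stand-in for a rank-enabled bitvector).
    """

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        self.prefix_left = [0]
        if lo == hi or not seq:
            return  # leaf, or an empty node that is never descended into
        mid = (lo + hi) // 2
        for x in seq:
            self.prefix_left.append(self.prefix_left[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` among the first i elements."""
        if i == 0 or symbol < self.lo or symbol > self.hi:
            return 0
        if self.lo == self.hi:
            return i  # every element at this leaf equals `symbol`
        mid = (self.lo + self.hi) // 2
        went_left = self.prefix_left[i]
        if symbol <= mid:
            return self.left.rank(symbol, went_left)
        return self.right.rank(symbol, i - went_left)
```

Each of the roughly $\logσ$ levels touches every element once, which is where the $O(n\logσ)$ work of the baseline construction comes from.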

preprint, 2020, arXiv

Optimizing Ordered Graph Algorithms with GraphIt

Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, to simplify writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in $1.2\times$--$3\times$ speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. With the extension, GraphIt achieves up to $3\times$ speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.

preprint, 2020, arXiv

Parallel Algorithms for Butterfly Computations

Butterflies are the smallest non-trivial subgraph in bipartite graphs, and therefore having efficient computations for analyzing them is crucial to improving the quality of certain applications on bipartite graphs. In this paper, we design a framework called ParButterfly that contains new parallel algorithms for the following problems on processing butterflies: global counting, per-vertex counting, per-edge counting, tip decomposition (vertex peeling), and wing decomposition (edge peeling). The main component of these algorithms is aggregating wedges incident on subsets of vertices, and our framework supports different methods for wedge aggregation, including sorting, hashing, histogramming, and batching. In addition, ParButterfly supports different ways of ranking the vertices to speed up counting, including side ordering, approximate and exact degree ordering, and approximate and exact complement coreness ordering. For counting, ParButterfly also supports both exact computation as well as approximate computation via graph sparsification. We prove strong theoretical guarantees on the work and span of the algorithms in ParButterfly. We perform a comprehensive evaluation of all of the algorithms in ParButterfly on a collection of real-world bipartite graphs using a 48-core machine. Our counting algorithms obtain significant parallel speedup, outperforming the fastest sequential algorithms by up to 13.6x with a self-relative speedup of up to 38.5x. Compared to general subgraph counting solutions, we are orders of magnitude faster. Our peeling algorithms achieve self-relative speedups of up to 10.7x and outperform the fastest sequential baseline by up to several orders of magnitude.
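The idea of counting butterflies by aggregating wedges can be shown with a short sequential sketch. It groups wedges by their pair of endpoints using a hash-based counter, which loosely corresponds to the hashing method of wedge aggregation named above. It is not ParButterfly's implementation, applies no vertex ranking, and the function name `count_butterflies` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_butterflies(edges):
    """Count butterflies (2x2 bicliques) in a bipartite graph.

    edges is an iterable of (left_vertex, right_vertex) pairs; the two sides
    are assumed to use separate identifiers. Duplicate edges are ignored.
    """
    # For each right vertex (wedge center), collect its left neighbors.
    left_neighbors = {}
    for u, v in set(edges):
        left_neighbors.setdefault(v, []).append(u)

    # A wedge is a pair of left endpoints sharing one center.
    # Aggregate wedges by endpoint pair.
    wedges = Counter()
    for endpoints in left_neighbors.values():
        for pair in combinations(sorted(endpoints), 2):
            wedges[pair] += 1

    # Any two wedges with the same endpoints form exactly one butterfly.
    return sum(c * (c - 1) // 2 for c in wedges.values())
```

The work is proportional to the number of wedges enumerated, which is why the choice of vertex ordering matters for the counting algorithms described in the abstract.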

preprint, 2020, arXiv

Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

Non-volatile main memory (NVRAM) technologies provide an attractive set of features for large-scale graph analytics, including byte-addressability, low idle power, and improved memory-density. NVRAM systems today have an order of magnitude more NVRAM than traditional memory (DRAM). NVRAM systems could therefore potentially allow very large graph problems to be solved on a single machine, at a modest cost. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be much more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics using the Parallel Semi-Asymmetric Model (PSAM), in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the PSAM approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. To experimentally study this new setting, we develop Sage, a parallel semi-asymmetric graph engine with which we implement provably-efficient (and often work-optimal) PSAM algorithms for over a dozen fundamental graph problems. We experimentally study Sage using a 48-core machine on the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) equipped with Optane DC Persistent Memory, and show that Sage outperforms the fastest prior systems designed for NVRAM. Importantly, we also show that Sage nearly matches the fastest prior systems running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.

preprint, 2016, arXiv

Efficient Algorithms with Asymmetric Read and Write Costs

In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but $O(1)$ memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the $(M,ω)$-ARAM, in which there is a small (symmetric) memory of size $M$ and a large unbounded (asymmetric) memory, both random access, and where reading from the large memory has unit cost, but writing has cost $ω\gg 1$. For FFT and sorting networks we show a lower bound cost of $Ω(ωn\log_{ωM} n)$, which indicates that it is not possible to achieve asymptotic improvements with cheaper reads when $ω$ is bounded by a polynomial in $M$. Also, there is an asymptotic gap (of $\min(ω,\log n)/\log(ωM)$) between the cost of sorting networks and comparison sorting in the model. This contrasts with the RAM, and most other models. We also show a lower bound for computations on an $n\times n$ diamond DAG of $Ω(ωn^2/M)$ cost, which indicates no asymptotic improvement is achievable with fast reads. However, we show that for the edit distance problem (and related problems), which would seem to be a diamond DAG, there exists an algorithm with only $O(ωn^2/(M\min(ω^{1/3},M^{1/2})))$ cost. To achieve this we make use of a "path sketch" technique that is forbidden in a strict DAG computation. Finally, we show several interesting upper bounds for shortest path problems, minimum spanning trees, and other problems. A common theme in many of the upper bounds is to have redundant computation to tradeoff between reads and writes.

preprint, 2016, arXiv

Sorting with Asymmetric Read and Write Costs

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in $O\left(n\right)$ writes, $O\left(n \log n\right)$ reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges $ω> 1$ for writing a block of size $B$ to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly $ω$ block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield parallel cache complexity bounds for machines with private caches or with a shared cache, respectively.

preprint, 2015, arXiv

Efficient Implementation of a Synchronous Parallel Push-Relabel Algorithm

Motivated by the observation that FIFO-based push-relabel algorithms are able to outperform highest label-based variants on modern, large maximum flow problem instances, we introduce an efficient implementation of the algorithm that uses coarse-grained parallelism to avoid the problems of existing parallel approaches. We demonstrate good relative and absolute speedups of our algorithm on a set of large graph instances taken from real-world applications. On a modern 40-core machine, our parallel implementation outperforms existing sequential implementations by up to a factor of 12 and other parallel implementations by factors of up to 3.

preprint, 2015, arXiv

Parallel Wavelet Tree Construction

We present parallel algorithms for wavelet tree construction with polylogarithmic depth, improving upon the linear depth of the recent parallel algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core machine with two-way hyper-threading that we outperform the existing parallel algorithms by 1.3--5.6x and achieve up to 27x speedup over the sequential algorithm on a variety of real-world and artificial inputs. Our algorithms show good scalability with increasing thread count, input size and alphabet size. We also discuss extensions to variants of the standard wavelet tree.

preprint, 2012, arXiv

Greedy Sequential Maximal Independent Set and Matching are Parallel on Average

The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in arbitrary order adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate will only depend directly on a subset of the previous iterates (i.e., knowing that any one of a vertex's neighbors is in the MIS or knowing that it has no previous neighbors is sufficient to decide its fate). This leads to a dependence structure among the iterates. If this structure is shallow then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence depth of the sequential greedy MIS algorithm is polylogarithmic ($O(\log^2 n)$ with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for a greedy maximal matching (MM). For both problems we describe simple linear work parallel algorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism.
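The dependence structure described above can be made concrete with a small sketch. The first function is the sequential greedy loop. The second simulates, in plain Python, a round-based execution in which every undecided vertex with no undecided earlier neighbor joins the set in the same round. It returns the same set as the sequential loop, along with the number of rounds used. This is an illustrative simulation, not the paper's linear-work algorithm, and the function names are hypothetical.

```python
import random


def greedy_mis_sequential(adj, order):
    """Sequential greedy MIS: add v iff no earlier neighbor was added."""
    in_mis = set()
    for v in order:
        if not any(u in in_mis for u in adj[v]):
            in_mis.add(v)
    return in_mis


def greedy_mis_rounds(adj, order):
    """Round-based simulation that respects the same dependencies.

    adj maps each vertex to an iterable of neighbors (symmetric).
    Returns (mis, rounds).
    """
    rank = {v: i for i, v in enumerate(order)}
    undecided = set(order)
    in_mis = set()
    rounds = 0
    while undecided:
        rounds += 1
        # Vertices whose earlier neighbors are all decided. None of those
        # neighbors is in the MIS, or the vertex would already be removed.
        roots = {
            v for v in undecided
            if all(u not in undecided or rank[u] > rank[v] for u in adj[v])
        }
        in_mis |= roots
        removed = set(roots)
        for v in roots:
            removed.update(u for u in adj[v] if u in undecided)
        undecided -= removed
    return in_mis, rounds


def random_order(adj, seed=None):
    """A uniformly random vertex ordering (the setting analyzed above)."""
    order = list(adj)
    random.Random(seed).shuffle(order)
    return order
```

On a path `0-1-2-3` processed in the order `[0, 1, 2, 3]`, the sequential loop picks `{0, 2}`, and the round-based version reaches the same set in two rounds.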

preprint, 2011, arXiv

Connected Spatial Networks over Random Points and a Route-Length Statistic

We review mathematically tractable models for connected networks on random points in the plane, emphasizing the class of proximity graphs which deserves to be better known to applied probabilists and statisticians. We introduce and motivate a particular statistic $R$ measuring shortness of routes in a network. We illustrate, via Monte Carlo in part, the trade-off between normalized network length and $R$ in a one-parameter family of proximity graphs. How close this family comes to the optimal trade-off over all possible networks remains an intriguing open question. The paper is a write-up of a talk developed by the first author during 2007--2009.
