Source author record

Peter Sanders

Peter Sanders appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Neural and Evolutionary Computing; Social and Information Networks; Information Retrieval; Logic in Computer Science; Cell Behavior; Computer Vision; Machine Learning; math.CO; math.OC; Performance; physics.soc-ph; Quantitative Methods; Systems and Control

Catalog footprint

What is connected

54 works

15 topics

4 close collaborators

preprint, 2022, arXiv

Fast Succinct Retrieval and Approximate Membership using Ribbon

A retrieval data structure for a static function $f:S\rightarrow \{0,1\}^r$ supports queries that return $f(x)$ for any $x \in S$. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate $2^{-r}$. The information-theoretic lower bound for both tasks is $r|S|$ bits. While succinct theoretical constructions using $(1+o(1))r|S|$ bits were known, these could not achieve very small overheads in practice because they have an unfavorable space--time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below 1\,\% while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead $\geq 44\,\%$). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
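
The 44 % overhead quoted for classical Bloom filters can be checked with a short calculation. The sketch below assumes the standard textbook sizing of a Bloom filter with an optimal number of hash functions, $m = n \log_2(1/\varepsilon) / \ln 2$ bits, and compares it with the lower bound $r|S|$ from the abstract. The function names are illustrative only and do not come from the paper.

```python
import math

def lower_bound_bits(n_keys: int, r: int) -> int:
    """Information-theoretic lower bound r * |S| stated in the abstract."""
    return r * n_keys

def bloom_bits(n_keys: int, r: int) -> float:
    """Bits used by a classical Bloom filter tuned for false positive rate 2**-r.

    Assumes the textbook formula m = n * log2(1/eps) / ln(2) with eps = 2**-r.
    """
    return n_keys * r / math.log(2)

def overhead(bits_used: float, n_keys: int, r: int) -> float:
    """Relative space overhead over the lower bound (0.01 means 1 %)."""
    return bits_used / lower_bound_bits(n_keys, r) - 1.0

n, r = 1_000_000, 8
bloom_overhead = overhead(bloom_bits(n, r), n, r)
print(f"Bloom filter overhead: {bloom_overhead:.1%}")
```

The overhead is independent of `n` and `r` under this model, because the factor $1/\ln 2 \approx 1.44$ multiplies the lower bound directly.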

preprint, 2022, arXiv

More Recent Advances in (Hyper)Graph Partitioning

In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.

preprint, 2022, arXiv

Parallel Flow-Based Hypergraph Partitioning

We present a shared-memory parallelization of flow-based refinement, which is considered the most powerful iterative improvement technique for hypergraph partitioning at the moment. Flow-based refinement works on bipartitions, so current sequential partitioners schedule it on different block pairs to improve $k$-way partitions. We investigate two different sources of parallelism: a parallel scheduling scheme and a parallel maximum flow algorithm based on the well-known push-relabel algorithm. In addition to thoroughly engineered implementations, we propose several optimizations that substantially accelerate the algorithm in practice, enabling the use on extremely large hypergraphs (up to 1 billion pins). We integrate our approach in the state-of-the-art parallel multilevel framework Mt-KaHyPar and conduct extensive experiments on a benchmark set of more than 500 real-world hypergraphs, to show that the partition quality of our code is on par with the highest quality sequential code (KaHyPar), while being an order of magnitude faster with 10 threads.

preprint, 2022, arXiv

Scalable SAT Solving in the Cloud

Previous efforts on making Satisfiability (SAT) solving fit for high performance computing (HPC) have led to super-linear speedups on particular formulae, but for most inputs cannot make efficient use of a large number of processors. Moreover, long latencies (minutes to days) of job scheduling make large-scale SAT solving on demand impractical for most applications. We address both issues with Mallob, a framework for job scheduling in the context of SAT solving which exploits malleability, i.e., the ability to add or remove processing power from a job during its computation. Mallob includes a massively parallel, distributed, and malleable SAT solving engine based on Hordesat with a more succinct and communication-efficient approach to clause sharing and numerous further improvements over its precursor. For example, Mallob on 640 cores outperforms an updated and improved configuration of Hordesat on 2560 cores. Moreover, Mallob can also solve many formulae in parallel while dynamically adapting the assigned resources, and jobs arriving in the system are usually initiated within a fraction of a second.

preprint, 2022, arXiv

Vectorized and performance-portable Quicksort

Recent works showed that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, implemented for a particular instruction set, and restricted to a small set of key types. We lift these three restrictions: our proposed 'vqsort' algorithm integrates into the state-of-the-art parallel sorter 'ips4o', with a geometric mean speedup of 1.59. The same implementation works on seven instruction sets (including SVE and RISC-V V) across four platforms. It also supports floating-point and 16-128 bit integer keys. To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries. This paper focuses on the practical engineering aspects enabling the speed and portability, which we have not yet seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact and transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.

preprint, 2022, arXiv

Weighted Random Sampling on GPUs

An alias table is a data structure that allows for efficiently drawing weighted random samples in constant time and can be constructed in linear time. The PSA algorithm by Hübschle-Schneider and Sanders is able to construct alias tables in parallel on the CPU. In this report, we transfer the PSA algorithm to the GPU. Our construction algorithm achieves a speedup of 17 on a consumer GPU in comparison to the PSA method on a 16-core high-end desktop CPU. For sampling, we achieve an up to 24 times higher throughput. Both operations also require several times less energy than on the CPU. Adaptations helping to achieve this include changing memory access patterns to do coalesced access. Where this is not possible, we first copy data to the faster shared memory using coalesced access. We also enhance a generalization of binary search that allows searching for a range of items in parallel. Besides naive sampling, we also give improved batched sampling algorithms.
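
For readers unfamiliar with alias tables, the following is a minimal sequential sketch of the classical construction (often attributed to Walker and Vose). It is not the parallel PSA algorithm or its GPU port described above; the function names `build_alias_table` and `sample` are illustrative assumptions.

```python
import random

def build_alias_table(weights):
    """Sequential alias table construction in O(n) time (not the PSA algorithm)."""
    n = len(weights)
    total = sum(weights)
    # Scale weights so that an average bucket holds probability mass 1.
    scaled = [w * n / total for w in weights]
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large item donates the mass needed to fill bucket s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets (rounding effects) keep prob 1.0 and alias themselves.
    return prob, alias

def sample(prob, alias, rng=random):
    """Draw one weighted sample in constant time."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias_table([1, 2, 3, 4])
draw = sample(prob, alias)
```

Each bucket stores at most two items, which is why a sample needs only one random bucket index and one comparison.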

preprint, 2021, arXiv

Engineering In-place (Shared-memory) Sorting Algorithms

We present sorting algorithms that represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Part of the speed advantage comes from working in-place. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our comparison-based algorithm, In-place Superscalar Samplesort (IPS$^4$o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best in-place parallel comparison-based competitor by almost a factor of three. IPS$^4$o also outperforms the best comparison-based competitors in the in-place or not in-place, parallel or sequential settings. IPS$^4$o even outperforms the best integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new in-place radix sorter turns out to be the best algorithm. Many papers claim to have the, in some sense, "best" sorting algorithm; these claims cannot all be true. Therefore, we base our conclusions on extensive experiments involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
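
The idea behind a branchless decision tree can be sketched as follows: the sorted splitters are stored as an implicit binary search tree, and the comparison result is used as an integer to pick the child. This is a simplified illustration of the general technique, assuming a power-of-two number of buckets; the names `build_splitter_tree` and `classify` are hypothetical and not taken from the IPS$^4$o code.

```python
def build_splitter_tree(splitters):
    """Store k-1 sorted splitters (k a power of two) as an implicit binary tree.

    tree[1] is the root; the children of node j are 2*j and 2*j + 1.
    """
    k = len(splitters) + 1
    assert k >= 2 and k & (k - 1) == 0, "number of buckets must be a power of two"
    tree = [None] * k

    def fill(node, lo, hi):
        # Place the median of splitters[lo:hi] at this node, then recurse.
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, k - 1)
    return tree

def classify(x, tree):
    """Return the bucket index of x among k = len(tree) buckets."""
    k = len(tree)
    j = 1
    while j < k:
        # The comparison yields 0 or 1 and selects the child without if/else.
        j = 2 * j + (x > tree[j])
    return j - k

tree = build_splitter_tree([10, 20, 30])
buckets = [classify(x, tree) for x in (5, 10, 15, 25, 35)]
```

In a compiled language the loop has a fixed trip count of $\log_2 k$ and can be unrolled, so classification involves no data-dependent branches; the Python version only illustrates the index arithmetic.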

preprint, 2020, arXiv

Communication-Efficient (Weighted) Reservoir Sampling from Fully Distributed Data Streams

We consider communication-efficient weighted and unweighted (uniform) random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our goal is to showcase its usefulness. We present and analyze fully distributed, communication-efficient algorithms for both versions of the problem. An experimental evaluation of weighted reservoir sampling on up to 256 nodes (5120 processors) shows good speedups, while theoretical analysis promises further scaling to much larger machines.
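
As background, weighted reservoir sampling over mini-batches can be sketched on a single machine by giving each item an exponentially distributed key with rate equal to its weight and keeping the `k` smallest keys (a known sequential technique). The sketch below is not the distributed, communication-efficient algorithm of the paper; `weighted_reservoir` and its parameters are illustrative assumptions.

```python
import heapq
import math
import random

def weighted_reservoir(batches, k, rng=None):
    """Keep a weighted sample (without replacement) of size k over mini-batches.

    Sequential sketch: each item gets key Exp(1)/weight; the k smallest keys win.
    """
    rng = rng or random.Random()
    heap = []  # max-heap on key via negation: entries are (-key, seq, item)
    seq = 0    # tie-breaker so that items themselves are never compared
    for batch in batches:
        for item, weight in batch:
            if weight <= 0:
                continue  # items without positive weight are never sampled
            key = -math.log(1.0 - rng.random()) / weight
            seq += 1
            if len(heap) < k:
                heapq.heappush(heap, (-key, seq, item))
            elif key < -heap[0][0]:
                heapq.heapreplace(heap, (-key, seq, item))
    return [item for _, _, item in heap]

batches = [[("a", 1.0), ("b", 2.0)], [("c", 3.0), ("d", 0.5)]]
reservoir = weighted_reservoir(batches, k=2, rng=random.Random(42))
```

The reservoir threshold (the largest key currently kept) is the only state a new mini-batch needs to be checked against, which is what makes the mini-batch model convenient.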

preprint, 2020, arXiv

Communication-Efficient String Sorting

There has been surprisingly little work on algorithms for sorting strings on distributed-memory parallel machines. We develop efficient algorithms for this problem based on the multi-way merging principle. These algorithms inspect only characters that are needed to determine the sorting order. Moreover, communication volume is reduced by also communicating (roughly) only those characters and by communicating repetitions of the same prefixes only once. Experiments on up to 1280 cores reveal that these algorithms are often more than five times faster than previous algorithms.

preprint, 2020, arXiv

Connecting MapReduce Computations to Realistic Machine Models

We explain how the popular, highly abstract MapReduce model of parallel computation (MRC) can be rooted in reality by explaining how it can be simulated on realistic distributed-memory parallel machine models like BSP. We first refine the model (MRC$^+$) to include parameters for total work $w$, bottleneck work $\hat{w}$, data volume $m$, and maximum object sizes $\hat{m}$. We then show matching upper and lower bounds for executing a MapReduce calculation on the distributed-memory machine -- $\Theta(w/p+\hat{w}+\log p)$ work and $\Theta(m/p+\hat{m}+\log p)$ bottleneck communication volume using $p$ processors.

preprint, 2020, arXiv

KaHIP v3.00 -- Karlsruhe High Quality Partitioning -- User Guide

This paper serves as a user guide to the graph partitioning framework KaHIP (Karlsruhe High Quality Partitioning). We give a rough overview of the techniques used within the framework and describe the user interface as well as the file formats used. Moreover, we provide a short description of the current library functions provided within the framework. Since version 3.00 we support multilevel partitioning, memetic algorithms, distributed and shared-memory parallel algorithms, node separator and ordering algorithms, edge partitioning algorithms as well as ILP solvers.

preprint, 2020, arXiv

Recent Advances in Scalable Network Generation

Random graph models are frequently used as a controllable and versatile data source for experimental campaigns in various research fields. Generating such data-sets at scale is a non-trivial task as it requires design decisions typically spanning multiple areas of expertise. Challenges begin with the identification of relevant domain-specific network features, continue with the question of how to compile such features into a tractable model, and culminate in algorithmic details arising while implementing the pertaining model. In the present survey, we explore crucial aspects of random graph models with known scalable generators. We begin by briefly introducing network features considered by such models, and then discuss random graphs alongside generation algorithms. Our focus lies on modelling techniques and algorithmic primitives that have proven successful in obtaining massive graphs. We consider concepts and graph models for various domains (such as social networks, infrastructure, ecology, and numerical simulations), and discuss generators for different models of computation (including shared-memory parallelism, massive-parallel GPUs, and distributed systems).

preprint, 2020, arXiv

Robust Massively Parallel Sorting

We investigate distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements. The main outcome is that four sorting algorithms cover the entire range of possible input sizes. For three algorithms we devise new low overhead mechanisms to make them robust with respect to duplicate keys and skewed input distributions. One of these, designed for medium sized inputs, is a new variant of quicksort with fast high-quality pivot selection. At the same time asymptotic analysis provides performance guarantees and guides the selection and configuration of the algorithms. We validate these hypotheses using extensive experiments on 7 algorithms, 10 input distributions, up to 262144 cores, and varying input sizes over 9 orders of magnitude. For difficult input distributions, our algorithms are the only ones that work at all. For all but the largest input sizes, we are the first to perform experiments on such large machines at all and our algorithms significantly outperform the ones one would conventionally have considered.

preprint, 2016, arXiv

Accelerating Local Search for the Maximum Independent Set Problem

Computing high-quality independent sets quickly is an important problem in combinatorial optimization. Several recent algorithms have shown that kernelization techniques can be used to find exact maximum independent sets in medium-sized sparse graphs, as well as high-quality independent sets in huge sparse graphs that are intractable for exact (exponential-time) algorithms. However, a major drawback of these algorithms is that they require significant preprocessing overhead, and therefore cannot be used to find a high-quality independent set quickly. In this paper, we show that performing simple kernelization techniques in an online fashion significantly boosts the performance of local search, and is much faster than pre-computing a kernel using advanced techniques. In addition, we show that cutting high-degree vertices can boost local search performance even further, especially on huge (sparse) complex networks. Our experiments show that we can drastically speed up the computation of large independent sets compared to other state-of-the-art algorithms, while also producing results that are very close to the best known solutions.

preprint, 2016, arXiv

Concurrent Hash Tables: Fast and General?(!)

Concurrent hash tables are one of the most important concurrent data structures with numerous applications. Since hash table accesses can dominate the execution time of the overall application, we need implementations that achieve good speedup. Unfortunately, currently available concurrent hashing libraries turn out to be far away from this requirement in particular when contention on some elements occurs. Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables. We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.

preprint, 2016, arXiv

Engineering a Distributed Full-Text Index

We present a distributed full-text index for big data applications in a distributed environment. Our index can answer different types of pattern matching queries (existential, counting and enumeration). We perform experiments on inputs up to 100 GiB using up to 512 processors, and compare our index with the distributed suffix array by Arroyuelo et al. [Parall. Comput. 40(9): 471--495, 2014]. The result is that our index answers counting queries up to 5.5 times faster than the distributed suffix array, while using about the same space. We also provide a succinct variant of our index that uses only one third of the memory compared with our non-succinct variant, at the expense of only 20% slower query times.
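
The three query types (existential, counting, enumeration) can be illustrated on a single machine with a plain suffix array and binary search. This is only a didactic sketch with a naive construction; it is neither the distributed index of the paper nor the suffix array by Arroyuelo et al., and all function names are illustrative.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (fine for small texts)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def _boundary(text, sa, pattern, upper):
    """First index in sa whose length-|pattern| prefix is >= pattern (or > if upper)."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo

def enumerate_matches(text, sa, pattern):
    """Enumeration query: all start positions of pattern in text."""
    lo = _boundary(text, sa, pattern, upper=False)
    hi = _boundary(text, sa, pattern, upper=True)
    return sorted(sa[lo:hi])

def count(text, sa, pattern):
    """Counting query: number of occurrences of pattern."""
    return len(enumerate_matches(text, sa, pattern))

def exists(text, sa, pattern):
    """Existential query: does pattern occur at all?"""
    return count(text, sa, pattern) > 0

text = "banana"
sa = build_suffix_array(text)
positions = enumerate_matches(text, sa, "ana")
```

All occurrences of a pattern form one contiguous range of the suffix array, so counting needs only the two range boundaries.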

preprint, 2016, arXiv

Fast Parallel Operations on Search Trees

Using (a,b)-trees as an example, we show how to perform a parallel split with logarithmic latency and parallel join, bulk updates, intersection, union (or merge), and (symmetric) set difference with logarithmic latency and with information theoretically optimal work. We present both asymptotically optimal solutions and simplified versions that perform well in practice - they are several times faster than previous implementations.

preprint, 2016, arXiv

Generating Semi-Synthetic Validation Benchmarks for Embryomics

Systematic validation is an essential part of algorithm development. The enormous dataset sizes and the complexity observed in many recent time-resolved 3D fluorescence microscopy imaging experiments, however, prohibit a comprehensive manual ground truth generation. Moreover, existing simulated benchmarks in this field are often too simple or too specialized to sufficiently validate the observed image analysis problems. We present a new semi-synthetic approach to generate realistic 3D+t benchmarks that combines challenging cellular movement dynamics of real embryos with simulated fluorescent nuclei and artificial image distortions including various parametrizable options like cell numbers, acquisition deficiencies or multiview simulations. We successfully applied the approach to simulate the development of a zebrafish embryo with thousands of cells over 14 hours of its early existence.

preprint, 2016, arXiv

Scalable Generation of Scale-free Graphs

We explain how massive instances of scale-free graphs following the Barabasi-Albert model can be generated very quickly in an embarrassingly parallel way. This makes this popular model available for studying big data graph problems. As a demonstration, we generated a Petaedge graph in less than an hour.
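
The abstract does not spell out the mechanism. One way preferential attachment can be made embarrassingly parallel, shown here purely as an assumption-laden sketch, is to replace the random choices of the sequential edge-array formulation by a hash function, so that any edge can be recomputed independently from its index. The hash choice (`blake2b`), the function names, and the parameters `d` (edges per vertex) and `seed` are illustrative and not taken from the paper.

```python
import hashlib

def _hash(seed, position):
    """Deterministic pseudo-random integer for a given array position."""
    data = f"{seed}:{position}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def edge_target(j, d, seed=0):
    """Target vertex of edge j, computed without any shared state.

    Conceptual edge array: position 2*i holds the source of edge i (vertex i // d),
    position 2*i + 1 holds its target, copied from a pseudo-random earlier position.
    """
    pos = 2 * j + 1
    while pos % 2 == 1:
        # Odd positions copy from a position in [0, pos - 1]; retrace until
        # an even position (a source entry with a known value) is reached.
        pos = _hash(seed, pos) % pos
    return (pos // 2) // d

def edge(j, d, seed=0):
    """Edge j as a (source, target) pair."""
    return (j // d, edge_target(j, d, seed))

# Every edge depends only on its index, so index ranges can be split freely
# across workers; here they are simply generated in a loop.
edges = [edge(j, d=3) for j in range(30)]
```

Copying a uniformly chosen earlier array entry selects a vertex with probability proportional to its current degree, which is the preferential attachment rule.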

preprint, 2016, arXiv

Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

We present the design and a first performance evaluation of Thrill -- a prototype of a general purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink with at least two main differences. First, Thrill is based on C++ which enables performance advantages due to direct native code compilation, a more cache-friendly memory layout, and explicit memory management. In particular, Thrill uses template meta-programming to compile chains of subsequent local operations into a single binary routine without intermediate buffering and with minimal indirections. Second, Thrill uses arrays rather than multisets as its primary data structure which enables additional operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping). We compare Thrill with Apache Spark and Apache Flink using five kernels from the HiBench suite. Thrill is consistently faster and often several times faster than the other frameworks. At the same time, the source codes have a similar level of simplicity and abstraction.

preprint, 2015, arXiv

A Bulk-Parallel Priority Queue in External Memory with STXXL

We propose the design and an implementation of a bulk-parallel external memory priority queue to take advantage of both shared-memory parallelism and high external memory transfer speeds to parallel disks. To achieve higher performance by decoupling item insertions and extractions, we offer two parallelization interfaces: one using "bulk" sequences, the other by defining "limit" items. In the design, we discuss how to parallelize insertions using multiple heaps, and how to calculate a dynamic prediction sequence to prefetch blocks and apply parallel multiway merge for extraction. Our experimental results show that in the selected benchmarks the priority queue reaches 75% of the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the speed of sorting in external memory when bounded by computation.

preprint, 2015, arXiv

Advanced Multilevel Node Separator Algorithms

A node separator of a graph is a subset S of the nodes such that removing S and its incident edges divides the graph into two disconnected components of about equal size. In this work, we introduce novel algorithms to find small node separators in large graphs. With focus on solution quality, we introduce novel flow-based local search algorithms which are integrated in a multilevel framework. In addition, we transfer techniques successfully used in the graph partitioning field. This includes the usage of edge ratings tailored to our problem to guide the graph coarsening algorithm as well as highly localized local search and iterated multilevel cycles to improve solution quality even further. Experiments indicate that flow-based local search algorithms on their own in a multilevel framework are already highly competitive in terms of separator quality. Adding additional local search algorithms further improves solution quality. Our strongest configuration almost always outperforms competing systems while on average computing 10% and 62% smaller separators than Metis and Scotch, respectively.
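
The definition of a node separator can be made concrete with a small checker that removes a candidate set S and reports the remaining connected components. This only verifies a given separator; it does not find one and is unrelated to the flow-based algorithms of the paper. The adjacency-dictionary representation and the function name are assumptions for illustration.

```python
from collections import deque

def components_after_removal(adj, separator):
    """Connected components of the graph after deleting the separator nodes.

    adj maps each node to an iterable of its neighbors (undirected graph).
    """
    removed = set(separator)
    seen = set()
    components = []
    for start in adj:
        if start in removed or start in seen:
            continue
        # Breadth-first search restricted to the nodes that were not removed.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components

# Path graph 0 - 1 - 2 - 3 - 4: removing node 2 leaves two equal halves.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
parts = components_after_removal(path, {2})
```

A candidate set is a valid separator in the sense of the abstract if the result has at least two components whose sizes are roughly balanced.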

preprint, 2015, arXiv

Communication Efficient Algorithms for Top-k Selection Problems

We present scalable parallel algorithms with sublinear per-processor communication volume and low latency for several fundamental problems related to finding the most relevant elements in a set, for various notions of relevance: We begin with the classical selection problem with unsorted input. We present generalizations with locally sorted inputs, dynamic content (bulk-parallel priority queues), and multiple criteria. Then we move on to finding frequent objects and top-k sum aggregation. Since it is unavoidable that the output of these algorithms might be unevenly distributed over the processors, we also explain how to redistribute this data with minimal communication.

preprint, 2015, arXiv

Finding Near-Optimal Independent Sets at Scale

The independent set problem is NP-hard and particularly difficult to solve in large sparse graphs. In this work, we develop an advanced evolutionary algorithm, which incorporates kernelization techniques to compute large independent sets in huge sparse networks. A recent exact algorithm has shown that large networks can be solved exactly by employing a branch-and-reduce technique that recursively kernelizes the graph and performs branching. However, one major drawback of their algorithm is that, for huge graphs, branching still can take exponential time. To avoid this problem, we recursively choose vertices that are likely to be in a large independent set (using an evolutionary approach), then further kernelize the graph. We show that identifying and removing vertices likely to be in large independent sets opens up the reduction space---which not only speeds up the computation of large independent sets drastically, but also enables us to compute high-quality independent sets on much larger instances than previously reported in the literature.

preprint, 2015, arXiv

Graph Partitioning for Independent Sets

Computing maximum independent sets in graphs is an important problem in computer science. In this paper, we develop an evolutionary algorithm to tackle the problem. The core innovations of the algorithm are very natural combine operations based on graph partitioning and local search algorithms. More precisely, we employ a state-of-the-art graph partitioner to derive operations that enable us to quickly exchange whole blocks of given independent sets. To enhance newly computed offspring, we combine our operators with a local search algorithm. Our experimental evaluation indicates that we are able to outperform state-of-the-art algorithms on a variety of instances.

preprint, 2015, arXiv

HordeSat: A Massively Parallel Portfolio SAT Solver

A simple yet successful approach to parallel satisfiability (SAT) solving is to run several different (a portfolio of) SAT solvers on the input problem at the same time until one solver finds a solution. The SAT solvers in the portfolio can be instances of a single solver with different configuration settings. Additionally the solvers can exchange information usually in the form of clauses. In this paper we investigate whether this approach is applicable in the case of massively parallel SAT solving. Our solver is intended to run on clusters with thousands of processors, hence the name HordeSat. HordeSat is a fully distributed portfolio-based SAT solver with a modular design that allows it to use any SAT solver that implements a given interface. HordeSat has a decentralized design and features hierarchical parallelism with interleaved communication and search. We experimentally evaluated it using all the benchmark problems from the application tracks of the 2011 and 2014 International SAT Competitions. The experiments demonstrate that HordeSat is scalable up to hundreds or even thousands of processors achieving significant speedups especially for hard instances.

preprint, 2015, arXiv

Incorporating Road Networks into Territory Design

Given a set of basic areas, the territory design problem asks to create a predefined number of territories, each containing at least one basic area, such that an objective function is optimized. Desired properties of territories often include a reasonable balance, compact form, contiguity and small average journey times which are usually encoded in the objective function or formulated as constraints. We address the territory design problem by developing graph theoretic models that also consider the underlying road network. The derived graph models enable us to tackle the territory design problem by modifying graph partitioning algorithms and mixed integer programming formulations so that the objective of the planning problem is taken into account. We test and compare the algorithms on several real world instances.

preprint, 2015, arXiv

k-way Hypergraph Partitioning via n-Level Recursive Bisection

We develop a multilevel algorithm for hypergraph partitioning that contracts the vertices one at a time. Using several caching and lazy-evaluation techniques during coarsening and refinement, we reduce the running time by up to two orders of magnitude compared to a naive $n$-level algorithm that would be adequate for ordinary graph partitioning. The overall performance is even better than the widely used hMetis hypergraph partitioner that uses a classical multilevel algorithm with few levels. Aided by a portfolio-based approach to initial partitioning and adaptive budgeting of imbalance within recursive bipartitioning, we achieve very high quality. We assembled a large benchmark set with 310 hypergraphs stemming from application areas such as VLSI, SAT solving, social networks, and scientific computing. We achieve significantly smaller cuts than hMetis and PaToH, while being faster than hMetis. Considerably larger improvements are observed for some instance classes like social networks, for bipartitioning, and for partitions with an allowed imbalance of 10%. The algorithm presented in this work forms the basis of our hypergraph partitioning framework KaHyPar (Karlsruhe Hypergraph Partitioning).

preprint, 2015, arXiv

n-Level Hypergraph Partitioning

We develop a multilevel algorithm for hypergraph partitioning that contracts the vertices one at a time and thus allows very high quality. This includes a rating function that avoids nonuniform vertex weights, an efficient "semi-dynamic" hypergraph data structure, a very fast coarsening algorithm, and two new local search algorithms. One is a $k$-way hypergraph adaptation of Fiduccia-Mattheyses local search and gives high quality at reasonable cost. The other is an adaptation of size-constrained label propagation to hypergraphs. Comparisons with hMetis and PaToH indicate that the new algorithm yields better quality over several benchmark sets and has a running time that is comparable to hMetis. Using label propagation local search is several times faster than hMetis and gives better quality than PaToH for a VLSI benchmark set.

preprint, 2015, arXiv

Operating Power Grids with Few Flow Control Buses

Future power grids will offer enhanced controllability due to the increased availability of power flow control units (FACTS). As the installation of control units in the grid is an expensive investment, we are interested in using few controllers to achieve high controllability. In particular, two questions arise: How many flow control buses are necessary to obtain globally optimal power flows? And if fewer flow control buses are available, what can we achieve with them? Using steady state IEEE benchmark data sets, we show experimentally that even a small number of controllers placed at certain grid buses suffices to achieve globally optimal power flows. We present a graph-theoretic explanation for this behavior. To answer the second question we perform a set of experiments that explore the existence and costs of feasible power flow solutions at increased loads with respect to the number of flow control buses in the grid. We observe that adding a small number of flow control buses reduces the flow costs and extends the existence of feasible solutions at increased load.

preprint, 2015, arXiv

Parallel Graph Partitioning for Complex Networks

Processing large complex networks like social networks or web graphs has recently attracted considerable interest. In order to do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners originally developed for more regular mesh-like networks do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsened graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very big. For example, our algorithm can partition a web graph with 3.3 billion edges in less than sixteen seconds using 512 cores of a high performance cluster while producing a high quality partition -- none of the competing systems can handle this graph on our system.
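
Size-constrained label propagation, as described in the abstract, can be sketched sequentially: every node starts in its own cluster and repeatedly moves to the neighboring cluster it is most strongly connected to, provided the move does not push that cluster over a size limit. The version below assumes unit node and edge weights and is not the parallel implementation of the paper; the function name and parameters are illustrative.

```python
import random

def size_constrained_label_propagation(adj, max_size, rounds=5, rng=None):
    """Cluster nodes so that no cluster exceeds max_size (unit node weights).

    adj maps each node to a list of neighbors (undirected, unit edge weights).
    """
    rng = rng or random.Random(0)
    label = {v: v for v in adj}   # every node starts in its own cluster
    size = {v: 1 for v in adj}    # current size of each cluster
    nodes = list(adj)
    for _ in range(rounds):
        rng.shuffle(nodes)
        moved = False
        for v in nodes:
            # Count how many neighbors of v carry each label.
            counts = {}
            for u in adj[v]:
                counts[label[u]] = counts.get(label[u], 0) + 1
            current = label[v]
            best, best_count = current, counts.get(current, 0)
            for lab, c in counts.items():
                # Only move to a strictly better cluster that still has room.
                if lab != current and c > best_count and size[lab] + 1 <= max_size:
                    best, best_count = lab, c
            if best != current:
                size[current] -= 1
                size[best] += 1
                label[v] = best
                moved = True
        if not moved:
            break
    return label

# Two triangles joined by one edge; with max_size=3 no cluster can hold 4 nodes.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
clusters = size_constrained_label_propagation(graph, max_size=3)
```

Used for coarsening, each resulting cluster would be contracted into one node; used for refinement, the labels would be the blocks of the partition and the size limit the balance constraint.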

preprint2015arXiv

Practical Massively Parallel Sorting

Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular our sample sort variant turns out to be very scalable. Some tools we develop may be of independent interest -- a simple, practical, and flexible sorting algorithm for small inputs working in logarithmic time, a near linear time optimal algorithm for solving a constrained bin packing problem, and an algorithm for data delivery, that guarantees a small number of message startups on each processor.

preprint2015arXiv

Recent Advances in Graph Partitioning

We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions.

preprint2015arXiv

Route Planning in Transportation Networks

We survey recent advances in algorithms for route planning in transportation networks. For road networks, we show that one can compute driving directions in milliseconds or less even at continental scale. A variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time. Some algorithms can answer queries in a fraction of a microsecond, while others can deal efficiently with real-time traffic. Journey planning on public transportation systems, although conceptually similar, is a significantly harder problem due to its inherent time-dependent and multicriteria nature. Although exact algorithms are fast enough for interactive queries on metropolitan transit systems, dealing with continent-sized instances requires simplifications or heavy preprocessing. The multimodal route planning problem, which seeks journeys combining schedule-based transportation (buses, trains) with unrestricted modes (walking, driving), is even harder, relying on approximate solutions even for metropolitan inputs.

preprint2014arXiv

(Semi-)External Algorithms for Graph Partitioning and Clustering

In this paper, we develop semi-external and external memory algorithms for graph partitioning and clustering problems. Graph partitioning and clustering are key tools for processing and analyzing large complex networks. We address both problems in the (semi-)external model by adapting the size-constrained label propagation technique. Our (semi-)external size-constrained label propagation algorithm can be used to compute graph clusterings and is a prerequisite for the (semi-)external graph partitioning algorithm. The algorithm is then used for both the coarsening and the refinement phase of a multilevel algorithm to compute graph partitions. Our algorithm is able to partition and cluster huge complex networks with billions of edges on cheap commodity machines. Experiments demonstrate that the semi-external graph partitioning algorithm is scalable and can compute high quality partitions in time that is comparable to the running time of an efficient internal memory implementation. A parallelization of the algorithm in the semi-external model further reduces running time.

preprint2014arXiv

Efficient Parallel and External Matching

We show that a simple algorithm for computing a matching on a graph runs in a logarithmic number of phases incurring work linear in the input size. The algorithm can be adapted to provide efficient algorithms in several models of computation, such as PRAM, External Memory, MapReduce and distributed memory models. Our CREW PRAM algorithm is the first O(log^2 n) time, linear work algorithm. Our experimental results indicate the algorithm's high speed and efficiency combined with good solution quality.
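The abstract does not name the algorithm, so the following is only a minimal sequential sketch of one simple phase-based scheme, assumed here for illustration: in each phase, every edge that is heavier than all edges sharing an endpoint with it joins the matching, and all edges touching a matched node are then dropped. The function name `local_max_matching`, the `(weight, u, v)` edge format, and tie-breaking by tuple comparison are assumptions, not taken from the source.

```python
def local_max_matching(edges):
    """Phase-based matching sketch (assumed locally-heaviest-edge rule).

    edges: list of distinct (weight, u, v) tuples describing an undirected graph.
    Returns (matching, phases), where matching is a list of chosen edges.
    """
    remaining = list(edges)
    matching = []
    phases = 0
    while remaining:
        phases += 1
        # For every node, remember the heaviest incident edge still present.
        # Comparing whole tuples breaks ties between equal weights.
        best = {}
        for e in remaining:
            for node in (e[1], e[2]):
                if node not in best or e > best[node]:
                    best[node] = e
        # An edge is chosen if it is the heaviest edge at both of its endpoints.
        chosen = [e for e in remaining if best[e[1]] == e and best[e[2]] == e]
        matched = {node for _, u, v in chosen for node in (u, v)}
        matching.extend(chosen)
        # Drop every edge that touches a node matched in this phase.
        remaining = [e for e in remaining
                     if e[1] not in matched and e[2] not in matched]
    return matching, phases
```

The globally heaviest remaining edge is always chosen, so every phase makes progress. In a parallel or external-memory setting the three passes over `remaining` would each become a bulk operation; this sketch only illustrates the phase structure.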

preprint2014arXiv

Engineering Parallel String Sorting

We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our implementations scale very well on real-world inputs and modern machines.

preprint2014arXiv

Faster Exact Search using Document Clustering

We show how full-text search based on inverted indices can be accelerated by clustering the documents without losing results (SeCluD -- SEarch with CLUstered Documents). We develop a fast multilevel clustering algorithm that explicitly uses query cost for conjunctive queries as an objective function. Depending on the inputs we get up to four times faster than non-clustered search. The resulting clusters are also useful for data compression and for distributing the work over many machines.

preprint2014arXiv

MultiQueues: Simpler, Faster, and Better Relaxed Concurrent Priority Queues

Priority queues with parallel access are an attractive data structure for applications like prioritized online scheduling, discrete event simulation, or branch-and-bound. However, a classical priority queue constitutes a severe bottleneck in this context, leading to very small throughput. Hence, there has been significant interest in concurrent priority queues with a somewhat relaxed semantics where deleted elements only need to be close to the minimum. In this paper we present a very simple approach based on multiple sequential priority queues. It turns out to outperform previous more complicated data structures while at the same time improving the quality of the returned elements.
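A minimal single-threaded sketch of the idea of building a relaxed priority queue from several sequential ones is given below. The abstract does not spell out the rules, so the details are assumptions made for illustration: insertions go to a randomly chosen heap, and deletions compare the minima of two randomly chosen non-empty heaps and remove the smaller. The class name `RelaxedMultiQueue` and its parameters are hypothetical, and no locking or real concurrency is modeled.

```python
import heapq
import random


class RelaxedMultiQueue:
    """Relaxed priority queue built from several sequential binary heaps.

    Assumed rules (not taken from the abstract): random insertion target,
    two-choice deletion. Single-threaded illustration only.
    """

    def __init__(self, num_queues=8, seed=0):
        self._heaps = [[] for _ in range(num_queues)]
        self._rng = random.Random(seed)

    def insert(self, key):
        # Push into one heap chosen uniformly at random.
        heapq.heappush(self._rng.choice(self._heaps), key)

    def delete_min(self):
        nonempty = [h for h in self._heaps if h]
        if not nonempty:
            raise IndexError("delete_min from an empty queue")
        # Look at two random non-empty heaps and pop the smaller minimum.
        a = self._rng.choice(nonempty)
        b = self._rng.choice(nonempty)
        best = a if a[0] <= b[0] else b
        return heapq.heappop(best)

    def __len__(self):
        return sum(len(h) for h in self._heaps)
```

With a single internal heap the structure degenerates to an exact priority queue; with more heaps, the returned element is only likely to be close to the global minimum, which is the relaxation the abstract refers to.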

preprint2014arXiv

Partitioning Complex Networks via Size-constrained Clustering

The most commonly used method to tackle the graph partitioning problem in practice is the multilevel approach. During a coarsening phase, a multilevel graph partitioning algorithm reduces the graph size by iteratively contracting nodes and edges until the graph is small enough to be partitioned by some other algorithm. A partition of the input graph is then constructed by successively transferring the solution to the next finer graph and applying a local search algorithm to improve the current solution. In this paper, we describe a novel approach to partition graphs effectively, especially if the networks have a highly irregular structure. More precisely, our algorithm provides graph coarsening by iteratively contracting size-constrained clusterings that are computed using a label propagation algorithm. The same algorithm that provides the size-constrained clusterings can also be used during uncoarsening as a fast and simple local search algorithm. Depending on the algorithm's configuration, we are able to compute partitions of very high quality outperforming all competitors, or partitions that are comparable to the best competitor in terms of quality, hMetis, while being nearly an order of magnitude faster on average. The fastest configuration partitions the largest graph available to us with 3.3 billion edges using a single machine in about ten minutes while cutting fewer than half as many edges as the fastest competitor, kMetis.
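The size-constrained label propagation step described above can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: nodes have unit weight, nodes are visited in a fixed order, and the function name, the adjacency format, and the `rounds` parameter are assumptions.

```python
from collections import defaultdict


def size_constrained_label_propagation(adj, max_cluster_size, rounds=5):
    """Cluster a graph by label propagation with an upper bound on cluster size.

    adj: dict mapping node -> dict of neighbor -> edge weight (symmetric).
    Returns a dict mapping node -> cluster label. Unit node weights assumed.
    """
    label = {v: v for v in adj}          # every node starts in its own cluster
    size = defaultdict(int)
    for v in adj:
        size[v] = 1
    for _ in range(rounds):
        moved = False
        for v in adj:
            # Total edge weight from v into each neighboring cluster.
            weight_to = defaultdict(float)
            for u, w in adj[v].items():
                weight_to[label[u]] += w
            current = label[v]
            best, best_w = current, weight_to.get(current, 0.0)
            for c, w in weight_to.items():
                # Move only if strictly better and the size bound still holds.
                if c != current and w > best_w and size[c] + 1 <= max_cluster_size:
                    best, best_w = c, w
            if best != current:
                size[current] -= 1
                size[best] += 1
                label[v] = best
                moved = True
        if not moved:
            break
    return label
```

In a multilevel scheme, each resulting cluster would be contracted into a single node of the next coarser graph; the size bound keeps any one coarse node from becoming too heavy to place in a balanced partition.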

preprint2014arXiv

PReaCH: A Fast Lightweight Reachability Index using Pruning and Contraction Hierarchies

We develop the data structure PReaCH (for Pruned Reachability Contraction Hierarchies) which supports reachability queries in a directed graph, i.e., it supports queries that ask whether two nodes in the graph are connected by a directed path. PReaCH adapts the contraction hierarchy speedup techniques for shortest path queries to the reachability setting. The resulting approach is surprisingly simple and guarantees linear space and near linear preprocessing time. Orthogonally to that, we improve existing pruning techniques for the search by gathering more information from a single DFS-traversal of the graph. PReaCH-indices significantly outperform previous data structures with comparable preprocessing cost. Methods with faster queries need significantly more preprocessing time in particular for the most difficult instances.

preprint2013arXiv

Parallel String Sample Sort

preprint2013arXiv

Transit Node Routing Reconsidered

Transit Node Routing (TNR) is a fast and exact distance oracle for road networks. We show several new results for TNR. First, we give a surprisingly simple implementation fully based on Contraction Hierarchies that speeds up preprocessing by an order of magnitude approaching the time for just finding a CH (which alone has two orders of magnitude larger query time). We also develop a very effective purely graph theoretical locality filter without any compromise in query times. Finally, we show that a specialization to the online many-to-one (or one-to-many) shortest path further speeds up query time by an order of magnitude. This variant even has better query time than the fastest known previous methods which need much more space.

preprint2012arXiv

Advanced Coarsening Schemes for Graph Partitioning

The graph partitioning problem is widely used and studied in many practical and theoretical applications. Multilevel strategies are today among the most effective and efficient generic frameworks for solving this problem on large-scale graphs. Most of the attention in designing multilevel partitioning frameworks has been on the refinement phase. In this work we focus on the coarsening phase, which is responsible for creating smaller graphs that are structurally similar to the original. We compare different matching- and AMG-based coarsening schemes, experiment with the algebraic distance between nodes, and demonstrate computational results on several classes of graphs that emphasize the running time and quality advantages of different coarsenings.

preprint2012arXiv

Think Locally, Act Globally: Perfectly Balanced Graph Partitioning

We present a novel local improvement scheme for the perfectly balanced graph partitioning problem. This scheme encodes local searches that are not restricted to a balance constraint into a model allowing us to find combinations of these searches maintaining balance by applying a negative cycle detection algorithm. We combine this technique with an algorithm to balance unbalanced solutions and integrate it into a parallel multi-level evolutionary algorithm, KaFFPaE, to tackle the problem. Overall, we obtain a system that is fast on the one hand and on the other hand is able to improve or reproduce most of the best known perfectly balanced partitioning results ever reported in the literature.

preprint2011arXiv

Distributed Evolutionary Graph Partitioning

We present a novel distributed evolutionary algorithm, KaFFPaE, to solve the Graph Partitioning Problem, which makes use of KaFFPa (Karlsruhe Fast Flow Partitioner). The use of our multilevel graph partitioner KaFFPa provides new effective crossover and mutation operators. By combining these with a scalable communication protocol we obtain a system that is able to improve the best known partitioning results for many inputs in a very short amount of time. For example, in Walshaw's well known benchmark tables we are able to improve or recompute 76% of entries for the tables with 1%, 3% and 5% imbalance.

preprint2011arXiv

Efficient Error-Correcting Geocoding

We study the problem of resolving a perhaps misspelled address of a location into geographic coordinates of latitude and longitude. Our data structure solves this problem within a few milliseconds even for misspelled and fragmentary queries. Compared to major geographic search engines such as Google or Bing we achieve results of significantly better quality.

preprint2011arXiv

Engineering Multilevel Graph Partitioning Algorithms

We present a multi-level graph partitioning algorithm using novel local improvement algorithms and global search strategies transferred from the multi-grid community. Local improvement algorithms are based on max-flow min-cut computations and more localized FM searches. By combining these techniques, we obtain an algorithm that is fast on the one hand and on the other hand is able to improve the best known partitioning results for many inputs. For example, in Walshaw's well known benchmark tables we achieve 317 improvements for the tables with 1%, 3% and 5% imbalance. Moreover, in 118 additional cases we have been able to reproduce the best cut in this benchmark.

preprint2010arXiv

Compressed Transmission of Route Descriptions

We present two methods to compress the description of a route in a road network, i.e., of a path in a directed graph. The first method represents a path by a sequence of via edges. The subpaths between the via edges have to be unique shortest paths. Via nodes can be used instead of via edges, though this requires some simple preprocessing. The second method uses contraction hierarchies to replace subpaths of the original path by shortcuts. The two methods can be combined with each other. We also propose an application to mobile server-based routing: we compute the route on a server that has access to, for example, the latest congestion information. Then we transmit the computed route to the car using some mobile radio communication. There, we apply the compression to save costs and transmission time. If the compression works well, we can transmit routes even when the bandwidth is low. Although we have not evaluated our ideas with realistic data yet, they are quite promising.

preprint2010arXiv

Defining and Computing Alternative Routes in Road Networks

Every human likes choices. But today's fast route planning algorithms usually compute just a single route between source and target. There are first attempts to compute alternative routes, but this topic has not been studied thoroughly. Often, what makes an alternative route meaningful from a human point of view is neglected. We fill this gap by suggesting mathematical definitions for such routes. As a second contribution we propose heuristics to compute them, as this is NP-hard in general.

preprint2010arXiv

Engineering a Scalable High Quality Graph Partitioner

We describe an approach to parallel graph partitioning that scales to hundreds of processors and produces high solution quality. For example, for many instances from Walshaw's benchmark collection we improve the best known partitioning. We use the well known framework of multi-level graph partitioning. All components are implemented by scalable parallel algorithms. Quality improvements compared to previous systems are due to better prioritization of edges to be contracted, better approximation algorithms for identifying matchings, better local search heuristics, and perhaps most notably, a parallelization of the FM local search algorithm that works more locally than previous approaches.

preprint2010arXiv

Faster Radix Sort via Virtual Memory and Write-Combining

Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally paired with a 32-bit value. We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 88 % of the system's peak memory bandwidth. Our implementation outperforms Intel's recently published radix sort by a factor of 1.5. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.
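For orientation, the plain textbook structure that such a radix sort builds on, repeated counting-sort passes over fixed-width digits of 32-bit keys, can be sketched as follows. This is a baseline illustration only: it does not model the virtual-memory or write-combining techniques of the paper, and the function name and the `digit_bits` parameter are assumptions.

```python
def radix_sort_u32(keys, digit_bits=8):
    """LSD radix sort for unsigned 32-bit integers via counting-sort passes.

    Baseline sketch only; no microarchitecture-specific optimizations.
    """
    mask = (1 << digit_bits) - 1
    out = list(keys)
    for shift in range(0, 32, digit_bits):
        # Pass 1: histogram of the current digit.
        counts = [0] * (mask + 1)
        for k in out:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into starting offsets per bucket.
        total = 0
        for d in range(mask + 1):
            counts[d], total = total, total + counts[d]
        # Pass 2: stable scatter of every key into its bucket.
        nxt = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            nxt[counts[d]] = k
            counts[d] += 1
        out = nxt
    return out
```

Each pass reads the input twice and writes it once in a scattered pattern; the scattered writes of the second pass are the memory-bound step that bandwidth-oriented optimizations target.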

preprint2010arXiv

Improved Fast Similarity Search in Dictionaries

We engineer an algorithm to solve the approximate dictionary matching problem. Given a list of words $\mathcal{W}$, a maximum distance $d$ fixed at preprocessing time, and a query word $q$, we would like to retrieve all words from $\mathcal{W}$ that can be transformed into $q$ with $d$ or fewer edit operations. We present data structures that support fault tolerant queries by generating an index. On top of that, we present a generalization of the method that eases memory consumption and preprocessing time significantly. At the same time, running times of queries are virtually unaffected. We are able to match in lists of hundreds of thousands of words and beyond within microseconds for reasonable distances.
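To make the problem statement concrete, here is a brute-force baseline that scans the whole word list and keeps every word within edit distance $d$ of the query. It is not the index-based method of the paper, which avoids such a scan; the function names are hypothetical, and Levenshtein distance (insertions, deletions, substitutions) is assumed as the edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def approximate_matches(words, query, d):
    """Return all words within edit distance d of query (linear scan)."""
    return [w for w in words if edit_distance(w, query) <= d]
```

For example, `approximate_matches(["sort", "port", "short", "graph"], "sort", 1)` returns `["sort", "port", "short"]`. The scan costs time proportional to the total dictionary size per query, which is what a precomputed index is meant to avoid.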

preprint2010arXiv

n-Level Graph Partitioning

We present a multi-level graph partitioning algorithm based on the extreme idea to contract only a single edge on each level of the hierarchy. This obviates the need for a matching algorithm and promises very good partitioning quality since there are very few changes between two levels. Using an efficient data structure and new flexible ways to break local search improvements early, we obtain an algorithm that scales to large inputs and produces the best known partitioning results for many inputs. For example, in Walshaw's well known benchmark tables we achieve 155 improvements dominating the entries for large graphs.

54 published item(s)

Fast Succinct Retrieval and Approximate Membership using Ribbon

More Recent Advances in (Hyper)Graph Partitioning

Parallel Flow-Based Hypergraph Partitioning

Scalable SAT Solving in the Cloud

Vectorized and performance-portable Quicksort

Weighted Random Sampling on GPUs

Engineering In-place (Shared-memory) Sorting Algorithms

Communication-Efficient (Weighted) Reservoir Sampling from Fully Distributed Data Streams

Communication-Efficient String Sorting

Connecting MapReduce Computations to Realistic Machine Models

KaHIP v3.00 -- Karlsruhe High Quality Partitioning -- User Guide

Recent Advances in Scalable Network Generation

Robust Massively Parallel Sorting

Accelerating Local Search for the Maximum Independent Set Problem

Concurrent Hash Tables: Fast and General?(!)

Engineering a Distributed Full-Text Index

Fast Parallel Operations on Search Trees

Generating Semi-Synthetic Validation Benchmarks for Embryomics

Scalable Generation of Scale-free Graphs

Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

A Bulk-Parallel Priority Queue in External Memory with STXXL

Advanced Multilevel Node Separator Algorithms

Communication Efficient Algorithms for Top-k Selection Problems

Finding Near-Optimal Independent Sets at Scale

Graph Partitioning for Independent Sets

HordeSat: A Massively Parallel Portfolio SAT Solver

Incorporating Road Networks into Territory Design

k-way Hypergraph Partitioning via n-Level Recursive Bisection

n-Level Hypergraph Partitioning

Operating Power Grids with Few Flow Control Buses

Parallel Graph Partitioning for Complex Networks

Practical Massively Parallel Sorting

Recent Advances in Graph Partitioning

Route Planning in Transportation Networks

(Semi-)External Algorithms for Graph Partitioning and Clustering

Efficient Parallel and External Matching

Engineering Parallel String Sorting

Faster Exact Search using Document Clustering

MultiQueues: Simpler, Faster, and Better Relaxed Concurrent Priority Queues

Partitioning Complex Networks via Size-constrained Clustering

PReaCH: A Fast Lightweight Reachability Index using Pruning and Contraction Hierarchies

Parallel String Sample Sort

Transit Node Routing Reconsidered

Advanced Coarsening Schemes for Graph Partitioning

Think Locally, Act Globally: Perfectly Balanced Graph Partitioning

Distributed Evolutionary Graph Partitioning

Efficient Error-Correcting Geocoding

Engineering Multilevel Graph Partitioning Algorithms

Compressed Transmission of Route Descriptions

Defining and Computing Alternative Routes in Road Networks

Engineering a Scalable High Quality Graph Partitioner

Faster Radix Sort via Virtual Memory and Write-Combining

Improved Fast Similarity Search in Dictionaries

n-Level Graph Partitioning