Source author record

Jesper Larsson Träff

Jesper Larsson Träff appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Computational Complexity

Catalog footprint: 19 works, 3 topics, 4 close collaborators


Preprint, 2022, arXiv

A Doubly-pipelined, Dual-root Reduction-to-all Algorithm and Implementation

We discuss a simple, binary tree-based algorithm for the collective allreduce (reduction-to-all, MPI_Allreduce) operation for parallel systems consisting of $p$ suitably interconnected processors. The algorithm can be doubly pipelined to exploit bidirectional (telephone-like) communication capabilities of the communication system. In order to make the algorithm more symmetric, the processors are organized into two rooted trees with communication between the two roots. For each pipeline block, each non-leaf processor takes three communication steps, consisting in receiving and sending from and to the two children, and sending and receiving to and from the root. In a round-based, uniform, linear-cost communication model in which simultaneously sending and receiving $n$ data elements takes time $α+βn$ for system dependent constants $α$ (communication start-up latency) and $β$ (time per element), the time for the allreduce operation on vectors of $m$ elements is $O(\log p+\sqrt{m\log p})+3βm$ by suitable choice of the pipeline block size. We compare the performance of an implementation in MPI to similar reduce followed by broadcast algorithms, and the native MPI_Allreduce collective on a modern, small $36\times 32$ processor cluster. With proper choice of the number of pipeline blocks, it is possible to achieve better performance than pipelined algorithms that do not exploit bidirectional communication.
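The role of the pipeline block size in the bound above can be illustrated with a small cost-model sketch. It assumes, purely for illustration, that a pipeline over a tree of depth `depth` (roughly log2 p) with `b` blocks needs `3 * (depth + b - 1)` communication steps, each costing α + β·m/b. The exact round count of the algorithm is not given in the abstract, so `pipelined_time` and `best_block_count` are hypothetical helpers, not the authors' analysis.

```python
import math


def pipelined_time(m, b, depth, alpha, beta, steps=3):
    """Modelled time for m elements split into b pipeline blocks.

    Assumption (not from the abstract): steps * (depth + b - 1) communication
    steps, each moving one block of m / b elements at cost alpha + beta * m / b.
    """
    rounds = steps * (depth + b - 1)
    return rounds * (alpha + beta * m / b)


def best_block_count(m, depth, alpha, beta):
    """Block count minimising pipelined_time (calculus, then rounding)."""
    # d/db of (depth - 1 + b) * (alpha + beta * m / b) vanishes at
    # b = sqrt((depth - 1) * beta * m / alpha).
    b = math.sqrt((depth - 1) * beta * m / alpha)
    return max(1, min(m, round(b)))


if __name__ == "__main__":
    m, depth = 1_000_000, 10          # about p = 1024 processors
    alpha, beta = 1e-6, 1e-9          # illustrative constants only
    b = best_block_count(m, depth, alpha, beta)
    print(b, pipelined_time(m, b, depth, alpha, beta))
```

Under this model the minimum is 3[(depth-1)α + 2·sqrt((depth-1)αβm) + βm], which has the same shape as the stated bound: a 3βm term plus a lower-order term in log p and sqrt(m log p).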

Preprint, 2020, arXiv

$k$-ported vs. $k$-lane Broadcast, Scatter, and Alltoall Algorithms

In $k$-ported message-passing systems, a processor can simultaneously receive $k$ different messages from $k$ other processors, and send $k$ different messages to $k$ other processors that may or may not be different from the processors from which messages are received. Modern clustered systems may not have such capabilities. Instead, compute nodes consisting of $n$ processors can simultaneously send and receive $k$ messages from other nodes, by letting $k$ processors on the nodes concurrently send and receive at most one message. We pose the question of how to design good algorithms for this $k$-lane model, possibly by adapting algorithms devised for the traditional $k$-ported model. We discuss and compare a number of (non-optimal) $k$-lane algorithms for the broadcast, scatter and alltoall collective operations (as found in, e.g., MPI), and experimentally evaluate these on a small $36\times 32$-node cluster with a dual OmniPath network (corresponding to $k=2$). Results are preliminary.

Preprint, 2020, arXiv

Decomposing Collectives for Exploiting Multi-lane Communication

Many modern, high-performance systems increase the cumulated node-bandwidth by offering more than a single communication network and/or by having multiple connections to the network. Efficient algorithms and implementations for collective operations as found in, e.g., MPI must be explicitly designed for such multi-lane capabilities. We discuss a model for the design of multi-lane algorithms, and in particular give a recipe for converting any standard, one-ported, (pipelined) communication tree algorithm into a multi-lane algorithm that can effectively use $k$ lanes simultaneously. We first examine the problem from the perspective of self-consistent performance guidelines, and give simple, full-lane, mock-up implementations of the MPI broadcast, reduction, scan, gather, scatter, allgather, and alltoall operations using only similar operations of the given MPI library itself in such a way that multi-lane capabilities can be exploited. These implementations, which rely on a decomposition of the communication domain into communicators for nodes and lanes, are full-fledged and readily usable implementations of the MPI collectives. The mock-up implementations, contrary to expectation, in many cases show surprising performance improvements with different MPI libraries on a small 36-node dual-socket, dual-lane Intel OmniPath cluster, indicating severe problems with the native MPI library implementations. Our full-lane implementations are in many cases considerably more than a factor of two faster than the corresponding MPI collectives. We see similar results on the larger Vienna Scientific Cluster, VSC-3. These experiments indicate considerable room for improvement of the MPI collectives in current libraries including more efficient use of multi-lane communication.
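The decomposition into node and lane communicators mentioned above can be sketched without MPI as plain rank arithmetic. The sketch assumes that ranks are numbered consecutively within each node and that every node hosts the same number `n` of processes. Both are assumptions made here for illustration, and `decompose` is a hypothetical helper. In an MPI program the two groupings would typically correspond to communicator splits with colours `rank // n` and `rank % n`.

```python
def decompose(p, n):
    """Group p ranks into node groups and lane groups.

    Assumes n ranks per node, numbered consecutively (an illustrative
    assumption, not taken from the abstract).
    """
    if n <= 0 or p % n != 0:
        raise ValueError("p must be a positive multiple of n")
    nodes = p // n
    # Ranks on the same node share rank // n.
    node_groups = [[r for r in range(p) if r // n == node] for node in range(nodes)]
    # Ranks with the same local index share rank % n and form one lane across nodes.
    lane_groups = [[r for r in range(p) if r % n == lane] for lane in range(n)]
    return node_groups, lane_groups


if __name__ == "__main__":
    node_groups, lane_groups = decompose(6, 2)
    print(node_groups)   # ranks grouped per node
    print(lane_groups)   # ranks grouped per lane
```

Each rank belongs to exactly one node group and one lane group, so a collective can be split into an intra-node phase and independent inter-node phases that run on all lanes at once.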

Preprint, 2020, arXiv

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Good process-to-compute-node mappings can be decisive for well performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the inherently present structure in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance. Furthermore, our algorithms also achieve significantly better mapping quality compared to previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.

Preprint, 2020, arXiv

High-Quality Hierarchical Process Mapping

Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When the topology of a distributed system is known, an important task is then to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, as well as a compressed data structure to compute processor distances without storing a distance matrix. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm yields better solutions than the previous state-of-the-art in terms of mapping quality while being a factor of 62 faster. Compared to the currently fastest iterated multilevel mapping algorithm Scotch, we obtain 16% better solutions while investing slightly more running time.

Preprint, 2016, arXiv

Benchmarking Concurrent Priority Queues: Performance of k-LSM and Related Data Structures

A number of concurrent, relaxed priority queues have recently been proposed and implemented. Results are commonly reported for a throughput benchmark that uses a uniform distribution of keys drawn from a large integer range, and mostly for single systems. We have conducted more extensive benchmarking of three recent, relaxed priority queues on four different types of systems with different key ranges and distributions. While we can show superior throughput and scalability for our own k-LSM priority queue for the uniform key distribution, the picture changes drastically for other distributions, both with respect to achieved throughput and relative merit of the priority queues. The throughput benchmark alone is thus not sufficient to characterize the performance of concurrent priority queues. Our benchmark code and k-LSM priority queue are publicly available to foster future comparison.

Preprint, 2016, arXiv

Message-Combining Algorithms for Isomorphic, Sparse Collective Communication

Isomorphic (sparse) collective communication is a form of collective communication in which all involved processes communicate in small, identically structured neighborhoods of other processes. Isomorphic neighborhoods are defined via an embedding of the processes in a regularly structured topology, e.g., $d$-dimensional torus, which may correspond to the physical communication network of the underlying system. Isomorphic collective communication is useful for implementing stencil and other regular, sparse distributed computations, where the assumption that all processes behave (almost) symmetrically is justified. In this paper, we show how efficient message-combining communication schedules for isomorphic, sparse collective communication can easily and efficiently be computed by purely local computations. We give schemes for isomorphic all-to-all and allgather communication that reduce the number of communication rounds and thereby the communication latency from $s$ to at most $Nd$, for neighborhoods consisting of $s$ processes with the (small) factor $N$ depending on the structure of the neighborhood and the capabilities of the communication system. Using these schedules, we give zero-copy implementations of the isomorphic collectives using MPI and its derived datatypes to eliminate explicit, process-local copy operations. By benchmarking the collective communication algorithms against straightforward implementations and against the corresponding MPI neighborhood collectives, we document significant latency improvements of our implementations for block sizes of up to a few kilobytes. We discuss further optimizations for computing even better schedules, some of which have been implemented and benchmarked.

Preprint, 2016, arXiv

MPI Derived Datatypes: Performance Expectations and Status Quo

We examine natural expectations on communication performance using MPI derived datatypes in comparison to the baseline, "raw" performance of communicating simple, non-contiguous data layouts. We show that common MPI libraries sometimes violate these datatype performance expectations, and discuss reasons why this happens, but also show cases where MPI libraries perform well. Our findings are in many ways surprising and disappointing. First, the performance of derived datatypes is sometimes worse than the semantically equivalent packing and unpacking using the corresponding MPI functionality. Second, the communication performance equivalence stated in the MPI standard between a single contiguous datatype and the repetition of its constituent datatype does not hold universally. Third, the heuristics that are typically employed by MPI libraries at type-commit time are insufficient to enforce natural performance guidelines, and better type normalization heuristics may have a significant performance impact. We show cases where all the MPI type constructors are necessary to achieve the expected performance for certain data layouts. We describe our benchmarking approach to verify the datatype performance guidelines, and present extensive verification results for different MPI libraries.

Preprint, 2016, arXiv

PGMPI: Automatically Verifying Self-Consistent MPI Performance Guidelines

The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is becoming increasingly difficult to optimize MPI libraries, as many factors can influence the communication performance. To assist MPI developers and users, we propose an automatic way to check whether MPI libraries respect self-consistent performance guidelines for collective communication operations. We introduce the PGMPI framework to detect violations of performance guidelines through benchmarking. Our experimental results show that PGMPI can pinpoint undesired and often unexpected performance degradations of collective MPI operations. We demonstrate how to overcome performance issues of several libraries by adapting the algorithmic implementations of their respective collective MPI calls.
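A self-consistent performance guideline of the kind described above compares a collective against a "mock-up" built from other collectives of the same library. A minimal sketch of such a check follows. The function name, the tolerance parameter and the timing values are all hypothetical placeholders and are not taken from PGMPI. The guideline used in the example (an allreduce should not be slower than a reduce followed by a broadcast on the same data) is one commonly cited instance and is assumed here rather than quoted from the abstract.

```python
def violates_guideline(t_collective, t_mockup, tolerance=0.05):
    """Return True if the collective is slower than its mock-up.

    t_collective: measured time of the library collective (seconds).
    t_mockup:     measured time of the equivalent mock-up (seconds).
    tolerance:    relative slack to absorb measurement noise (hypothetical).
    """
    return t_collective > (1.0 + tolerance) * t_mockup


if __name__ == "__main__":
    # Placeholder timings, invented purely to exercise the function.
    t_allreduce = 120e-6
    t_reduce, t_bcast = 50e-6, 40e-6
    print(violates_guideline(t_allreduce, t_reduce + t_bcast))
```

In a real benchmark the timings would come from repeated, synchronized measurements, and the comparison would normally be a statistical test rather than a single threshold.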

Preprint, 2016, arXiv

The Shortest Path Problem with Edge Information Reuse is NP-Complete

We show that the following variation of the single-source shortest path problem is NP-complete. Let a weighted, directed, acyclic graph $G=(V,E,w)$ with source and sink vertices $s$ and $t$ be given. Let in addition a mapping $f$ on $E$ be given that associates information with the edges (e.g., a pointer), such that $f(e)=f(e')$ means that edges $e$ and $e'$ carry the same information; for such edges it is required that $w(e)=w(e')$. The length of a simple $st$ path $U$ is the sum of the weights of the edges on $U$ but edges with $f(e)=f(e')$ are counted only once. The problem is to determine a shortest such $st$ path. We call this problem the edge information reuse shortest path problem. It is NP-complete by reduction from 3SAT.
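The modified path length defined above is easy to state in code. The following sketch computes it for a given path and, for tiny acyclic graphs only, finds a shortest path by enumerating every $st$ path. The brute-force search is exponential in general, which is consistent with the NP-completeness result, and it is not an algorithm from the paper. The graph, weights and information labels are made up for illustration.

```python
def reuse_path_length(path, weight, info):
    """Sum edge weights along path, counting edges with equal info only once."""
    seen = set()
    total = 0
    for edge in path:
        label = info[edge]
        if label not in seen:
            seen.add(label)
            total += weight[edge]
    return total


def all_paths(adj, s, t):
    """Yield every s-t path of a DAG as a list of edges (exponential in general)."""
    if s == t:
        yield []
        return
    for v in adj.get(s, []):
        for rest in all_paths(adj, v, t):
            yield [(s, v)] + rest


def shortest_reuse_path(adj, weight, info, s, t):
    """Brute-force shortest path under the reuse length; for tiny inputs only."""
    return min(all_paths(adj, s, t), key=lambda p: reuse_path_length(p, weight, info))


if __name__ == "__main__":
    adj = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
    weight = {("s", "a"): 4, ("a", "t"): 4, ("s", "b"): 3, ("b", "t"): 3}
    # The two edges through "a" carry the same information, so they count once.
    info = {("s", "a"): "x", ("a", "t"): "x", ("s", "b"): "y", ("b", "t"): "z"}
    best = shortest_reuse_path(adj, weight, info, "s", "t")
    print(best, reuse_path_length(best, weight, info))
```

In this example the ordinary shortest path goes through `b` (length 6 versus 8), while under information reuse the path through `a` is shorter (length 4), which shows why standard shortest-path reasoning does not carry over directly.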

Preprint, 2015, arXiv

Polynomial-time Construction of Optimal Tree-structured Communication Data Layout Descriptions

We show that the problem of constructing tree-structured descriptions of data layouts that are optimal with respect to space or other criteria from given sequences of displacements, can be solved in polynomial time. The problem is relevant for efficient compiler and library support for communication of noncontiguous data, where tree-structured descriptions with low-degree nodes and small index arrays are beneficial for the communication soft- and hardware. An important example is the Message-Passing Interface (MPI) which has a mechanism for describing arbitrary data layouts as trees using a set of increasingly general constructors. Our algorithm shows that the so-called MPI datatype reconstruction problem by trees with the full set of MPI constructors can be solved optimally in polynomial time, refuting previous conjectures that the problem is NP-hard. Our algorithm can handle further, natural constructors, currently not found in MPI. Our algorithm is based on dynamic programming, and requires the solution of a series of shortest path problems on an incrementally built, directed, acyclic graph. The algorithm runs in $O(n^4)$ time steps and requires $O(n^2)$ space for input displacement sequences of length $n$.

Preprint, 2015, arXiv

The Lock-free $k$-LSM Relaxed Priority Queue

Priority queues are data structures which store keys in an ordered fashion to allow efficient access to the minimal (maximal) key. Priority queues are essential for many applications, e.g., Dijkstra's single-source shortest path algorithm, branch-and-bound algorithms, and prioritized schedulers. Efficient multiprocessor computing requires implementations of basic data structures that can be used concurrently and scale to large numbers of threads and cores. Lock-free data structures promise superior scalability by avoiding blocking synchronization primitives, but the delete-min operation is an inherent scalability bottleneck in concurrent priority queues. Recent work has focused on alleviating this obstacle either by batching operations, or by relaxing the requirements to the delete-min operation. We present a new, lock-free priority queue that relaxes the delete-min operation so that it is allowed to delete any of the $ρ+1$ smallest keys, where $ρ$ is a runtime configurable parameter. Additionally, the behavior is identical to a non-relaxed priority queue for items added and removed by the same thread. The priority queue is built from a logarithmic number of sorted arrays in a way similar to log-structured merge-trees. We experimentally compare our priority queue to recent state-of-the-art lock-free priority queues, both with relaxed and non-relaxed semantics, showing high performance and good scalability of our approach.
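The relaxed delete-min semantics can be illustrated with a deliberately simple, sequential toy. It is not lock-free, does not use the log-structured merge layout, and does not model the per-thread guarantee described above. It only shows what "delete any of the $ρ+1$ smallest keys" permits. The class `RelaxedPQ` and its random choice are illustrative assumptions.

```python
import bisect
import random


class RelaxedPQ:
    """Sequential toy: delete_min may return any of the rho + 1 smallest keys."""

    def __init__(self, rho, seed=None):
        if rho < 0:
            raise ValueError("rho must be non-negative")
        self.rho = rho
        self.keys = []                      # kept in ascending order
        self.rng = random.Random(seed)      # stands in for concurrency effects

    def insert(self, key):
        bisect.insort(self.keys, key)

    def delete_min(self):
        if not self.keys:
            raise IndexError("delete_min from empty queue")
        # Any position inside the window of the rho + 1 smallest keys is allowed.
        window = min(self.rho + 1, len(self.keys))
        return self.keys.pop(self.rng.randrange(window))


if __name__ == "__main__":
    pq = RelaxedPQ(rho=2, seed=1)
    for key in [7, 3, 9, 1, 5]:
        pq.insert(key)
    print([pq.delete_min() for _ in range(5)])
```

With `rho = 0` the window has size one and the toy behaves like an exact priority queue. Larger values trade ordering precision for freedom, which is what a concurrent implementation can exploit to reduce contention.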

Preprint, 2014, arXiv

An improved, easily computable combinatorial lower bound for weighted graph bipartitioning

There has recently been much progress on exact algorithms for the (un)weighted graph (bi)partitioning problem using branch-and-bound and related methods. In this note we present and improve an easily computable, purely combinatorial lower bound for the weighted bipartitioning problem. The bound is computable in $O(n\log n+m)$ time steps for weighted graphs with $n$ vertices and $m$ edges. In the branch-and-bound setting, the bound for each new subproblem can be updated in $O(n+(m/n)\log n)$ time steps amortized over a series of $n$ branching steps; a rarely triggered tightening of the bound requires search on the graph of unassigned vertices and can take from $O(n+m)$ to $O(nm+n^2\log n)$ steps depending on implementation and possible bound quality. Representing a subproblem uses $O(n)$ space. Although the bound is weak, we believe that it can be advantageous in a parallel setting to be able to generate many subproblems fast, possibly outweighing the advantages of tighter, but much more expensive (algebraic, spectral, flow) lower bounds. We use a recent priority task-scheduling framework for giving a parallel implementation, and show the relative improvements in bound quality and solution speed by the different contributions of the lower bound. A detailed comparison with standardized input graphs to other lower bounds and frameworks is pending. Detailed investigations of branching and subproblem selection rules are likewise not the focus here, but various options are discussed.

Preprint, 2013, arXiv

A Note on (Parallel) Depth- and Breadth-First Search by Arc Elimination

This note recapitulates an algorithmic observation for ordered Depth-First Search (DFS) in directed graphs that immediately leads to a parallel algorithm with linear speed-up for a range of processors for non-sparse graphs. The note extends the approach to ordered Breadth-First Search (BFS). With $p$ processors, both DFS and BFS algorithms run in $O(m/p+n)$ time steps on a shared-memory parallel machine allowing concurrent reading of locations, e.g., a CREW PRAM, and have linear speed-up for $p\leq m/n$. Both algorithms need $n$ synchronization steps.
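One way to read the arc-elimination idea is sketched below, sequentially: when a vertex is first visited, all arcs pointing to it are removed from the adjacency structures of its in-neighbours, so the search never has to inspect an arc to an already visited vertex. The removals for one vertex are independent of each other, which is where parallelism could be applied. This is an interpretation of the abstract, not the note's algorithm verbatim, and the dictionary-based data structure is chosen for brevity rather than for the stated time bounds.

```python
def dfs_arc_elimination(n, arcs, root):
    """Ordered DFS preorder from root; arcs into a vertex are removed on visit."""
    out = [dict() for _ in range(n)]    # remaining out-arcs, in insertion order
    inc = [[] for _ in range(n)]        # in-neighbours of each vertex
    for u, v in arcs:
        if v not in out[u]:             # ignore duplicate arcs
            out[u][v] = None
            inc[v].append(u)

    order = []

    def visit(v):
        # Eliminate every arc into v. The deletions touch distinct arcs and
        # could be carried out in parallel.
        for u in inc[v]:
            del out[u][v]
        order.append(v)

    visit(root)
    stack = [root]
    while stack:
        u = stack[-1]
        if out[u]:
            v = next(iter(out[u]))      # first remaining arc leads to an unvisited vertex
            visit(v)
            stack.append(v)
        else:
            stack.pop()                 # no arcs left: backtrack
    return order


if __name__ == "__main__":
    arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 0), (3, 2), (0, 4)]
    print(dfs_arc_elimination(5, arcs, 0))
```

Because each arc is deleted exactly once and each vertex is pushed once, the sequential loop itself has only $n$ steps of control flow, which matches the intuition behind the $n$ synchronization steps mentioned above.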

Preprint, 2013, arXiv

Configurable Strategies for Work-stealing

Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order as well as steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation) qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search).

Preprint, 2013, arXiv

On the State and Importance of Reproducible Experimental Research in Parallel Computing

Computer science is also an experimental science. This is particularly the case for parallel computing, which is in a total state of flux, and where experiments are necessary to substantiate, complement, and challenge theoretical modeling and analysis. Here, experimental work is as important as advances in theory, which are indeed often driven by the experimental findings. In parallel computing, scientific contributions presented in research articles are therefore often based on experimental data, with a substantial part devoted to presenting and discussing the experimental findings. As in all of experimental science, experiments must be presented in a way that makes reproduction by other researchers possible, in principle. Despite appearances to the contrary, we contend that reproducibility plays a small role, and is typically not achieved. Articles often do not have a sufficiently detailed description of their experiments, and do not make available the software used to obtain the claimed results. As a consequence, parallel computational results are most often impossible to reproduce, often questionable, and therefore of little or no scientific value. We believe that the description of how to reproduce findings should play an important part in every serious, experiment-based parallel computing research article. We aim to initiate a discussion of the reproducibility issue in parallel computing, and elaborate on the importance of reproducible research for (1) better and sounder technical/scientific papers, (2) a sounder and more efficient review process and (3) more effective collective work. This paper expresses our current view on the subject and should be read as a position statement for discussion and future work. We do not consider the related (but no less important) issue of the quality of the experimental design.

Preprint, 2013, arXiv

Perfectly load-balanced, optimal, stable, parallel merge

We present a simple, work-optimal and synchronization-free solution to the problem of stably merging in parallel two given, ordered arrays of m and n elements into an ordered array of m+n elements. The main contribution is a new, simple, fast and direct algorithm that determines, for any prefix of the stably merged output sequence, the exact prefixes of each of the two input sequences needed to produce this output prefix. More precisely, for any given index (rank) in the resulting, but not yet constructed output array representing an output prefix, the algorithm computes the indices (co-ranks) in each of the two input arrays representing the required input prefixes without having to merge the input arrays. The co-ranking algorithm takes O(log min(m,n)) time steps. The algorithm is used to devise a perfectly load-balanced, stable, parallel merge algorithm where each of p processing elements has exactly the same number of input elements to merge. Compared to other approaches to the parallel merge problem, our algorithm is considerably simpler and can be faster up to a factor of two. Compared to previous algorithms for solving the co-ranking problem, the algorithm given here is direct and maintains stability in the presence of repeated elements at no extra space or time cost. When the number of processing elements p does not exceed (m+n)/log min(m,n), the parallel merge algorithm has optimal speedup. It is easy to implement on both shared and distributed memory parallel systems.
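The co-ranking idea described above can be sketched as a binary search that, for an output rank `i`, finds how many elements of each input belong to the first `i` merged elements, with ties resolved in favour of the first input so that the merge stays stable. The code below is a sequential illustration in the spirit of the abstract and not a transcription of the paper's pseudocode. The partitioning loop in `partitioned_merge` runs the independent parts one after another, where a parallel program would assign them to different processing elements.

```python
def co_rank(i, a, b):
    """Return (j, k) with j + k == i such that a[:j] and b[:k] merge to the
    first i elements of the stable merge of a and b (ties take a first)."""
    m, n = len(a), len(b)
    j = min(i, m)
    k = i - j
    j_low = max(0, i - n)
    k_low = max(0, i - m)
    while True:
        if j > 0 and k < n and a[j - 1] > b[k]:
            # Too many elements taken from a: move half the distance down.
            delta = (j - j_low + 1) // 2
            k_low = k
            j -= delta
            k += delta
        elif k > 0 and j < m and b[k - 1] >= a[j]:
            # Too many elements taken from b (equal keys must come from a first).
            delta = (k - k_low + 1) // 2
            j_low = j
            j += delta
            k -= delta
        else:
            return j, k


def sequential_merge(x, y):
    """Stable two-way merge; on ties the element from x comes first."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if y[j] < x[i]:
            out.append(y[j])
            j += 1
        else:
            out.append(x[i])
            i += 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out


def partitioned_merge(a, b, p):
    """Merge a and b in p independent parts of (nearly) equal output size."""
    total = len(a) + len(b)
    out = []
    for t in range(p):                      # each iteration is independent
        lo, hi = t * total // p, (t + 1) * total // p
        j0, k0 = co_rank(lo, a, b)
        j1, k1 = co_rank(hi, a, b)
        out.extend(sequential_merge(a[j0:j1], b[k0:k1]))
    return out


if __name__ == "__main__":
    print(co_rank(3, [1, 3, 5], [2, 4, 6]))
    print(partitioned_merge([1, 3, 5, 7, 9], [2, 3, 4, 8], 3))
```

Since every part produces a contiguous slice of the output and its inputs are determined by two co-rank calls, no communication between parts is needed. When `p` does not divide the total length, the part sizes in this sketch differ by at most one.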

Preprint, 2012, arXiv

Simplified, stable parallel merging

This note makes an observation that significantly simplifies a number of previous parallel, two-way merge algorithms based on binary search and sequential merge in parallel. First, it is shown that the additional merge step of distinguished elements as found in previous algorithms is not necessary, thus simplifying the implementation and reducing constant factors. Second, by fixing the requirements to the binary search, the merge algorithm becomes stable, provided that the sequential merge subroutine is stable. The stable, parallel merge algorithm can easily be used to implement a stable, parallel merge sort. For ordered sequences with $n$ and $m$ elements, $m\leq n$, the simplified merge algorithm runs in $O(n/p+\log n)$ operations using $p$ processing elements. It can be implemented on an EREW PRAM, but since it requires only a single synchronization step, it is also a candidate for implementation on other parallel, shared-memory computers.

Preprint, 2010, arXiv

Work-stealing for mixed-mode parallelism by deterministic team-building

We show how to extend classical work-stealing to deal also with data parallel tasks that can require any number of threads $r\geq 1$ for their execution. We explain in detail the resulting idea of work-stealing with deterministic team-building, which generalizes classical work-stealing in a natural way. A prototype C++ implementation of the generalized work-stealing algorithm has been developed and is briefly described. Building on this, a serious, well-known contender for a best parallel Quicksort algorithm has been implemented, which naturally relies on both task and data parallelism. For instance, when sorting $2^{27}-1$ randomly generated integers, we could improve the speed-up from 5.1 to 8.7 on a 32-core Intel Nehalem EX system, being consistently better than the tuned, task-parallel Cilk++ system.
