Source author record

Paul Springer

Paul Springer appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Mathematical Software; Performance; Computational Engineering, Finance, and Science; cond-mat.mtrl-sci; hep-ph; physics.chem-ph; physics.comp-ph

Catalog footprint

What is connected

5 works

8 topics

4 close collaborators

preprint, 2016, arXiv

TTC: A high-performance Compiler for Tensor Transpositions

We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.
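As a rough illustration of the blocking optimization mentioned in the abstract, the sketch below transposes a two-dimensional row-major matrix tile by tile, so that reads and writes both stay within a small working set. It is not code generated by TTC: the function name `blocked_transpose`, the default tile size of 16, and the restriction to two dimensions are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TTC): out-of-place transpose of a
// row-major rows x cols matrix, processed in block x block tiles.
std::vector<double> blocked_transpose(const std::vector<double>& in,
                                      std::size_t rows, std::size_t cols,
                                      std::size_t block = 16) {
    if (block == 0) block = 1;  // guard against a zero tile size
    std::vector<double> out(in.size());
    for (std::size_t i0 = 0; i0 < rows; i0 += block) {
        for (std::size_t j0 = 0; j0 < cols; j0 += block) {
            // Clamp the tile at the matrix edges.
            const std::size_t i1 = std::min(i0 + block, rows);
            const std::size_t j1 = std::min(j0 + block, cols);
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    // Element (i, j) of the input becomes (j, i) of the output.
                    out[j * rows + i] = in[i * cols + j];
                }
            }
        }
    }
    return out;
}
```

A tuned implementation would also choose the tile size per architecture and vectorize the inner tile, which is the kind of search the abstract attributes to TTC.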

preprint, 2016, arXiv

TTC: A Tensor Transposition Compiler for Multiple Architectures

We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open-source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel's Knights Corner, as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general-purpose counterpart significantly. For instance, compared with Intel's latest C++ compiler on the Haswell and Knights Corner architectures, TTC yields speedups of up to $8\times$ and $32\times$, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

preprint, 2015, arXiv

A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Simulations of Interfacial Systems

This master's thesis introduces the idea of dynamic cutoffs in molecular dynamics simulations, based on the distance between particles and the interface, and presents a solution for detecting interfaces in real time. Our dynamic cutoff method (DCM) exhibits linear-time complexity as well as nearly ideal weak and strong scaling. The DCM is tailored for massively parallel architectures and for large interfacial systems with millions of particles. We implemented the DCM as part of the LAMMPS open-source molecular dynamics package and demonstrate the nearly ideal weak- and strong-scaling behavior of this method on an IBM BlueGene/Q supercomputer. Our results for a liquid/vapor system consisting of Lennard-Jones particles show that the accuracy of DCM is comparable to that of the traditional particle-particle particle-mesh (PPPM) algorithm. The performance comparison indicates that DCM is preferable for large systems due to the limited scaling of FFTs within the PPPM algorithm. Moreover, the DCM requires the interface to be identified every other MD timestep. As a consequence, this thesis also presents an interface detection method which (1) is applicable in real time; (2) is parallelizable; and (3) scales linearly with respect to the number of particles.

preprint, 2015, arXiv

O(2)-scaling in finite and infinite volume

The exact nature of the chiral phase transition in QCD is still under investigation. In $N_f=2$ and $N_f=(2+1)$ lattice simulations with staggered fermions, the expected O($N$)-scaling behavior was observed. However, it is still not clear whether this behavior falls into the O(2) or O(4) universality class. To resolve this issue, a careful scaling and finite-size scaling analysis of the lattice results is needed. We use a functional renormalization group to perform a new investigation of the finite-size scaling regions in O(2)- and O(4)-models. We also investigate the behavior of the critical fluctuations by means of the $4^{\text{th}}$-order Binder cumulant. The finite-size analysis of this quantity provides an additional way of determining the universality class of the chiral phase transition in lattice QCD.
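For orientation, one widely used convention for the fourth-order Binder cumulant of an order parameter $M$ measured in a volume of linear size $L$ is given below. The symbols $M$, $L$, and $B_4$ are assumptions for this note, and the abstract does not state which normalization the authors adopt, so theirs may differ by constant factors.

```latex
B_4(L) = 1 - \frac{\langle M^4 \rangle_L}{3\,\langle M^2 \rangle_L^{2}}
```

In this convention the value of $B_4$ at the critical point is expected to be independent of $L$ up to scaling corrections and to depend on the universality class, which is why a finite-size analysis of it can help to distinguish O(2) from O(4) behavior.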

preprint, 2014, arXiv

Multilevel Summation for Dispersion: A Linear-Time Algorithm for $r^{-6}$ Potentials

We have extended the multilevel summation (MLS) method, originally developed to evaluate long-range Coulombic interactions in molecular dynamics (MD) simulations [Skeel et al., J. Comput. Chem., 23, 673 (2002)], to handle dispersion interactions. While dispersion potentials are formally short-ranged, accurate calculation of forces and energies in interfacial and inhomogeneous systems requires long-range methods. The MLS method offers some significant advantages compared to the particle-particle particle-mesh and smooth particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared with the $\mathcal{O}(N \log N)$ complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the $r^{-6}$ form of the dispersion interactions. In addition, we have derived error bounds similar to those obtained by Hardy for the electrostatic MLS [Hardy, Ph.D. thesis, University of Illinois at Urbana-Champaign (2006)]. Using a prototype implementation, we demonstrate the linear scaling of the MLS method for dispersion and present results establishing the accuracy and efficiency of the method.