Source author record

Andreas Klöckner

Andreas Klöckner appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math.NA; Mathematical Software; Distributed, Parallel, and Cluster Computing; Numerical Analysis; Performance; Programming Languages; Software Engineering; Computational Engineering, Finance, and Science; cond-mat.mtrl-sci; physics.class-ph

Catalog footprint

What is connected

18 works

10 topics

4 close collaborators

preprint · 2026 · arXiv

Canonicalization of Batched Einstein Summations for Tuning Retrieval

We present an algorithm for normalizing Batched Einstein Summation expressions by mapping mathematically equivalent formulations to a unique normal form. Batches of einsums with the same Einstein notation that exhibit substantial data reuse appear frequently in finite element methods (FEM), numerical linear algebra, and computational chemistry. To effectively exploit this temporal locality for high performance, we consider groups of einsums in batched form. Representations of equivalent batched einsums may differ due to index renaming, permutations within the batch, and the commutativity and associativity of the multiplication operation. The lack of a canonical representation hinders the reuse of optimization and tuning knowledge in software systems. To this end, we develop a novel encoding of batched einsums as colored graphs and apply graph canonicalization to derive a normal form. In addition to the canonicalization algorithm, we propose a representation of einsums using functional array operands and provide a strategy to transfer transformations operating on the normal form to functional batched einsums that exhibit the same normal form, which is crucial for fusing surrounding computations for memory-bound einsums. We evaluate our approach against JAX and observe a geomean speedup of $4.7\times$ for einsums from the TCCG benchmark suite and an FEM solver.
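As a rough illustration of what a normal form provides, the sketch below canonicalizes a single (non-batched) einsum subscript string by brute force: it tries every operand order, renames indices by first appearance, and keeps the lexicographically smallest string. This is a stand-in for illustration only, not the colored-graph canonicalization the abstract describes. The function names are hypothetical, and the sketch assumes at most 26 distinct indices.

```python
from itertools import permutations


def rename_indices(operands, output):
    """Rename indices to a, b, c, ... in order of first appearance."""
    mapping = {}

    def rename(term):
        renamed = []
        for ch in term:
            if ch not in mapping:
                mapping[ch] = chr(ord("a") + len(mapping))
            renamed.append(mapping[ch])
        return "".join(renamed)

    new_operands = [rename(term) for term in operands]
    new_output = rename(output)
    return ",".join(new_operands) + "->" + new_output


def naive_normal_form(subscripts):
    """Brute-force normal form of one einsum (illustrative assumption).

    Operand order is permuted because multiplication commutes; the order
    of the output indices is kept, since it changes the result layout.
    """
    inputs, output = subscripts.split("->")
    operands = inputs.split(",")
    return min(
        rename_indices(list(order), output)
        for order in permutations(operands)
    )


# Two spellings of the same matrix product map to one key.
key_1 = naive_normal_form("ij,jk->ik")
key_2 = naive_normal_form("yz,xy->xz")
```

A key of this kind could index a table of previously tuned variants, which is the reuse scenario the abstract motivates. The brute-force search grows factorially with the operand count, which is one reason a graph-based approach is attractive.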

preprint · 2021 · arXiv

Finite elements for Helmholtz equations with a nonlocal boundary condition

Numerical resolution of exterior Helmholtz problems requires some approach to domain truncation. As an alternative to approximate nonreflecting boundary conditions and invocation of the Dirichlet-to-Neumann map, we introduce a new, nonlocal boundary condition. This condition is exact and requires the evaluation of layer potentials involving the free space Green's function. However, it seems to work in general unstructured geometry, and Galerkin finite element discretization leads to convergence under the usual mesh constraints imposed by Gårding-type inequalities. The nonlocal boundary conditions are readily approximated by fast multipole methods, and the resulting linear system can be preconditioned by the purely local operator involving transmission boundary conditions.

preprint · 2020 · arXiv

A Fast Algorithm with Error Bounds for Quadrature by Expansion

Quadrature by Expansion (QBX) is a quadrature method for approximating the value of the singular integrals encountered in the evaluation of layer potentials. It exploits the smoothness of the layer potential by forming locally valid expansions, which are then evaluated to compute the near or on-surface value of the integral. Recent work towards coupling of a Fast Multipole Method (FMM) to QBX yielded a first step towards the rapid evaluation of such integrals (and the solution of related integral equations), albeit with only empirically understood error behavior. In this paper, we improve upon this approach with a modified algorithm for which we give a comprehensive analysis of error and cost in the case of the Laplace equation in two dimensions. For the same levels of (user-specified) accuracy, the new algorithm empirically has cost-per-accuracy comparable to prior approaches. We provide experimental results to demonstrate scalability and numerical accuracy.

preprint · 2020 · arXiv

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance, parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. Our approach empowers a user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define a model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general as desired. To evaluate our approach, we demonstrate both linear and nonlinear models; each example models execution times for multiple variants of a particular computation: two matrix multiplication variants, four Discontinuous Galerkin (DG) differentiation operation variants, and two 2-D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this customizable approach as a response to a central question in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use, an attribute we believe precludes manual collection of kernel or hardware statistics.

preprint · 2020 · arXiv

A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.

preprint · 2020 · arXiv

On the Approximation of Local Expansions of Laplace Potentials by the Fast Multipole Method

In this paper, we present a generalization of the classical error bounds of Greengard-Rokhlin for the Fast Multipole Method (FMM) for Laplace potentials in three dimensions, extended to the case of local expansion (instead of point) targets. We also present a complementary, less sharp error bound proven via approximation theory whose applicability is not restricted to Laplace potentials. Our study is motivated by the GIGAQBX FMM, an algorithm for the fast, high-order accurate evaluation of layer potentials near and on the source layer. GIGAQBX is based on the FMM, but unlike a conventional FMM, which is designed to evaluate potentials at point-shaped targets, GIGAQBX evaluates local expansions of potentials at ball-shaped targets. Although the accuracy (or the acceleration error, i.e., error due to the approximation of the potential by the fast algorithm) of the conventional FMM is well understood, the acceleration error of FMM-based algorithms applied to the evaluation of local expansions has not been as well studied. The main contribution of this paper is a proof of a set of hypotheses first demonstrated numerically in the paper "A Fast Algorithm for Quadrature by Expansion in Three Dimensions," which pertain to the accuracy of FMM approximation of local expansions of Laplace potentials in three dimensions. These hypotheses are also essential to the three-dimensional error bound for GIGAQBX, which was previously stated conditionally on their truth and can now be stated unconditionally.

preprint · 2016 · arXiv

A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of 'performance-instructive' kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware from multiple hardware generations and vendors. We evaluate the predictive power of the model on a broad array of computational kernels relevant to scientific computing. In terms of the geometric mean, our simple, vendor- and GPU-type-independent model achieves relative accuracy comparable to that of previously published work using hardware-specific models.
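A linear run-time model of the kind described here can be pictured as a weighted sum of operation counts, with the weights fitted to measured kernel times. The sketch below is a minimal two-term version solved through the normal equations. The choice of features (floating-point operations and bytes moved), the calibration numbers, and the function names are all made-up assumptions for illustration and are not taken from the paper.

```python
def fit_two_term_model(counts, times):
    """Least-squares fit of time ~ w0 * count[0] + w1 * count[1]."""
    # Accumulate the 2x2 normal equations (A^T A) w = A^T t.
    s00 = s01 = s11 = b0 = b1 = 0.0
    for (c0, c1), t in zip(counts, times):
        s00 += c0 * c0
        s01 += c0 * c1
        s11 += c1 * c1
        b0 += c0 * t
        b1 += c1 * t
    det = s00 * s11 - s01 * s01
    if det == 0.0:
        raise ValueError("operation counts are linearly dependent")
    # Cramer's rule for the symmetric 2x2 system.
    w0 = (b0 * s11 - s01 * b1) / det
    w1 = (s00 * b1 - s01 * b0) / det
    return w0, w1


def predict(weights, count):
    """Predicted run time for one kernel from its operation counts."""
    return weights[0] * count[0] + weights[1] * count[1]


# Made-up calibration kernels: (floating-point ops, bytes moved).
calibration_counts = [(1e6, 4e6), (8e6, 2e6), (3e6, 9e6), (5e6, 5e6)]
# Hypothetical per-op and per-byte costs used to synthesize "timings".
true_weights = (2e-9, 5e-10)
calibration_times = [predict(true_weights, c) for c in calibration_counts]

weights = fit_two_term_model(calibration_counts, calibration_times)
```

Because the synthetic timings are generated from the same linear form, the fit recovers the hypothetical weights up to rounding. Real measurements would leave a residual, and the size of that residual is what an accuracy evaluation would report.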

preprint · 2015 · arXiv

Loo.py: From Fortran to performance via transformation and substitution rules

A large amount of numerically-oriented code is written and is being written in legacy languages. Much of this code could, in principle, make good use of data-parallel throughput-oriented computer architectures. Loo.py, a transformation-based programming system targeted at GPUs and general data-parallel architectures, provides a mechanism for user-controlled transformation of array programs. This transformation capability is designed to not just apply to programs written specifically for Loo.py, but also those imported from other languages such as Fortran. It eases the trade-off between achieving high performance, portability, and programmability by allowing the user to apply a large and growing family of transformations to an input program. These transformations are expressed in and used from Python and may be applied from a variety of settings, including a pragma-like manner from other languages.

preprint · 2014 · arXiv

Loo.py: transformation-based code generation for GPUs and CPUs

Today's highly heterogeneous computing landscape places a burden on programmers wanting to achieve high performance on a reasonably broad cross-section of machines. To do so, computations need to be expressed in many different but mathematically equivalent ways, with, in the worst case, one variant per target machine. Loo.py, a programming system embedded in Python, meets this challenge by defining a data model for array-style computations and a library of transformations that operate on this model. Offering transformations such as loop tiling, vectorization, storage management, unrolling, instruction-level parallelism, change of data layout, and many more, it provides a convenient way to capture, parametrize, and re-unify the growth among code variants. Optional, deep integration with numpy and PyOpenCL provides a convenient computing environment where the transition from prototype to high-performance implementation can occur in a gradual, machine-assisted form.
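To make "mathematically equivalent variants" concrete, the sketch below writes one matrix-vector product twice in plain Python: once with a straightforward loop nest and once with the inner loop tiled. It does not use Loo.py or its API. It only shows the kind of rewrite that a transformation such as loop tiling performs, and the function names and tile size are illustrative assumptions.

```python
def matvec_untiled(a, x):
    """Reference loop nest: y[i] = sum_j a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_tiled(a, x, tile=4):
    """Same computation with the j loop split into tiles of width `tile`."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for j_outer in range(0, m, tile):
        # Each tile of x is reused across all rows before moving on.
        for i in range(n):
            for j in range(j_outer, min(j_outer + tile, m)):
                y[i] += a[i][j] * x[j]
    return y


# Small made-up input whose column count is not a multiple of the tile.
a = [[float(i * 10 + j) for j in range(10)] for i in range(3)]
x = [float(j + 1) for j in range(10)]

y_ref = matvec_untiled(a, x)
y_tiled = matvec_tiled(a, x, tile=4)
```

Both variants accumulate each `y[i]` in the same order of `j`, so they agree exactly here. Maintaining such variants by hand for each target machine is the burden the abstract describes.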

preprint · 2014 · arXiv

Visualizing Skin Effects in Conductors with MRI: ${}^7$Li MRI Experiments and Calculations

While experiments on metals have been performed since the early days of NMR (and DNP), the use of bulk metal is normally avoided. Instead, powders have often been used in combination with low fields, so that skin depth effects could be neglected. Another complicating factor of acquiring NMR spectra or MRI images of bulk metal is the strong signal dependence on the orientation between the sample and the radio frequency (RF) coil, leading to non-intuitive image distortions and inaccurate quantification. Such factors are particularly important for NMR and MRI of batteries and other electrochemical devices. Here, we show results from a systematic study combining RF field calculations with experimental MRI of $^7$Li metal to visualize skin depth effects directly and to analyze the RF field orientation effect on MRI of bulk metal. It is shown that a certain degree of selectivity can be achieved for particular faces of the metal, simply based on the orientation of the sample. By combining RF field calculations with bulk magnetic susceptibility calculations, accurate NMR spectra can be obtained from first principles. Such analyses will become valuable in many applications involving battery systems, but also metals in general.

preprint · 2013 · arXiv

GPU Scripting and Code Generation with PyCUDA

High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

preprint · 2013 · arXiv

On the convergence of local expansions of layer potentials

In a recently developed quadrature method (quadrature by expansion or QBX), it was demonstrated that weakly singular or singular layer potentials can be evaluated rapidly and accurately on surface by making use of local expansions about carefully chosen off-surface points. In this paper, we derive estimates for the rate of convergence of these local expansions, providing the analytic foundation for the QBX method. The estimates may also be of mathematical interest, particularly for microlocal or asymptotic analysis in potential theory.
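The flavor of such convergence estimates can be seen in the simplest possible setting: a single point source of the two-dimensional Laplace kernel, expanded about a center away from the source. The sketch below is only that elementary case, not QBX and not the layer-potential estimates derived in the paper. It uses the series log(1 + w) = w - w^2/2 + w^3/3 - ..., valid for |w| < 1, and all names are illustrative assumptions.

```python
import cmath
import math


def local_expansion_log(source, center, target, order):
    """Approximate log|target - source| by a truncated expansion about center.

    Points are complex numbers. With w = (target - center) / (center - source),
    log(target - source) = log(center - source) + log(1 + w), and log(1 + w)
    is replaced by its first `order` Taylor terms.
    """
    w = (target - center) / (center - source)
    if abs(w) >= 1:
        raise ValueError("target lies outside the disk of convergence")
    total = cmath.log(center - source)
    power = 1.0 + 0.0j
    for k in range(1, order + 1):
        power *= w
        total += (-1) ** (k + 1) * power / k
    return total.real


# Made-up geometry: source at the origin, expansion center at 1.
source, center, target = 0j, 1 + 0j, 1.3 + 0.2j
exact = math.log(abs(target - source))
ratio = abs(target - center) / abs(center - source)

errors = {
    p: abs(local_expansion_log(source, center, target, p) - exact)
    for p in (2, 4, 8, 16)
}
# Tail bound of the series: ratio**(p + 1) / ((p + 1) * (1 - ratio)).
bounds = {p: ratio ** (p + 1) / ((p + 1) * (1 - ratio)) for p in errors}
```

The truncation error stays below a bound that decays geometrically in the expansion order, at a rate set by the ratio of the target distance to the source distance as measured from the center.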

preprint · 2013 · arXiv

Quadrature by Expansion: A New Method for the Evaluation of Layer Potentials

Integral equation methods for the solution of partial differential equations, when coupled with suitable fast algorithms, yield geometrically flexible, asymptotically optimal and well-conditioned schemes in either interior or exterior domains. The practical application of these methods, however, requires the accurate evaluation of boundary integrals with singular, weakly singular or nearly singular kernels. Historically, these issues have been handled either by low-order product integration rules (computed semi-analytically), by singularity subtraction/cancellation, by kernel regularization and asymptotic analysis, or by the construction of special purpose "generalized Gaussian quadrature" rules. In this paper, we present a systematic, high-order approach that works for any singularity (including hypersingular kernels), based only on the assumption that the field induced by the integral operator is locally smooth when restricted to either the interior or the exterior. Discontinuities in the field across the boundary are permitted. The scheme, denoted QBX (quadrature by expansion), is easy to implement and compatible with fast hierarchical algorithms such as the fast multipole method. We include accuracy tests for a variety of integral operators in two dimensions on smooth and corner domains.

preprint · 2013 · arXiv

Solving Wave Equations on Unstructured Geometries

Waves are all around us, be it in the form of sound, electromagnetic radiation, water waves, or earthquakes. Their study is an important basic tool across engineering and science disciplines. Every wave solver serving the computational study of waves meets a trade-off of two figures of merit: its computational speed and its accuracy. Discontinuous Galerkin (DG) methods fall on the high-accuracy end of this spectrum. Fortuitously, their computational structure is so ideally suited to GPUs that they also achieve very high computational speeds. In other words, the use of DG methods on GPUs significantly lowers the cost of obtaining accurate solutions. This article aims to give the reader an easy on-ramp to the use of this technology, based on a sample implementation which demonstrates a highly accurate, GPU-capable, real-time visualizing finite element solver in about 1500 lines of code.

preprint · 2012 · arXiv

A consistency condition for the vector potential in multiply-connected domains

A classical problem in electromagnetics concerns the representation of the electric and magnetic fields in the low-frequency or static regime, where topology plays a fundamental role. For multiply connected conductors, at zero frequency the standard boundary conditions on the tangential components of the magnetic field do not uniquely determine the vector potential. We describe a (gauge-invariant) consistency condition that overcomes this non-uniqueness and resolves a longstanding difficulty in inverting the magnetic field integral equation.

preprint · 2012 · arXiv

High-Order Discontinuous Galerkin Methods by GPU Metaprogramming

Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. In a recent publication, we have shown that DG methods also adapt readily to execution on modern, massively parallel graphics processors (GPUs). A number of qualities of the method contribute to this suitability, reaching from locality of reference, through regularity of access patterns, to high arithmetic intensity. In this article, we illuminate a few of the more practical aspects of bringing DG onto a GPU, including the use of a Python-based metaprogramming infrastructure that was created specifically to support DG, but has found many uses across all disciplines of computational science.

preprint · 2011 · arXiv

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
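The idea of run-time code generation can be sketched without a GPU: a host program builds source text specialized to values known only at run time, compiles it, and calls the result. The example below is a CPU-only analogy in pure Python that uses `exec` as the compiler. It does not use PyCUDA or PyOpenCL, where the generated text would instead be kernel source handed to the GPU toolchain, and the function names are hypothetical.

```python
def generate_axpy_source(n):
    """Return Python source for a fully unrolled y[i] += alpha * x[i]."""
    lines = ["def axpy(alpha, x, y):"]
    for i in range(n):
        # The loop runs at generation time; the emitted code has no loop.
        lines.append(f"    y[{i}] += alpha * x[{i}]")
    lines.append("    return y")
    return "\n".join(lines)


def compile_axpy(n):
    """Compile the generated source and return the specialized function."""
    namespace = {}
    exec(generate_axpy_source(n), namespace)
    return namespace["axpy"]


# Specialize for a vector length that is only known at run time.
axpy4 = compile_axpy(4)
result = axpy4(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0])
```

The length-specific function is created only when the length is known, which mirrors the two-tier split the abstract proposes: a scripting layer that decides what code to generate, and a compiled layer that runs it.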

preprint · 2011 · arXiv

Viscous Shock Capturing in a Time-Explicit Discontinuous Galerkin Method

We present a novel, cell-local shock detector for use with discontinuous Galerkin (DG) methods. The output of this detector is a reliably scaled, element-wise smoothness estimate which is suited as a control input to a shock capture mechanism. Using an artificial viscosity in the latter role, we obtain a DG scheme for the numerical solution of nonlinear systems of conservation laws. Building on work by Persson and Peraire, we thoroughly justify the detector's design and analyze its performance on a number of benchmark problems. We further explain the scaling and smoothing steps necessary to turn the output of the detector into a local, artificial viscosity. We close by providing an extensive array of numerical tests of the detector in use.
