Researcher profile

Andreas Klöckner

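Andreas Klöckner's listed work covers fast algorithms for layer potentials, finite element methods, and performance modeling and code transformation for high-performance computing.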

6 works, 0 followers, 5 topics, 4 close collaborators

Published work

6 published items

Preprint, 2026, arXiv

Canonicalization of Batched Einstein Summations for Tuning Retrieval

We present an algorithm for normalizing Batched Einstein Summation expressions by mapping mathematically equivalent formulations to a unique normal form. Batches of einsums with the same Einstein notation that exhibit substantial data reuse appear frequently in finite element methods (FEM), numerical linear algebra, and computational chemistry. To effectively exploit this temporal locality for high performance, we consider groups of einsums in batched form. Representations of equivalent batched einsums may differ due to index renaming, permutations within the batch, and the commutativity and associativity of the multiplication operation. The lack of a canonical representation hinders the reuse of optimization and tuning knowledge in software systems. To this end, we develop a novel encoding of batched einsums as colored graphs and apply graph canonicalization to derive a normal form. In addition to the canonicalization algorithm, we propose a representation of einsums using functional array operands and provide a strategy to transfer transformations operating on the normal form to functional batched einsums that exhibit the same normal form, which is crucial for fusing surrounding computations for memory-bound einsums. We evaluate our approach against JAX and observe a geomean speedup of 4.7× for einsums from the TCCG benchmark suite and an FEM solver.
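As a rough illustration of what a normal form provides, the sketch below canonicalizes a single einsum subscript string under index renaming and operand reordering. It is not the colored-graph canonicalization described in the abstract: it simply brute-forces every operand order, and the function names (`rename_by_first_use`, `canonical_einsum`) are hypothetical. It also assumes that operands can be freely reordered, that there are at most 26 distinct indices, and that every output index appears in some operand.

```python
from itertools import permutations


def rename_by_first_use(operands, output):
    """Rename indices to a, b, c, ... in order of first appearance."""
    mapping = {}

    def rename(term):
        # setdefault assigns the next unused letter only to unseen indices.
        return "".join(
            mapping.setdefault(ch, chr(ord("a") + len(mapping))) for ch in term
        )

    new_operands = tuple(rename(term) for term in operands)
    return new_operands, rename(output)


def canonical_einsum(spec):
    """Return a normal form for a spec such as 'ij,jk->ik' (brute force)."""
    inputs, output = spec.split("->")
    operands = inputs.split(",")
    candidates = []
    # Multiplication commutes, so every operand order is equivalent.
    for order in permutations(operands):
        ops, out = rename_by_first_use(order, output)
        candidates.append(",".join(ops) + "->" + out)
    # The lexicographically smallest candidate serves as the normal form.
    return min(candidates)


# Two spellings of the same matrix product share one normal form.
print(canonical_einsum("ij,jk->ik"))
print(canonical_einsum("kl,xk->xl"))
# A transposed output is a different computation and keeps a different form.
print(canonical_einsum("ij,jk->ki"))
```

A shared normal form can then act as a lookup key for previously tuned variants. The factorial cost of trying every operand order is one reason a graph-based method is preferable for larger expressions.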

Preprint, 2021, arXiv

Finite elements for Helmholtz equations with a nonlocal boundary condition

Numerical resolution of exterior Helmholtz problems requires some approach to domain truncation. As an alternative to approximate nonreflecting boundary conditions and invocation of the Dirichlet-to-Neumann map, we introduce a new, nonlocal boundary condition. This condition is exact and requires the evaluation of layer potentials involving the free space Green's function. However, it seems to work in general unstructured geometry, and Galerkin finite element discretization leads to convergence under the usual mesh constraints imposed by Gårding-type inequalities. The nonlocal boundary conditions are readily approximated by fast multipole methods, and the resulting linear system can be preconditioned by the purely local operator involving transmission boundary conditions.

Preprint, 2020, arXiv

A Fast Algorithm with Error Bounds for Quadrature by Expansion

Quadrature by Expansion (QBX) is a quadrature method for approximating the value of the singular integrals encountered in the evaluation of layer potentials. It exploits the smoothness of the layer potential by forming locally valid expansions, which are then evaluated to compute the near or on-surface value of the integral. Recent work towards coupling of a Fast Multipole Method (FMM) to QBX yielded a first step towards the rapid evaluation of such integrals (and the solution of related integral equations), albeit with only empirically understood error behavior. In this paper, we improve upon this approach with a modified algorithm for which we give a comprehensive analysis of error and cost in the case of the Laplace equation in two dimensions. For the same levels of (user-specified) accuracy, the new algorithm empirically has cost-per-accuracy comparable to prior approaches. We provide experimental results to demonstrate scalability and numerical accuracy.
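The idea of a locally valid expansion can be sketched for the simplest case: a single point source and the two-dimensional Laplace kernel, which up to a constant factor is log|z - s| when points are written as complex numbers. This is only an illustration under those assumptions, not the QBX algorithm itself, which expands a layer potential integrated over a curve. The function names and the sample points are hypothetical.

```python
import cmath
import math


def local_expansion_coeffs(source, center, order):
    """Taylor coefficients of log(z - source) about `center`, up to `order`."""
    d = center - source
    coeffs = [cmath.log(d)]
    for k in range(1, order + 1):
        # log(d + h) = log(d) + sum_{k>=1} (-1)**(k+1) / k * (h / d)**k
        coeffs.append((-1) ** (k + 1) / (k * d ** k))
    return coeffs


def evaluate_local_expansion(coeffs, center, target):
    """Horner evaluation of sum_k coeffs[k] * (target - center)**k."""
    h = target - center
    result = 0j
    for c in reversed(coeffs):
        result = result * h + c
    return result


source = 0.0 + 0.0j   # hypothetical point source
center = 1.0 + 0.0j   # hypothetical expansion center
target = 1.3 + 0.2j   # must satisfy |target - center| < |center - source|

exact = math.log(abs(target - source))
errors = []
for order in (2, 5, 10, 20):
    coeffs = local_expansion_coeffs(source, center, order)
    approx = evaluate_local_expansion(coeffs, center, target).real
    errors.append(abs(approx - exact))
    print(order, errors[-1])
```

The truncated series converges only inside the disk around the center that excludes the source, which is why the choice of expansion radius and order drives the error analysis.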

Preprint, 2020, arXiv

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high-performance parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. Our approach empowers a user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define a model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general as desired. To evaluate our approach, we demonstrate both linear and nonlinear models; each example models execution times for multiple variants of a particular computation: two matrix multiplication variants, four Discontinuous Galerkin (DG) differentiation operation variants, and two 2-D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this customizable approach as a response to a central question in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use, an attribute we believe precludes manual collection of kernel or hardware statistics?
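A minimal sketch of the black-box calibration idea, assuming a purely linear model with two hypothetical features (floating-point operations and global memory accesses). The numbers are synthetic, generated from made-up per-operation costs rather than measured on any GPU, and the symbolic operation counting and modeling tool described in the abstract are not reproduced here.

```python
def fit_two_feature_model(counts, times):
    """Least-squares fit of t ~ p0*c0 + p1*c1 via the 2x2 normal equations."""
    s00 = sum(c0 * c0 for c0, _ in counts)
    s01 = sum(c0 * c1 for c0, c1 in counts)
    s11 = sum(c1 * c1 for _, c1 in counts)
    b0 = sum(c0 * t for (c0, _), t in zip(counts, times))
    b1 = sum(c1 * t for (_, c1), t in zip(counts, times))
    det = s00 * s11 - s01 * s01
    if det == 0:
        raise ValueError("benchmark kernels do not separate the two features")
    # Cramer's rule for the symmetric 2x2 system.
    p0 = (b0 * s11 - b1 * s01) / det
    p1 = (s00 * b1 - s01 * b0) / det
    return p0, p1


def predict(params, kernel_counts):
    """Predicted time for a kernel with the given operation counts."""
    return sum(p * c for p, c in zip(params, kernel_counts))


# Synthetic calibration set: (flops, memory accesses) per benchmark kernel.
benchmark_counts = [(1e6, 2e5), (2e6, 1e5), (4e6, 8e5), (3e6, 3e6)]
# Made-up per-operation costs in seconds, used only to generate the data.
true_params = (2e-9, 5e-9)
benchmark_times = [predict(true_params, c) for c in benchmark_counts]

fitted = fit_two_feature_model(benchmark_counts, benchmark_times)
print(fitted)
print(predict(fitted, (5e6, 1e6)))
```

The fitted parameters play the role of machine-specific costs, so the same operation counts can be reused with a different calibration on another device. Adding features or nonlinear terms trades evaluation speed and generality for accuracy.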

Preprint, 2020, arXiv

A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.

Preprint, 2020, arXiv

On the Approximation of Local Expansions of Laplace Potentials by the Fast Multipole Method

In this paper, we present a generalization of the classical error bounds of Greengard-Rokhlin for the Fast Multipole Method (FMM) for Laplace potentials in three dimensions, extended to the case of local expansion (instead of point) targets. We also present a complementary, less sharp error bound proven via approximation theory whose applicability is not restricted to Laplace potentials. Our study is motivated by the GIGAQBX FMM, an algorithm for the fast, high-order accurate evaluation of layer potentials near and on the source layer. GIGAQBX is based on the FMM, but unlike a conventional FMM, which is designed to evaluate potentials at point-shaped targets, GIGAQBX evaluates local expansions of potentials at ball-shaped targets. Although the accuracy (or the acceleration error, i.e., error due to the approximation of the potential by the fast algorithm) of the conventional FMM is well understood, the acceleration error of FMM-based algorithms applied to the evaluation of local expansions has not been as well studied. The main contribution of this paper is a proof of a set of hypotheses first demonstrated numerically in the paper "A Fast Algorithm for Quadrature by Expansion in Three Dimensions," which pertain to the accuracy of FMM approximation of local expansions of Laplace potentials in three dimensions. These hypotheses are also essential to the three-dimensional error bound for GIGAQBX, which was previously stated conditionally on their truth and can now be stated unconditionally.