Source author record

Christian Plessl

Christian Plessl appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Distributed, Parallel, and Cluster Computing; physics.comp-ph; physics.chem-ph; cond-mat.mtrl-sci; Hardware Architecture; math.NA; Performance; cond-mat.stat-mech; cond-mat.str-el; math.RA; Numerical Analysis; quant-ph; Quantitative Methods

Catalog footprint

What is connected

11 works

13 topics

4 close collaborators

preprint | 2022 | arXiv

Breaking the Exascale Barrier for the Electronic Structure Problem in Ab-Initio Molecular Dynamics

The non-orthogonal local submatrix method applied to electronic-structure based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32 mixed floating-point arithmetic when using 4,400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.

preprint | 2022 | arXiv

CP2K on the road to exascale

The CP2K program package, which can be considered the Swiss Army knife of atomistic simulations, is presented with a special emphasis on ab-initio molecular dynamics using the second-generation Car-Parrinello method. After outlining current and near-term development efforts with regard to massively parallel low-scaling post-Hartree-Fock and eigenvalue solvers, novel approaches on how we plan to take full advantage of future low-precision hardware architectures are introduced. Our focus here is on combining our submatrix method with the approximate computing paradigm to address the imminent exascale era.

preprint | 2022 | arXiv

Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks

While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces. In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations.

preprint | 2022 | arXiv

Parallel Quantum Chemistry on Noisy Intermediate-Scale Quantum Computers

A novel parallel hybrid quantum-classical algorithm for the solution of the quantum-chemical ground-state energy problem on gate-based quantum computers is presented. This approach is based on the reduced density-matrix functional theory (RDMFT) formulation of the electronic structure problem. For that purpose, the density-matrix functional of the full system is decomposed into an indirectly coupled sum of density-matrix functionals for all its subsystems using the adaptive cluster approximation to RDMFT. The approximations involved in the decomposition and the adaptive cluster approximation itself can be systematically converged to the exact result. The solutions for the density-matrix functionals of the effective subsystems involve a constrained minimization over many-particle states that are approximated by parametrized trial states on the quantum computer, similarly to the variational quantum eigensolver. The independence of the density-matrix functionals of the effective subsystems introduces a new level of parallelization and allows for the computational treatment of much larger molecules on a quantum computer with a given qubit count. In addition, techniques are presented for the proposed algorithm to reduce the qubit count, the number of quantum programs, and their depth. The new approach is demonstrated for Hubbard-like systems on IBM quantum computers based on superconducting transmon qubits.

preprint | 2022 | arXiv

Towards Electronic Structure-Based Ab-Initio Molecular Dynamics Simulations with Hundreds of Millions of Atoms

We push the boundaries of electronic structure-based ab-initio molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.
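The three figures quoted in this abstract (324 PFLOP/s sustained, 67.7% efficiency, 1536 GPUs) can be cross-checked with simple arithmetic. The sketch below is not from the paper; the function name is hypothetical, and it assumes that "efficiency" means sustained performance divided by aggregate peak. Under that assumption the implied per-GPU peak comes out close to the 312 TFLOP/s that NVIDIA publishes as the dense FP16 tensor-core peak of an A100.

```python
def implied_peak_per_gpu_tflops(sustained_pflops, efficiency, n_gpus):
    """Per-GPU peak (TFLOP/s) implied by a sustained rate and an efficiency.

    Assumes efficiency = sustained / (n_gpus * per_gpu_peak); this reading
    of the abstract is an assumption, not something the source states.
    """
    total_peak_pflops = sustained_pflops / efficiency
    return total_peak_pflops / n_gpus * 1000.0  # PFLOP/s -> TFLOP/s


# Figures taken from the abstract above.
per_gpu = implied_peak_per_gpu_tflops(324.0, 0.677, 1536)
print(f"{per_gpu:.1f} TFLOP/s per GPU")  # prints "311.6 TFLOP/s per GPU"
```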

preprint | 2020 | arXiv

Accurate Sampling with Noisy Forces from Approximate Computing

In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever-growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low-precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations, while still obtaining exact expectation values using a properly modified Langevin-type equation.

preprint | 2020 | arXiv

CP2K: An Electronic Structure and Molecular Dynamics Software Package -- Quickstep: Efficient and Accurate Electronic Structure Calculations

CP2K is an open source electronic structure and molecular dynamics software package to perform atomistic simulations of solid-state, liquid, molecular and biological systems. It is especially aimed at massively-parallel and linear-scaling electronic structure methods and state-of-the-art ab-initio molecular dynamics simulations. Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern high-performance computing systems. This review revisits the main capabilities of CP2K to perform efficient and accurate electronic structure simulations. The emphasis is put on density functional theory and multiple post-Hartree-Fock methods using the Gaussian and plane wave approach and its augmented all-electron extension.

preprint | 2020 | arXiv

Efficient Ab-Initio Molecular Dynamic Simulations by Offloading Fast Fourier Transformations to FPGAs

A large share of today's HPC workloads is used for Ab-Initio Molecular Dynamics (AIMD) simulations, where the interatomic forces are computed on-the-fly by means of accurate electronic structure calculations. They are computationally intensive and thus constitute an interesting application class for energy-efficient hardware accelerators such as FPGAs. In this paper, we investigate the potential of offloading 3D Fast Fourier Transformations (FFTs), a critical routine of plane-wave-based electronic structure calculations, to FPGAs, and in conjunction demonstrate the tolerance of these simulations to lower-precision computations.

preprint | 2020 | arXiv

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite

FPGAs have found increasing adoption in data center applications since a new generation of high-level tools has become available that noticeably reduces development time for FPGA accelerators while still providing high quality of results. There is, however, no high-level benchmark suite available that specifically enables a comparison of FPGA architectures, programming tools and libraries for HPC applications. To fill this gap, we have developed an OpenCL-based open source implementation of the HPCC benchmark suite for Xilinx and Intel FPGAs. This benchmark can serve to analyze the current capabilities of FPGA devices, cards and development tool flows, track progress over time and point out specific difficulties for FPGA acceleration in the HPC domain. Additionally, the benchmark documents proven performance optimization patterns. We will continue optimizing and porting the benchmark for new generations of FPGAs and design tools and encourage active participation to create a valuable tool for the community.

preprint | 2018 | arXiv

A General Algorithm to Calculate the Inverse Principal $p$-th Root of Symmetric Positive Definite Matrices

We address the general mathematical problem of computing the inverse $p$-th root of a given matrix in an efficient way. A new method to construct iteration functions that allow calculating arbitrary $p$-th roots and their inverses of symmetric positive definite matrices is presented. We show that the order of convergence is at least quadratic and that adaptively adjusting a parameter $q$ always leads to an even faster convergence. In this way, a better performance than with previously known iteration schemes is achieved. The efficiency of the iterative functions is demonstrated for various matrices with different densities, condition numbers and spectral radii.
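For orientation, the kind of iteration this abstract refers to can be illustrated with a well-known Newton-type scheme for the inverse p-th root, X_{k+1} = X_k ((p+1) I - X_k^p A) / p. The sketch below is that classical scheme only, not the generalized iteration functions or the adaptive parameter q proposed in the paper. The function name, the tolerance, and the starting guess X_0 = I / ||A||_2^(1/p) are assumptions chosen so that the eigenvalues of X_0^p A lie in (0, 1], which is enough for convergence when A is symmetric positive definite.

```python
import numpy as np


def inverse_pth_root(A, p, tol=1e-12, max_iter=100):
    """Approximate A^(-1/p) for a symmetric positive definite matrix A.

    Classical Newton-type iteration (illustrative only, not the paper's
    generalized scheme): X <- X ((p + 1) I - X^p A) / p.
    """
    n = A.shape[0]
    identity = np.eye(n)
    # Assumed starting guess: scale so the spectrum of X^p A lies in (0, 1].
    X = identity / np.linalg.norm(A, 2) ** (1.0 / p)
    for _ in range(max_iter):
        residual = identity - np.linalg.matrix_power(X, p) @ A
        if np.linalg.norm(residual, "fro") < tol:
            break
        # X (I + R / p) equals X ((p + 1) I - X^p A) / p.
        X = X @ (identity + residual / p)
    return X


# Small SPD example: X should satisfy X^3 A = I up to rounding.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
X = inverse_pth_root(A, 3)
error = np.linalg.norm(np.linalg.matrix_power(X, 3) @ A - np.eye(2))
```

Because the starting guess is a multiple of the identity, every iterate commutes with A, so the matrix iteration behaves like the scalar one applied to each eigenvalue.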

preprint | 2014 | arXiv

Easy-to-Use On-the-Fly Binary Program Acceleration on Many-Cores

This paper introduces Binary Acceleration At Runtime (BAAR), an easy-to-use on-the-fly binary acceleration mechanism which aims to tackle the problem of enabling existing software to automatically utilize accelerators at runtime. BAAR is based on the LLVM Compiler Infrastructure and has a client-server architecture. The client runs the program to be accelerated in an environment which allows program analysis and profiling. Program parts which are identified as suitable for the available accelerator are exported and sent to the server. The server optimizes these program parts for the accelerator and provides RPC execution for the client. The client transforms its program to utilize accelerated execution on the server for offloaded program parts. We evaluate our work with a proof-of-concept implementation of BAAR that uses an Intel Xeon Phi 5110P as the acceleration target and performs automatic offloading, parallelization and vectorization of suitable program parts. The practicality of BAAR for real-world examples is shown based on a study of stencil codes. Our results show a speedup of up to 4x without any developer-provided hints and 5.77x with hints over the same code compiled with the Intel Compiler at optimization level O2 and running on an Intel Xeon E5-2670 machine. Based on the insights gained during implementation and evaluation, we outline future directions of research, e.g., offloading more fine-granular program parts than functions, a more sophisticated communication mechanism, or introducing on-stack-replacement.
