Source author record

Michael Bader

Michael Bader appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Distributed, Parallel, and Cluster Computing; Mathematical Software; physics.comp-ph; math.NA; Numerical Analysis; Performance; physics.geo-ph; Software Engineering; General Literature; Graphics

Catalog footprint

What is connected

10 works

10 topics

4 close collaborators


preprint, 2022, arXiv

An Efficient ADER-DG Local Time Stepping Scheme for 3D HPC Simulation of Seismic Waves in Poroelastic Media

Many applications from geosciences require simulations of seismic waves in porous media. Biot's theory of poroelasticity describes the coupling between solid and fluid phases and introduces a stiff source term, thereby increasing computational cost and motivating efficient methods utilising High-Performance Computing. We present a novel realisation of the discontinuous Galerkin scheme with Arbitrary DERivative time stepping (ADER-DG) that copes with stiff source terms. To integrate this source term with a reasonable time step size, we use an element-local space-time predictor, which needs to solve medium-sized linear systems - with 1000 to 10000 unknowns - in each element update (i.e., billions of times). We present a novel block-wise back-substitution algorithm for solving these systems efficiently. In comparison to LU decomposition, we reduce the number of floating-point operations by a factor of up to 25. The block-wise back-substitution is mapped to a sequence of small matrix-matrix multiplications, for which code generators are available to generate highly optimised code. We verify the new solver thoroughly in problems of increasing complexity. We demonstrate high-order convergence for 3D problems. We verify the correct treatment of point sources, material interfaces and traction-free boundary conditions. In addition, we compare against a finite difference code for a newly defined layer over half-space problem. We find that extremely high accuracy is required to resolve the slow P-wave at a free surface, while solid particle velocities are not affected by coarser resolutions. By using a clustered local time stepping scheme, we reduce time to solution by a factor of 6 to 10 compared to global time stepping. We conclude our study with a scaling and performance analysis, demonstrating our implementation's efficiency and its potential for extreme-scale simulations.
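The abstract does not say how elements are grouped for clustered local time stepping. A minimal sketch, assuming rate-2 clustering (cluster k advances with 2^k times the smallest stable time step), shows where the savings over global time stepping come from. The function names and the example time steps are hypothetical, not taken from the paper.

```python
def assign_clusters(element_dts, rate=2):
    """Map each element's stable time step to a cluster index.

    Assumption: cluster k advances with dt_min * rate**k, the largest
    such step that does not exceed the element's own stable time step.
    """
    dt_min = min(element_dts)
    clusters = []
    for dt in element_dts:
        k = 0
        while dt_min * rate ** (k + 1) <= dt:
            k += 1
        clusters.append(k)
    return dt_min, clusters


def update_counts(element_dts, rate=2):
    """Element updates needed to advance by one step of the largest cluster.

    Returns (updates with global time stepping, updates with clustering).
    """
    _, clusters = assign_clusters(element_dts, rate)
    k_max = max(clusters)
    # Global time stepping: every element takes the smallest step.
    global_updates = len(element_dts) * rate ** k_max
    # Clustered: an element in cluster k needs rate**(k_max - k) updates.
    local_updates = sum(rate ** (k_max - k) for k in clusters)
    return global_updates, local_updates


# Hypothetical stable time steps for six elements.
dts = [1.0, 1.0, 2.5, 4.0, 9.0, 9.0]
print(assign_clusters(dts))  # (1.0, [0, 0, 1, 2, 3, 3])
print(update_counts(dts))    # (48, 24)
```

In this toy mesh the clustered scheme needs half the element updates. The factor of 6 to 10 reported in the abstract depends on the real distribution of element sizes and wave speeds.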

preprint, 2020, arXiv

A stable discontinuous Galerkin method for the perfectly matched layer for elastodynamics in first order form

We present a stable discontinuous Galerkin (DG) method with a perfectly matched layer (PML) for three and two space dimensional linear elastodynamics, in velocity-stress formulation, subject to well-posed linear boundary conditions. First, we consider the elastodynamics equation, in a cuboidal domain, and derive an unsplit PML truncating the domain using complex coordinate stretching. Leveraging the hyperbolic structure of the underlying system, we construct continuous energy estimates, in the time domain for the elastic wave equation, and in the Laplace space for a sequence of PML model problems, with variations in one, two and three space dimensions, respectively. They correspond to PMLs normal to boundary faces, along edges and in corners. Second, we develop a DG numerical method for the linear elastodynamics equation using physically motivated numerical flux and penalty parameters, which are compatible with all well-posed, internal and external, boundary conditions. When the PML damping vanishes, by construction, our choice of penalty parameters yields an upwind scheme and a discrete energy estimate analogous to the continuous energy estimate. Third, to ensure numerical stability of the discretization when PML damping is present, it is necessary to extend the numerical DG fluxes, and the numerical inter-element and boundary procedures, to the PML auxiliary differential equations. This is crucial for deriving discrete energy estimates analogous to the continuous energy estimates. By combining the DG spatial approximation with the high-order ADER time stepping scheme and the accuracy of the PML, we obtain an arbitrarily accurate wave propagation solver in the time domain. Numerical experiments are presented in two and three space dimensions corroborating the theoretical results.

preprint, 2020, arXiv

An Environment for Sustainable Research Software in Germany and Beyond: Current State, Open Challenges, and Call for Action

Research software has become a central asset in academic research. It optimizes existing and enables new research methods, implements and embeds research knowledge, and constitutes an essential research product in itself. Research software must be sustainable in order to understand, replicate, reproduce, and build upon existing research or conduct new research effectively. In other words, software must be available, discoverable, usable, and adaptable to new needs, both now and in the future. Research software therefore requires an environment that supports sustainability. Hence, a change is needed in the way research software development and maintenance are currently motivated, incentivized, funded, structurally and infrastructurally supported, and legally treated. Failing to do so will threaten the quality and validity of research. In this paper, we identify challenges for research software sustainability in Germany and beyond, in terms of motivation, selection, research software engineering personnel, funding, infrastructure, and legal aspects. Besides researchers, we specifically address political and academic decision-makers to increase awareness of the importance and needs of sustainable research software practices. In particular, we recommend strategies and measures to create an environment for sustainable research software, with the ultimate goal to ensure that software-driven research is valid, reproducible and sustainable, and that software is recognized as a first class citizen in research. This paper is the outcome of two workshops run in Germany in 2019, at deRSE19 - the first International Conference of Research Software Engineers in Germany - and a dedicated DFG-supported follow-up workshop in Berlin.

preprint, 2020, arXiv

ExaHyPE: An Engine for Parallel Dynamically Adaptive Simulations of Wave Problems

ExaHyPE ("An Exascale Hyperbolic PDE Engine") is a software engine for solving systems of first-order hyperbolic partial differential equations (PDEs). Hyperbolic PDEs are typically derived from the conservation laws of physics and are useful in a wide range of application areas. Applications powered by ExaHyPE can be run on a student's laptop, but are also able to exploit thousands of processor cores on state-of-the-art supercomputers. The engine is able to dynamically increase the accuracy of the simulation using adaptive mesh refinement where required. Due to the robustness and shock capturing abilities of ExaHyPE's numerical methods, users of the engine can simulate linear and non-linear hyperbolic PDEs with very high accuracy. Users can tailor the engine to their particular PDE by specifying evolved quantities, fluxes, and source terms. A complete simulation code for a new hyperbolic PDE can often be realised within a few hours - a task that, traditionally, can take weeks, months, often years for researchers starting from scratch. In this paper, we showcase ExaHyPE's workflow and capabilities through real-world scenarios from our two main application areas: seismology and astrophysics.

preprint, 2020, arXiv

Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement

Balancing the workload of sophisticated simulations is inherently difficult, since we have to balance both computational workload and memory footprint over meshes that can change at any time or yield unpredictable cost per mesh entity, while modern supercomputers and their interconnects start to exhibit fluctuating performance. We propose a novel lightweight balancing technique for MPI+X to accompany traditional, prediction-based load balancing. It is a reactive diffusion approach that uses online measurements of MPI idle time to migrate tasks temporarily from overloaded to underemployed ranks. Tasks are deployed to ranks which otherwise would wait, processed with high priority, and made available to the overloaded ranks again. This migration is non-persistent. Our approach hijacks idle time to do meaningful work and is totally non-blocking, asynchronous and distributed without a global data view. Tests with a seismic simulation code developed in the ExaHyPE engine uncover the method's potential. We found speed-ups of up to a factor of 2 to 3 for ill-balanced scenarios without logical modifications of the code base and show that the strategy is capable of reacting quickly to temporarily changing workload or node performance.

preprint, 2020, arXiv

Role-Oriented Code Generation in an Engine for Solving Hyperbolic PDE Systems

The development of a high-performance PDE solver requires the combined expertise of interdisciplinary teams with respect to application domain, numerical scheme and low-level optimization. In this paper, we present how the ExaHyPE engine facilitates the collaboration of such teams by isolating three roles: application, algorithms, and optimization expert. We thus support team members by letting them focus on their own area of expertise while integrating their contributions into an HPC production code. Inspired by web application development practices, ExaHyPE relies on two custom code generation modules, the Toolkit and the Kernel Generator, which follow a Model-View-Controller architectural pattern on top of the Jinja2 template engine library. Using Jinja2's templates to abstract the critical components of the engine and generated glue code, we isolate the application development from the engine. The template language also allows us to define and use custom template macros that isolate low-level optimizations from the numerical scheme described in the templates. We present three use cases, each focusing on one of our user roles, showcasing how the design of the code generation modules makes it easy to expand the solver schemes to support novel demands from applications, to add optimized algorithmic schemes (e.g., with reduced memory footprint), or to provide improved low-level SIMD vectorization support.

preprint, 2020, arXiv

TeaMPI -- Replication-based Resilience without the (Performance) Pain

In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication is often not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine -- a task-based solver for hyperbolic equation systems -- that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
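The task-sharing idea can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions, not TeaMPI's implementation: replicas take turns in lock step, and a published outcome is visible to all replicas immediately, whereas the abstract describes a weakly consistent, asynchronous model. All names are hypothetical.

```python
import random


def run_replicas(task_ids, compute, num_replicas=2, seed=0):
    """Simulate replicas that process the same tasks in shuffled order
    and skip any task whose outcome another replica already published."""
    task_ids = list(task_ids)
    rng = random.Random(seed)
    orders = []
    for _ in range(num_replicas):
        order = task_ids[:]
        rng.shuffle(order)          # each replica gets its own execution order
        orders.append(order)

    shared = {}                     # task id -> outcome, visible to all replicas
    computed = [0] * num_replicas   # tasks each replica executed itself
    positions = [0] * num_replicas

    while any(p < len(task_ids) for p in positions):
        for r in range(num_replicas):
            # Skip tasks whose outcome is already available.
            while positions[r] < len(task_ids) and orders[r][positions[r]] in shared:
                positions[r] += 1
            if positions[r] < len(task_ids):
                task = orders[r][positions[r]]
                shared[task] = compute(task)
                computed[r] += 1
                positions[r] += 1
    return shared, computed


outcomes, work = run_replicas(range(10), lambda t: t * t)
print(sum(work), len(outcomes))  # total executed tasks equals the task count
```

In this idealised setting every task is executed exactly once across all replicas, so each replica does only part of the work. With real message latency some tasks would be computed redundantly, which is why the abstract speaks of skipping "some" local computations.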

preprint, 2020, arXiv

Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes

We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE -- successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities, by introducing a hybrid Array-of-Structure-of-Array data layout that solves the data layout conflict between matrix multiplication kernels and the point-wise functions to implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating-point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
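The abstract does not spell out the kernel itself. As a generic illustration of sum factorization (not the ExaHyPE kernel), the sketch below applies the same small matrix A along each index of a 3D coefficient tensor, once as a single six-fold sum and once as three successive one-dimensional contractions. The function names and test data are hypothetical; integer data keeps the comparison exact.

```python
def apply_naive(A, u):
    """Apply A along all three indices at once: about n**6 terms in total."""
    r = range(len(A))
    return [[[sum(A[i][a] * A[j][b] * A[k][c] * u[a][b][c]
                  for a in r for b in r for c in r)
              for k in r] for j in r] for i in r]


def apply_sum_factorized(A, u):
    """Apply A one index at a time: three passes of about n**4 terms each."""
    r = range(len(A))
    # Contract the third index: t1[a][b][k] = sum_c A[k][c] * u[a][b][c]
    t1 = [[[sum(A[k][c] * u[a][b][c] for c in r) for k in r] for b in r] for a in r]
    # Contract the second index: t2[a][j][k] = sum_b A[j][b] * t1[a][b][k]
    t2 = [[[sum(A[j][b] * t1[a][b][k] for b in r) for k in r] for j in r] for a in r]
    # Contract the first index: out[i][j][k] = sum_a A[i][a] * t2[a][j][k]
    return [[[sum(A[i][a] * t2[a][j][k] for a in r) for k in r] for j in r] for i in r]


A = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
u = [[[a + 2 * b + 3 * c for c in range(3)] for b in range(3)] for a in range(3)]
print(apply_naive(A, u) == apply_sum_factorized(A, u))  # True
```

Each pass is a small matrix product over one index, and the intermediates t1 and t2 have the same size as the input tensor, so the combined operator over all three indices never has to be stored.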

preprint, 2019, arXiv

A High-Order Discontinuous Galerkin Solver with Dynamic Adaptive Mesh Refinement to Simulate Cloud Formation Processes

We present a high-order discontinuous Galerkin (DG) solver of the compressible Navier-Stokes equations for cloud formation processes. The scheme exploits an underlying parallelized implementation of the ADER-DG method with dynamic adaptive mesh refinement. We improve our method by a PDE-independent general refinement criterion, based on the local total variation of the numerical solution. While established methods use numerics tailored towards the specific simulation, our scheme is scenario-independent. Our generic scheme shows competitive results for both classical CFD and stratified scenarios. We focus on two-dimensional simulations of two bubble convection scenarios over a background atmosphere. The largest simulation here uses order 6 and 6561 cells, which were reduced to 1953 cells by our refinement criterion.
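The abstract names the local total variation as the refinement indicator but gives no formula or thresholds. A minimal sketch, assuming 1D cells with nodal values and thresholds expressed relative to the mean and standard deviation of the indicator over all cells, might look as follows. The threshold factors and function names are hypothetical.

```python
from statistics import mean, pstdev


def cell_total_variation(values):
    """Sum of absolute jumps between neighbouring nodal values in one cell."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def flag_cells(cells, refine_factor=1.0, coarsen_factor=-0.5):
    """Flag each cell as 'refine', 'coarsen' or 'keep'.

    Assumption: a cell is refined when its total variation exceeds
    mean + refine_factor * std, and coarsened when it falls below
    mean + coarsen_factor * std.
    """
    tvs = [cell_total_variation(c) for c in cells]
    mu, sigma = mean(tvs), pstdev(tvs)
    flags = []
    for tv in tvs:
        if tv > mu + refine_factor * sigma:
            flags.append("refine")
        elif tv < mu + coarsen_factor * sigma:
            flags.append("coarsen")
        else:
            flags.append("keep")
    return flags


# Hypothetical nodal values for four cells: flat, gentle, steep, flat.
cells = [[0.0, 0.0, 0.0], [0.0, 0.1, 0.2], [0.0, 1.0, 3.0], [3.0, 3.0, 3.0]]
print(flag_cells(cells))  # ['coarsen', 'keep', 'refine', 'coarsen']
```

The indicator uses only the numerical solution, never the governing equations, which is the sense in which such a criterion is PDE-independent.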

preprint, 2010, arXiv

Fast GPGPU Data Rearrangement Kernels using CUDA

Many high-performance computing algorithms are bandwidth limited, hence the need for optimal data rearrangement kernels as well as their easy integration into the rest of the application. In this work, we have built a CUDA library of fast kernels for a set of data rearrangement operations. In particular, we have built generic kernels for rearranging m-dimensional data into n dimensions, including Permute, Reorder, Interlace/De-interlace, etc. We have also built kernels for generic Stencil computations on two-dimensional data using templates and functors that allow application developers to rapidly build customized high-performance kernels. All the kernels built achieve or surpass best-known performance in terms of bandwidth utilization.
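The operations named in the abstract can be pinned down with a plain Python reference. This sketch only fixes the index mappings on flat row-major lists; it is not the library's CUDA code, and the function names and signatures are hypothetical.

```python
from itertools import product


def interlace(arrays):
    """[a0, a1, ...], [b0, b1, ...] -> [a0, b0, a1, b1, ...]"""
    return [x for group in zip(*arrays) for x in group]


def deinterlace(data, n):
    """Inverse of interlace for n equally long arrays."""
    return [data[i::n] for i in range(n)]


def permute(data, shape, order):
    """Reorder the axes of a flat row-major array.

    Output axis d corresponds to input axis order[d].
    Returns the permuted flat array and its new shape.
    """
    # Row-major strides of the input array.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    new_shape = [shape[d] for d in order]
    out = []
    for idx in product(*(range(s) for s in new_shape)):
        src = sum(idx[d] * strides[order[d]] for d in range(len(order)))
        out.append(data[src])
    return out, new_shape


print(interlace([[1, 2, 3], [10, 20, 30]]))        # [1, 10, 2, 20, 3, 30]
print(deinterlace([1, 10, 2, 20, 3, 30], 2))       # [[1, 2, 3], [10, 20, 30]]
print(permute([0, 1, 2, 3, 4, 5], (2, 3), (1, 0))) # ([0, 3, 1, 4, 2, 5], [3, 2])
```

Each output element is a pure copy from a computed source index, with no arithmetic on the data, which is consistent with the abstract's point that such kernels are limited by memory bandwidth.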
