Source author record

Benjamin Hazelwood

Benjamin Hazelwood appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Distributed, Parallel, and Cluster Computing; Mathematical Software; Performance

Catalog footprint

What is connected

2 works

3 topics

4 close collaborators

Preprint, 2020, arXiv

Enclave Tasking for Discontinuous Galerkin Methods on Dynamically Adaptive Meshes

High-order Discontinuous Galerkin (DG) methods promise to be an excellent discretisation paradigm for partial differential equation solvers by combining high arithmetic intensity with localised data access. They also facilitate dynamic adaptivity without the need for conformal meshes. A parallel evaluation of DG's weak formulation within a mesh traversal is non-trivial, as dependency graphs over dynamically adaptive meshes change, as causal constraints along resolution transitions have to be preserved, and as data sends along MPI domain boundaries have to be triggered in the correct order. We propose to process mesh elements subject to constraints with high priority or, where needed, serially throughout a traversal. The remaining cells form enclaves and are spawned into a task system. This introduces concurrency, mixes memory-intensive DG integrations with compute-bound Riemann solves, and overlaps computation and communication. We discuss implications on MPI and show that MPI parallelisation improves by a factor of three through enclave tasking, while we obtain an additional factor of two from shared memory if grids are dynamically adaptive.
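The traversal idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Cell` class, its two boolean flags, and the `update` placeholder are hypothetical names introduced only for illustration, and a thread pool stands in for the task system. Cells subject to constraints are handled immediately and in traversal order, while all other (enclave) cells are spawned as background tasks.

```python
from concurrent.futures import ThreadPoolExecutor


class Cell:
    """Hypothetical mesh cell; the two flags mark cells subject to constraints."""

    def __init__(self, cell_id, value, at_mpi_boundary=False,
                 at_resolution_transition=False):
        self.cell_id = cell_id
        self.value = value
        self.at_mpi_boundary = at_mpi_boundary
        self.at_resolution_transition = at_resolution_transition


def is_constrained(cell):
    # Assumption: a cell is constrained if it touches a domain boundary
    # or a resolution transition.
    return cell.at_mpi_boundary or cell.at_resolution_transition


def update(cell):
    # Placeholder for the real per-cell work (integration, Riemann solve).
    return cell.cell_id, cell.value * 2.0


def traverse(cells, workers=4):
    """Walk the cells once; return all results and the serial processing order."""
    results = {}
    serial_order = []
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cell in cells:
            if is_constrained(cell):
                # Constrained cells run right away, in traversal order.
                cell_id, value = update(cell)
                results[cell_id] = value
                serial_order.append(cell_id)
            else:
                # Enclave cells become tasks that overlap with the traversal.
                futures.append(pool.submit(update, cell))
        # Collect the enclave results once the traversal has finished.
        for future in futures:
            cell_id, value = future.result()
            results[cell_id] = value
    return results, serial_order
```

In this sketch the constrained cells keep their original relative order, which is the property the abstract requires for causal constraints and for ordered data sends; the enclave tasks may complete in any order.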

Preprint, 2020, arXiv

TeaMPI -- Replication-based Resilience without the (Performance) Pain

In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication is often not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine -- a task-based solver for hyperbolic equation systems -- that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
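The outcome-sharing idea can be illustrated with a small single-process Python simulation. This is a sketch under simplifying assumptions, not the TeaMPI implementation or its API: `run_replicas` and `compute` are hypothetical names, a shared dictionary stands in for outcomes exchanged via messages, sharing is treated as instantaneous, and heartbeats and failures are not modelled.

```python
import random


def compute(task_id):
    # Placeholder for an expensive task; any deterministic function will do.
    return task_id * task_id


def run_replicas(task_ids, num_replicas=2, seed=0):
    """Simulate replicas that run the same tasks in different orders."""
    rng = random.Random(seed)
    # Each replica shuffles the task order independently.
    orders = []
    for _ in range(num_replicas):
        order = list(task_ids)
        rng.shuffle(order)
        orders.append(order)

    shared = {}  # assumption: stand-in for outcomes sent between replicas
    local = [dict() for _ in range(num_replicas)]
    computed = [0] * num_replicas  # tasks each replica really computed

    # Interleave the replicas one task at a time to mimic concurrent progress.
    for step in range(len(orders[0])):
        for replica in range(num_replicas):
            task = orders[replica][step]
            if task in shared:
                # Another replica already finished this task: reuse its outcome.
                local[replica][task] = shared[task]
            else:
                local[replica][task] = compute(task)
                shared[task] = local[replica][task]
                computed[replica] += 1
    return local, computed
```

Because the sketch shares outcomes instantly, every task is computed exactly once across all replicas. In a real run the exchange is asynchronous, so some tasks would still be computed redundantly, and each replica must remain able to compute every task itself if its partner fails.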
