Source author record

Kai Keller

Kai Keller appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

2 works

1 topic

4 close collaborators

Preprint | 2023 | arXiv

A Framework for Large Scale Particle Filters Validated with Data Assimilation for Weather Simulation

Particle filters are a group of algorithms to solve inverse problems through statistical Bayesian methods when the model does not comply with the linear and Gaussian hypothesis. Particle filters are used in domains like data assimilation, probabilistic programming, neural network optimization, localization and navigation. Particle filters estimate the probability distribution of model states by running a large number of model instances, the so-called particles. The ability to handle a very large number of particles is critical for high dimensional models. This paper proposes a novel paradigm to run very large ensembles of parallel model instances on supercomputers. The approach combines an elastic and fault tolerant runner/server model minimizing data movements while enabling dynamic load balancing. Particle weights are computed locally on each runner and transmitted when available to a server that normalizes them, resamples new particles based on their weight, and redistributes dynamically the work to runners to react to load imbalance. Our approach relies on an asynchronously managed distributed particle cache permitting particles to move from one runner to another in the background while particle propagation goes on. This also enables the number of runners to vary during the execution either in reaction to failures and restarts, or to adapt to changing resource availability dictated by external decision processes. The approach is experimented with the Weather Research and Forecasting (WRF) model, to assess its performance for probabilistic weather forecasting. Up to 2555 particles on 20442 compute cores are used to assimilate cloud cover observations into short-range weather forecasts over Europe.
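The server-side step described in this abstract (normalize the weights received from runners, resample particles by weight, hand work back to runners) can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names are hypothetical, multinomial resampling is assumed as one common choice, and the round-robin assignment is only a stand-in for the dynamic load balancing the abstract mentions.

```python
import math
import random


def normalize_weights(log_weights):
    """Turn unnormalized log-weights into probabilities that sum to 1."""
    m = max(log_weights)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def resample(weights, rng):
    """Multinomial resampling (assumed scheme): draw one parent index
    per particle, with probability proportional to its weight."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


def assign_to_runners(parents, n_runners):
    """Hypothetical placement policy: spread resampled particles over
    runners round-robin. A real scheduler would react to runner load."""
    plan = {r: [] for r in range(n_runners)}
    for i, parent in enumerate(parents):
        plan[i % n_runners].append(parent)
    return plan


if __name__ == "__main__":
    rng = random.Random(0)
    # Log-weights as they might arrive from runners (made-up values).
    log_w = [-1.0, -0.5, -3.0, -0.2]
    weights = normalize_weights(log_w)
    parents = resample(weights, rng)
    print(assign_to_runners(parents, n_runners=2))
```

High-weight particles tend to be drawn several times and low-weight ones dropped, which is why particle state has to be able to move between runners after each resampling step.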

Preprint | 2020 | arXiv

Extending the OpenCHK Model with Advanced Checkpoint Features

One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance. There are many solutions implementing CR at the application level. They all provide advanced I/O capabilities to minimize the overhead introduced by CR. Nevertheless, there is still room for improvement in terms of programmability and flexibility, because end-users must manually serialize and deserialize application state using low-level APIs, modify the flow of the application to consider restarts, or rewrite CR code whenever the backend library changes. In this work, we propose a set of compiler directives and clauses that allow users to specify CR operations in a simple way. Our approach supports the common CR features provided by all the CR libraries. However, it can also be extended to support advanced features that are only available in some CR libraries, such as differential checkpointing, the use of HDF5 format, and the possibility of using fault-tolerance-dedicated threads. The result of our evaluation revealed a high increase in programmability. On average, we reduced the number of lines of code by 71%, 94%, and 64% for FTI, SCR, and VeloC, respectively, and no additional overhead was perceived using our solution compared to using the backend libraries directly. Finally, portability is enhanced because our programming model allows the use of any backend library without changing any code.
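To make the programmability argument concrete, the sketch below shows the kind of hand-written application-level checkpoint/restart logic the abstract refers to: manual serialization, a restart check at startup, and periodic stores inside the main loop. It is a generic illustration using only the Python standard library. It does not use or reproduce the API of FTI, SCR, VeloC, or the OpenCHK directives, and every name in it is hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it
    over the target so a crash never leaves a half-written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.
    `fail_at` injects a simulated fault at that step."""
    # The restart logic is interleaved with the application flow.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Even in this toy case, the checkpoint code is entangled with the computation and tied to one storage format, which is the burden the abstract's directive-based approach aims to remove.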