Source author record

Harald Köstler

Harald Köstler appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Artificial Intelligence Machine Learning physics.comp-ph Computational Engineering, Finance, and Science Mathematical Software Performance Computation and Language Computer Vision math.NA Neural and Evolutionary Computing Numerical Analysis physics.med-ph Programming Languages

Catalog footprint

What is connected

13works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free $c_{KV}$ and a 64-dim $k_r$ correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $δ$-rotation rule for $k_r$. We evaluate three native MLA-MoE deployments - DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B) - with output-consistency on all three and recovery measured on the two endpoints; Irminsul recovers up to ~83% of prompt tokens above exact-prefix on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.

preprint2026arXiv

Safety and accuracy follow different scaling laws in clinical large language models

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

preprint2025arXiv

Multi-step retrieval and reasoning improves radiology question answering with large language models

Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.

preprint2022arXiv

Evolving Generalizable Multigrid-Based Helmholtz Preconditioners with Grammar-Guided Genetic Programming

Solving the indefinite Helmholtz equation is not only crucial for the understanding of many physical phenomena but also represents an outstandingly-difficult benchmark problem for the successful application of numerical methods. Here we introduce a new approach for evolving efficient preconditioned iterative solvers for Helmholtz problems with multi-objective grammar-guided genetic programming. Our approach is based on a novel context-free grammar, which enables the construction of multigrid preconditioners that employ a tailored sequence of operations on each discretization level. To find solvers that generalize well over the given domain, we propose a custom method of successive problem difficulty adaption, in which we evaluate a preconditioner's efficiency on increasingly ill-conditioned problem instances. We demonstrate our approach's effectiveness by evolving multigrid-based preconditioners for a two-dimensional indefinite Helmholtz problem that outperform several human-designed methods for different wavenumbers up to systems of linear equations with more than a million unknowns.

preprint2022arXiv

MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

Proxy-apps, or mini-apps, are simple self-contained benchmark codes with performance-relevant kernels extracted from real applications. Initially used to facilitate software-hardware co-design, they are a crucial ingredient for serious performance engineering, especially when dealing with large-scale production codes. MD-Bench is a new proxy-app in the area of classical short-range molecular dynamics. In contrast to existing proxy-apps in MD (e.g. miniMD and coMD) it does not resemble a single application code, but implements state-of-the art algorithms from multiple applications (currently LAMMPS and GROMACS). The MD-Bench source code is understandable, extensible and suited for teaching, benchmarking and researching MD algorithms. Primary design goals are transparency and simplicity, a developer is able to tinker with the source code down to the assembly level. This paper introduces MD-Bench, explains its design and structure, covers implemented optimization variants, and illustrates its usage on three examples.

preprint2021arXiv

Deep Learning for Real-Time Aerodynamic Evaluations of Arbitrary Vehicle Shapes

The aerodynamic optimization process of cars requires multiple iterations between aerodynamicists and stylists. Response Surface Modeling and Reduced-Order Modeling are commonly used to eliminate the overhead due to Computational Fluid Dynamics, leading to faster iterations. However, a primary drawback of these models is that they can work only on the parametrized geometric features they were trained with. This study evaluates if deep learning models can predict the drag coefficient for an arbitrary input geometry without explicit parameterization. We use two similar data sets based on the publicly available DrivAer geometry for training. We use a modified U-Net architecture that uses Signed Distance Fields to represent the input geometries. Our models outperform the existing models by at least 11% in prediction accuracy for the drag coefficient. We achieved this improvement by combining multiple data sets that were created using different parameterizations, which is not possible with the methods currently used. We have also shown that it is possible to predict velocity fields and drag coefficient concurrently and that simple data augmentation methods can improve the results. In addition, we have performed an occlusion sensitivity study on our models to understand what information is used to make predictions. From the occlusion sensitivity study, we showed that the models were able to identify the geometric features and have discovered that the model has learned to exploit the global information present in the SDF. In contrast to the currently operational method, where design changes are restricted to the initially defined parameters, this study brings surrogate models one step closer to the long-term goal of having a model that can be used for approximate aerodynamic evaluation of unseen, arbitrary vehicle shapes, thereby providing complete design freedom to the vehicle stylists.

preprint2021arXiv

Known Operator Learning and Hybrid Machine Learning in Medical Imaging -- A Review of the Past, the Present, and the Future

In this article, we perform a review of the state-of-the-art of hybrid machine learning in medical imaging. We start with a short summary of the general developments of the past in machine learning and how general and specialized approaches have been in competition in the past decades. A particular focus will be the theoretical and experimental evidence pro and contra hybrid modelling. Next, we inspect several new developments regarding hybrid machine learning with a particular focus on so-called known operator learning and how hybrid approaches gain more and more momentum across essentially all applications in medical imaging and medical image analysis. As we will point out by numerous examples, hybrid models are taking over in image reconstruction and analysis. Even domains such as physical simulation and scanner and acquisition design are being addressed using machine learning grey box modelling approaches. Towards the end of the article, we will investigate a few future directions and point out relevant areas in which hybrid modelling, meta learning, and other domains will likely be able to drive the state-of-the-art ahead.

preprint2020arXiv

lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods

Lattice Boltzmann methods are a popular mesoscopic alternative to macroscopic computational fluid dynamics solvers. Many variants have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, or non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different methods and provides a generic development environment for new schemes as well. A high-level domain-specific language allows the user to formulate, extend and test various lattice Boltzmann schemes. The method specification is represented in a symbolic intermediate representation. Transformations that operate on this intermediate representation optimize and parallelize the method, yielding highly efficient lattice Boltzmann compute kernels not only for single- and two-relaxation-time schemes but also for multi-relaxation-time, cumulant, and entropically stabilized methods. An integration into the HPC framework waLBerla makes massively parallel, distributed simulations possible, which is demonstrated through scaling experiments on the SuperMUC-NG supercomputing system

preprint2020arXiv

tinyMD: A Portable and Scalable Implementation for Pairwise Interactions Simulations

This paper investigates the suitability of the AnyDSL partial evaluation framework to implement tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application that scales very well on parallel supercomputers. We discuss the differences between both implementations and contrast miniMD's performance for single-node CPU and GPU targets, as well as its scalability on SuperMUC-NG and Piz Daint supercomputers. Additionaly, we demonstrate tinyMD's flexibility by coupling it with the waLBerla multi-physics framework. This allow us to execute tinyMD simulations using the load-balancing mechanism implemented in waLBerla.

preprint2019arXiv

waLBerla: A block-structured high-performance framework for multiphysics simulations

Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building blocks for developing simulations on block-structured grids. The block-structured domain partitioning is flexible enough to handle complex geometries, while the structured grid within each block allows for highly efficient implementations of stencil-based algorithms. We present several example applications realized with waLBerla, ranging from lattice Boltzmann methods to rigid particle simulations. Most importantly, these methods can be coupled together, enabling multiphysics simulations. The framework uses meta-programming techniques to generate highly efficient code for CPUs and GPUs from a symbolic method formulation. To ensure software quality and performance portability, a continuous integration toolchain automatically runs an extensive test suite encompassing multiple compilers, hardware architectures, and software configurations.

preprint2015arXiv

A Python Extension for the Massively Parallel Multiphysics Simulation Framework waLBerla

We present a Python extension to the massively parallel HPC simulation toolkit waLBerla. waLBerla is a framework for stencil based algorithms operating on block-structured grids, with the main application field being fluid simulations in complex geometries using the lattice Boltzmann method. Careful performance engineering results in excellent node performance and good scalability to over 400,000 cores. To increase the usability and flexibility of the framework, a Python interface was developed. Python extensions are used at all stages of the simulation pipeline: They simplify and automate scenario setup, evaluation, and plotting. We show how our Python interface outperforms the existing text-file-based configuration mechanism, providing features like automatic nondimensionalization of physical quantities and handling of complex parameter dependencies. Furthermore, Python is used to process and evaluate results while the simulation is running, leading to smaller output files and the possibility to adjust parameters dependent on the current simulation state. C++ data structures are exported such that a seamless interfacing to other numerical Python libraries is possible. The expressive power of Python and the performance of C++ make development of efficient code with low time effort possible.

preprint2015arXiv

Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification

Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute intensive due to a temperature dependent diffusive concentration. We significantly extend previous simulations that have used simpler phase-field models or were performed on smaller domain sizes. The new method has been implemented within the massively parallel HPC framework waLBerla that is designed to exploit current supercomputers efficiently. We apply various optimization techniques, including buffering techniques, explicit SIMD kernel vectorization, and communication hiding. Simulations utilizing up to 262,144 cores have been run on three different supercomputing architectures and weak scalability results are shown. Additionally, a hierarchical, mesh-based data reduction strategy is developed to keep the I/O problem manageable at scale.

preprint2011arXiv

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results

GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.

Harald Köstler

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Safety and accuracy follow different scaling laws in clinical large language models

Multi-step retrieval and reasoning improves radiology question answering with large language models

Evolving Generalizable Multigrid-Based Helmholtz Preconditioners with Grammar-Guided Genetic Programming

MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

Deep Learning for Real-Time Aerodynamic Evaluations of Arbitrary Vehicle Shapes

Known Operator Learning and Hybrid Machine Learning in Medical Imaging -- A Review of the Past, the Present, and the Future

lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods

tinyMD: A Portable and Scalable Implementation for Pairwise Interactions Simulations

waLBerla: A block-structured high-performance framework for multiphysics simulations

A Python Extension for the Massively Parallel Multiphysics Simulation Framework waLBerla

Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results