Source author record

Michael Uder

Michael Uder appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computation and Language, eess.AS, Machine Learning, physics.med-ph, Sound

Catalog footprint

What is connected

3 works

6 topics

4 close collaborators

preprint · 2026 · arXiv

Safety and accuracy follow different scaling laws in clinical large language models

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
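As a rough illustration of how option-level labels can yield separate accuracy and safety rates, here is a minimal Python sketch. The `OptionLabels` class, its field names, and the choice to divide every count by the total number of questions are assumptions made for this example; they are not taken from SaFE-Scale or RadSaFE-200, which may define the metrics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionLabels:
    """Hypothetical labels attached to the answer option a model selected."""
    correct: bool
    high_risk: bool
    contradicts_evidence: bool


def summarize(chosen: list[OptionLabels]) -> dict[str, float]:
    """Return accuracy and safety rates over the selected options.

    Assumption: every rate uses the total number of questions as denominator.
    """
    n = len(chosen)
    if n == 0:
        raise ValueError("need at least one answered question")
    return {
        "accuracy": sum(o.correct for o in chosen) / n,
        "high_risk_error": sum(o.high_risk for o in chosen) / n,
        "contradiction": sum(o.contradicts_evidence for o in chosen) / n,
    }


# Made-up answers for four questions: three correct, one unsafe.
answers = [
    OptionLabels(correct=True, high_risk=False, contradicts_evidence=False),
    OptionLabels(correct=True, high_risk=False, contradicts_evidence=False),
    OptionLabels(correct=True, high_risk=False, contradicts_evidence=False),
    OptionLabels(correct=False, high_risk=True, contradicts_evidence=True),
]
report = summarize(answers)
```

Tracking the rates separately is what allows accuracy and safety to diverge: two systems with the same accuracy can differ in how many of their wrong answers are high-risk.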

preprint · 2022 · arXiv

PoCaP Corpus: A Multimodal Dataset for Smart Operating Room Speech Assistant using Interventional Radiology Workflow Analysis

This paper presents a new multimodal interventional radiology dataset, called the PoCaP (Port Catheter Placement) Corpus. The corpus consists of speech and audio signals in German, X-ray images, and system commands collected from 31 PoCaP interventions performed by six surgeons, with an average duration of 81.4 ± 41.0 minutes. The corpus aims to provide a resource for developing a smart speech assistant in operating rooms. In particular, it may be used to develop a speech-controlled system that enables surgeons to control operation parameters such as C-arm movements and table positions. To record the dataset, we obtained consent from the institutional review board and the workers' council of the University Hospital Erlangen, and from the patients with regard to data privacy. We describe the recording set-up, data structure, workflow and preprocessing steps, and report the first PoCaP Corpus speech recognition analysis results, with an 11.52% word error rate using pretrained models. The findings suggest that the data has the potential to build a robust command recognition system and will allow the development of novel intervention support systems using speech and image processing in the medical domain.
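For readers unfamiliar with the word error rate reported above, the following Python sketch shows the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the German example phrases are hypothetical and are not drawn from the PoCaP Corpus; the paper's own evaluation pipeline is not described in this abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical command: one substitution plus one deletion over four words.
wer = word_error_rate("tisch nach oben fahren", "tisch nach unten")
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.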

preprint · 2020 · arXiv

Contrast-to-noise ratio analysis of microscopic diffusion anisotropy indices in q-space trajectory imaging

Diffusion anisotropy in diffusion tensor imaging (DTI) is commonly quantified with normalized diffusion anisotropy indices (DAIs). Most often, the fractional anisotropy (FA) is used, but several alternative DAIs have been introduced in attempts to maximize the contrast-to-noise ratio (CNR) in diffusion anisotropy maps. Examples include the scaled relative anisotropy (sRA), the gamma variate anisotropy index (GV), the surface anisotropy (UAsurf), and the lattice index (LI). With the advent of multidimensional diffusion encoding it became possible to determine the presence of microscopic diffusion anisotropy in a voxel, which is theoretically independent of orientation coherence. In accordance with DTI, the microscopic anisotropy is typically quantified by the microscopic fractional anisotropy (uFA). In this work, in addition to the uFA, the four microscopic diffusion anisotropy indices (uDAIs) usRA, uGV, uUAsurf, and uLI are defined in analogy to the respective DAIs by means of the average diffusion tensor and the covariance tensor. Simulations with three representative distributions of microscopic diffusion tensors revealed distinct CNR differences when differentiating between isotropic and microscopically anisotropic diffusion. q-Space trajectory imaging (QTI) was employed to acquire brain in-vivo maps of all indices. For this purpose, a 15 min protocol featuring linear, planar, and spherical tensor encoding was used. The resulting maps were of good quality and exhibited different contrasts, e.g. between gray and white matter. This indicates that it may be beneficial to use more than one uDAI in future investigational studies.
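The fractional anisotropy mentioned above has a standard closed form in terms of the three eigenvalues of the diffusion tensor: FA = sqrt(3/2) * sqrt(sum_i (l_i - mean)^2) / sqrt(sum_i l_i^2). The Python sketch below evaluates that textbook DTI formula only. The function name is hypothetical, returning 0 for an all-zero tensor is a convention chosen for this example, and the microscopic indices defined in the paper (which also involve the covariance tensor) are not reproduced here.

```python
import math
from typing import Sequence


def fractional_anisotropy(eigenvalues: Sequence[float]) -> float:
    """Fractional anisotropy from the three diffusion tensor eigenvalues."""
    if len(eigenvalues) != 3:
        raise ValueError("expected exactly three eigenvalues")
    mean = sum(eigenvalues) / 3.0
    squared_deviation = sum((lam - mean) ** 2 for lam in eigenvalues)
    squared_norm = sum(lam ** 2 for lam in eigenvalues)
    if squared_norm == 0.0:
        # Assumption for this sketch: an all-zero tensor is treated as isotropic.
        return 0.0
    return math.sqrt(1.5 * squared_deviation / squared_norm)


fa_isotropic = fractional_anisotropy((1.0, 1.0, 1.0))  # equal diffusion in all directions
fa_stick = fractional_anisotropy((1.0, 0.0, 0.0))      # diffusion along a single axis
```

The two example tensors mark the ends of the range: an isotropic tensor gives 0 and a tensor with a single nonzero eigenvalue gives 1.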