Researcher profile

Eyke Hüllermeier

Eyke Hüllermeier contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
22works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

22 published item(s)

preprint2026arXiv

ConfoundingSHAP: Quantifying confounding strength in causal inference

In causal inference, confounders are variables that influence both treatment decisions and outcomes. However, unlike as in randomized clinical trials, the treatment assignment mechanism in observational studies is not known, and it is thus unclear which covariates act as confounders. Here, we aim to generate insight for causal inference and answer: which of the observed covariates act as confounders? We introduce ConfoundingSHAP, a Shapley-based method for attributing confounding strength to individual covariates. Our contributions are twofold. First, we propose a Shapley game targeted to infer the confounding strength of the covariates. Our resulting Shapley values differ from the standard applications of SHAP explanations on causal targets, such as understanding treatment effect heterogeneity, which are ill-suited for our task. Second, as our task requires evaluating the value function over many adjustment sets, we provide a scalable TabPFN-based estimation that avoids exhaustive refitting. We demonstrate the practical value across various datasets, where ConfoundingSHAP provides informative explanations of which observed covariates drive confounding and thereby helps to provide more insight for causal inference in practice.

preprint2026arXiv

Position: agentic AI orchestration should be Bayes-consistent

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.

preprint2025arXiv

Adjusted Count Quantification Learning on Graphs

Quantification learning is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify & Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not applicable to graph quantification problems. To address this issue, we propose structural importance sampling (SIS), the first graph quantification method that is applicable under (structural) covariate shift. Additionally, we propose Neighborhood-aware ACC, which improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.

preprint2025arXiv

Distribution Matching for Graph Quantification Under Structural Covariate Shift

Graphs are commonly used in machine learning to model relationships between instances. Consider the task of predicting the political preferences of users in a social network; to solve this task one should consider, both, the features of each individual user and the relationships between them. However, oftentimes one is not interested in the label of a single instance but rather in the distribution of labels over a set of instances; e.g., when predicting the political preferences of users, the overall prevalence of a given opinion might be of higher interest than the opinion of a specific person. This label prevalence estimation task is commonly referred to as quantification learning (QL). Current QL methods for tabular data are typically based on the so-called prior probability shift (PPS) assumption which states that the label-conditional instance distributions should remain equal across the training and test data. In the graph setting, PPS generally does not hold if the shift between training and test data is structural, i.e., if the training data comes from a different region of the graph than the test data. To address such structural shifts, an importance sampling variant of the popular adjusted count quantification approach has previously been proposed. In this work, we extend the idea of structural importance sampling to the state-of-the-art KDEy quantification approach. We show that our proposed method adapts to structural shifts and outperforms standard quantification approaches.

preprint2025arXiv

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.

preprint2023arXiv

Second-Order Uncertainty Quantification: Variance-Based Measures

Uncertainty quantification is a critical aspect of machine learning models, providing important insights into the reliability of predictions and aiding the decision-making process in real-world applications. This paper proposes a novel way to use variance-based measures to quantify uncertainty on the basis of second-order distributions in classification problems. A distinctive feature of the measures is the ability to reason about uncertainties on a class-based level, which is useful in situations where nuanced decision-making is required. Recalling some properties from the literature, we highlight that the variance-based measures satisfy important (axiomatic) properties. In addition to this axiomatic approach, we present empirical results showing the measures to be effective and competitive to commonly used entropy-based measures.

preprint2022arXiv

Conformal Prediction Intervals for Remaining Useful Lifetime Estimation

The main objective of Prognostics and Health Management is to estimate the Remaining Useful Lifetime (RUL), namely, the time that a system or a piece of equipment is still in working order before starting to function incorrectly. In recent years, numerous machine learning algorithms have been proposed for RUL estimation, mainly focusing on providing more accurate RUL predictions. However, there are many sources of uncertainty in the problem, such as inherent randomness of systems failure, lack of knowledge regarding their future states, and inaccuracy of the underlying predictive models, making it infeasible to predict the RULs precisely. Hence, it is of utmost importance to quantify the uncertainty alongside the RUL predictions. In this work, we investigate the conformal prediction (CP) framework that represents uncertainty by predicting sets of possible values for the target variable (intervals in the case of RUL) instead of making point predictions. Under very mild technical assumptions, CP formally guarantees that the actual value (true RUL) is covered by the predicted set with a degree of certainty that can be prespecified. We study three CP algorithms to conformalize any single-point RUL predictor and turn it into a valid interval predictor. Finally, we conformalize two single-point RUL predictors, deep convolutional neural networks and gradient boosting, and illustrate their performance on the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) data sets.

preprint2022arXiv

Non-Stationary Dueling Bandits

We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the $\mathrm{Beat\, the\, Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$. We further propose and analyze two meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\, Dueling\, Bandits}$ for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black-box algorithm. Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.

preprint2022arXiv

Set-valued prediction in hierarchical classification with constrained representation complexity

Set-valued prediction is a well-known concept in multi-class classification. When a classifier is uncertain about the class label for a test instance, it can predict a set of classes instead of a single class. In this paper, we focus on hierarchical multi-class classification problems, where valid sets (typically) correspond to internal nodes of the hierarchy. We argue that this is a very strong restriction, and we propose a relaxation by introducing the notion of representation complexity for a predicted set. In combination with probabilistic classifiers, this leads to a challenging inference problem for which specific combinatorial optimization algorithms are needed. We propose three methods and evaluate them on benchmark datasets: a naïve approach that is based on matrix-vector multiplication, a reformulation as a knapsack problem with conflict graph, and a recursive tree search method. Experimental results demonstrate that the last method is computationally more efficient than the other two approaches, due to a hierarchical factorization of the conditional class distribution.

preprint2021arXiv

iART: A Search Engine for Art-Historical Images to Support Research in the Humanities

In this paper, we introduce iART: an open Web platform for art-historical research that facilitates the process of comparative vision. The system integrates various machine learning techniques for keyword- and content-based image retrieval as well as category formation via clustering. An intuitive GUI supports users to define queries and explore results. By using a state-of-the-art cross-modal deep learning approach, it is possible to search for concepts that were not previously detected by trained classification models. Art-historical objects from large, openly licensed collections such as Amsterdam Rijksmuseum and Wikidata are made available to users.

preprint2020arXiv

A Novel Higher-order Weisfeiler-Lehman Graph Convolution

Current GNN architectures use a vertex neighborhood aggregation scheme, which limits their discriminative power to that of the 1-dimensional Weisfeiler-Lehman (WL) graph isomorphism test. Here, we propose a novel graph convolution operator that is based on the 2-dimensional WL test. We formally show that the resulting 2-WL-GNN architecture is more discriminative than existing GNN approaches. This theoretical result is complemented by experimental studies using synthetic and real data. On multiple common graph classification benchmarks, we demonstrate that the proposed model is competitive with state-of-the-art graph kernels and GNNs.

preprint2020arXiv

Aleatoric and Epistemic Uncertainty with Random Forests

Due to the steadily increasing relevance of machine learning for practical applications, many of which are coming with safety requirements, the notion of uncertainty has received increasing attention in machine learning research in the last couple of years. In particular, the idea of distinguishing between two important types of uncertainty, often refereed to as aleatoric and epistemic, has recently been studied in the setting of supervised learning. In this paper, we propose to quantify these uncertainties with random forests. More specifically, we show how two general approaches for measuring the learner's aleatoric and epistemic uncertainty in a prediction can be instantiated with decision trees and random forests as learning algorithms in a classification setting. In this regard, we also compare random forests with deep neural networks, which have been used for a similar purpose.

preprint2020arXiv

Efficient Set-Valued Prediction in Multi-Class Classification

In cases of uncertainty, a multi-class classifier preferably returns a set of candidate classes instead of predicting a single class label with little guarantee. More precisely, the classifier should strive for an optimal balance between the correctness (the true class is among the candidates) and the precision (the candidates are not too many) of its prediction. We formalize this problem within a general decision-theoretic framework that unifies most of the existing work in this area. In this framework, uncertainty is quantified in terms of conditional class probabilities, and the quality of a predicted set is measured in terms of a utility function. We then address the problem of finding the Bayes-optimal prediction, i.e., the subset of class labels with highest expected utility. For this problem, which is computationally challenging as there are exponentially (in the number of classes) many predictions to choose from, we propose efficient algorithms that can be applied to a broad family of utility functions. Our theoretical results are complemented by experimental studies, in which we analyze the proposed algorithms in terms of predictive accuracy and runtime efficiency.

preprint2020arXiv

Learning Choice Functions via Pareto-Embeddings

We consider the problem of learning to choose from a given set of objects, where each object is represented by a feature vector. Traditional approaches in choice modelling are mainly based on learning a latent, real-valued utility function, thereby inducing a linear order on choice alternatives. While this approach is suitable for discrete (top-1) choices, it is not straightforward how to use it for subset choices. Instead of mapping choice alternatives to the real number line, we propose to embed them into a higher-dimensional utility space, in which we identify choice sets with Pareto-optimal points. To this end, we propose a learning algorithm that minimizes a differentiable loss function suitable for this task. We demonstrate the feasibility of learning a Pareto-embedding on a suite of benchmark datasets.

preprint2020arXiv

Learning Gradient Boosted Multi-label Classification Rules

In multi-label classification, where the evaluation of predictions is less straightforward than in single-label classification, various meaningful, though different, loss functions have been proposed. Ideally, the learning algorithm should be customizable towards a specific choice of the performance measure. Modern implementations of boosting, most prominently gradient boosted decision trees, appear to be appealing from this point of view. However, they are mostly limited to single-label classification, and hence not amenable to multi-label losses unless these are label-wise decomposable. In this work, we develop a generalization of the gradient boosting framework to multi-output problems and propose an algorithm for learning multi-label classification rules that is able to minimize decomposable as well as non-decomposable loss functions. Using the well-known Hamming loss and subset 0/1 loss as representatives, we analyze the abilities and limitations of our approach on synthetic data and evaluate its predictive performance on multi-label benchmarks.

preprint2020arXiv

Learning Tversky Similarity

In this paper, we advocate Tversky's ratio model as an appropriate basis for computational approaches to semantic similarity, that is, the comparison of objects such as images in a semantically meaningful way. We consider the problem of learning Tversky similarity measures from suitable training data indicating whether two objects tend to be similar or dissimilar. Experimentally, we evaluate our approach to similarity learning on two image datasets, showing that is performs very well compared to existing methods.

preprint2020arXiv

Machine Learning with the Sugeno Integral: The Case of Binary Classification

In this paper, we elaborate on the use of the Sugeno integral in the context of machine learning. More specifically, we propose a method for binary classification, in which the Sugeno integral is used as an aggregation function that combines several local evaluations of an instance, pertaining to different features or measurements, into a single global evaluation. Due to the specific nature of the Sugeno integral, this approach is especially suitable for learning from ordinal data, that is, when measurements are taken from ordinal scales. This is a topic that has not received much attention in machine learning so far. The core of the learning problem itself consists of identifying the capacity underlying the Sugeno integral. To tackle this problem, we develop an algorithm based on linear programming. The algorithm also includes a suitable technique for transforming the original feature values into local evaluations (local utility scores), as well as a method for tuning a threshold on the global evaluation. To control the flexibility of the classifier and mitigate the problem of overfitting the training data, we generalize our approach toward $k$-maxitive capacities, where $k$ plays the role of a hyper-parameter of the learner. We present experimental studies, in which we compare our method with competing approaches on several benchmark data sets.

preprint2020arXiv

Online Preselection with Context Information under the Plackett-Luce Model

We consider an extension of the contextual multi-armed bandit problem, in which, instead of selecting a single alternative (arm), a learner is supposed to make a preselection in the form of a subset of alternatives. More specifically, in each iteration, the learner is presented a set of arms and a context, both described in terms of feature vectors. The task of the learner is to preselect $k$ of these arms, among which a final choice is made in a second step. In our setup, we assume that each arm has a latent (context-dependent) utility, and that feedback on a preselection is produced according to a Plackett-Luce model. We propose the CPPL algorithm, which is inspired by the well-known UCB algorithm, and evaluate this algorithm on synthetic and real data. In particular, we consider an online algorithm selection scenario, which served as a main motivation of our problem setting. Here, an instance (which defines the context) from a certain problem class (such as SAT) can be solved by different algorithms (the arms), but only $k$ of these algorithms can actually be run.

preprint2020arXiv

Reliable Multi-label Classification: Prediction with Partial Abstention

In contrast to conventional (single-label) classification, the setting of multilabel classification (MLC) allows an instance to belong to several classes simultaneously. Thus, instead of selecting a single class label, predictions take the form of a subset of all labels. In this paper, we study an extension of the setting of MLC, in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels. We propose a formalization of MLC with abstention in terms of a generalized loss minimization problem and present first results for the case of the Hamming loss, rank loss, and F-measure, both theoretical and experimental.

preprint2020arXiv

Run2Survive: A Decision-theoretic Approach to Algorithm Selection based on Survival Analysis

Algorithm selection (AS) deals with the automatic selection of an algorithm from a fixed set of candidate algorithms most suitable for a specific instance of an algorithmic problem class, where "suitability" often refers to an algorithm's runtime. Due to possibly extremely long runtimes of candidate algorithms, training data for algorithm selection models is usually generated under time constraints in the sense that not all algorithms are run to completion on all instances. Thus, training data usually comprises censored information, as the true runtime of algorithms timed out remains unknown. However, many standard AS approaches are not able to handle such information in a proper way. On the other side, survival analysis (SA) naturally supports censored data and offers appropriate ways to use such data for learning distributional models of algorithm runtime, as we demonstrate in this work. We leverage such models as a basis of a sophisticated decision-theoretic approach to algorithm selection, which we dub Run2Survive. Moreover, taking advantage of a framework of this kind, we advocate a risk-averse approach to algorithm selection, in which the avoidance of a timeout is given high priority. In an extensive experimental study with the standard benchmark ASlib, our approach is shown to be highly competitive and in many cases even superior to state-of-the-art AS approaches.

preprint2020arXiv

Towards a Scalable and Flexible Simulation and Testing Environment Toolbox for Intelligent Microgrid Control

Micro- and smart grids (MSG) play an important role both for integrating renewable energy sources in conventional electricity grids and for providing power supply in remote areas. Modern MSGs are largely driven by power electronic converters due to their high efficiency and flexibility. Nevertheless, controlling MSGs is a challenging task due to highest requirements on energy availability, safety and voltage quality within a wide range of different MSG topologies. This results in a high demand for comprehensive testing of new control concepts during their development phase and comparisons with the state of the art in order to ensure their feasibility. This applies in particular to data-driven control approaches from the field of reinforcement learning (RL), whose stability and operating behavior can hardly be evaluated a priori. Therefore, the OpenModelica Microgrid Gym (OMG) package, an open-source software toolbox for the simulation and control optimization of MSGs, is proposed. It is capable of modeling and simulating arbitrary MSG topologies and offers a Python-based interface for plug \& play controller testing. In particular, the standardized OpenAI Gym interface allows for easy RL-based controller integration. Besides the presentation of the OMG toolbox, application examples are highlighted including safe Bayesian optimization for low-level controller tuning.

preprint2020arXiv

Towards Analogy-Based Explanations in Machine Learning

Principles of analogical reasoning have recently been applied in the context of machine learning, for example to develop new methods for classification and preference learning. In this paper, we argue that, while analogical reasoning is certainly useful for constructing new learning algorithms with high predictive accuracy, is is arguably not less interesting from an interpretability and explainability point of view. More specifically, we take the view that an analogy-based approach is a viable alternative to existing approaches in the realm of explainable AI and interpretable machine learning, and that analogy-based explanations of the predictions produced by a machine learning algorithm can complement similarity-based explanations in a meaningful way. To corroborate these claims, we outline the basic idea of an analogy-based explanation and illustrate its potential usefulness by means of some examples.