Source author record

Harald Binder

Harald Binder appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Machine Learning Applications Artificial Intelligence Computation Cryptography and Security cs.CY Genomics

Catalog footprint

What is connected

8works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Eliciting associations between clinical variables from LLMs via comparison questions across populations

The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between patient characteristics, as a key building block for medical decision making. To avoid the pitfalls of direct elicitation, we propose an approach based on structured comparison questions, specifically patient comparison triplet questions. This is combined with a statistical model for the LLM representation that provides estimates of correlations without access to activations or model internals. Intuitively, we consider how similarity decisions of LLMs based on a first variable are affected by providing information on a second variable for one of the patients being assessed. We then induce prompt-level environment shifts to obtain correlation estimates for different subpopulations, which enables an invariant causal prediction (ICP) approach to obtain conservative candidate parent links. We demonstrate the method in two clinical domains, chronic obstructive pulmonary disease (COPD) and multiple sclerosis (MS). Across prompted environments, the elicited correlations are smooth, stable, and clinically interpretable, yet vary in a statistically significant way that supports downstream invariance testing, such that ICP provides a small set of candidate invariant parent links. These results show that indirect elicitation via triplet comparisons can recover meaningful association structure from LLMs and offer a cautious route from implicit correlations to causal statements that are congruent with LLM answering patterns.

preprint2020arXiv

Deep generative models in DataSHIELD

The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. The DataSHIELD software provides an infrastructure and a set of statistical methods for joint analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. We use deep Boltzmann machines (DBMs) as generative models for capturing the distribution of data. For the implementation, we employ the package "BoltzmannMachines" from the Julia programming language and wrap it for use with DataSHIELD, which is based on R. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e. g. for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.

preprint2020arXiv

Is there a role for statistics in artificial intelligence?

The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also deals with the equally necessary and meaningful extension of curricula in schools and universities.

preprint2020arXiv

Transfer learning of regression models from a sequence of datasets by penalized estimation

Transfer learning refers to the promising idea of initializing model fits based on pre-training on other data. We particularly consider regression modeling settings where parameter estimates from previous data can be used as anchoring points, yet may not be available for all parameters, thus covariance information cannot be reused. A procedure that updates through targeted penalized estimation, which shrinks the estimator towards a nonzero value, is presented. The parameter estimate from the previous data serves as this nonzero value when an update is sought from novel data. This naturally extends to a sequence of data sets with the same response, but potentially only partial overlap in covariates. The iteratively updated regression parameter estimator is shown to be asymptotically unbiased and consistent. The penalty parameter is chosen through constrained cross-validated loglikelihood optimization. The constraint bounds the amount of shrinkage of the updated estimator toward the current one from below. The bound aims to preserve the (updated) estimator's goodness-of-fit on all-but-the-novel data. The proposed approach is compared to other regression modeling procedures. Finally, it is illustrated on an epidemiological study where the data arrive in batches with different covariate-availability and the model is re-fitted with the availability of a novel batch.

preprint2019arXiv

Netboost: Boosting-supported network analysis improves high-dimensional omics prediction in acute myeloid leukemia and Huntington's disease

Background: State-of-the art selection methods fail to identify weak but cumulative effects of features found in many high-dimensional omics datasets. Nevertheless, these features play an important role in certain diseases. Results: We present Netboost, a three-step dimension reduction technique. First, a boosting-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied on the selected edges to identify modules and finally module information is aggregated by the first principal components. The primary analysis is than carried out on these summary measures instead of the original data. We demonstrate the application of the newly developed Netboost in combination with CoxBoost for survival prediction of DNA methylation and gene expression data from 180 acute myeloid leukemia (AML) patients and show, based on cross-validated prediction error curve estimates, its prediction superiority over variable selection on the full dataset as well as over an alternative clustering approach. The identified signature related to chromatin modifying enzymes was replicated in an independent dataset of AML patients in the phase II AMLSG 12-09 study. In a second application we combine Netboost with Random Forest classification and improve the disease classification error in RNA-sequencing data of Huntington's disease mice. Conclusion: Netboost improves definition of predictive variables for survival analysis and classification. It is a freely available Bioconductor R package for dimension reduction and hypothesis generation in high-dimensional omics applications.

preprint2014arXiv

Extending Statistical Boosting - An Overview of Recent Methodological Developments

Boosting algorithms to simultaneously estimate and select predictor effects in statistical models have gained substantial interest during the last decade. This review article aims to highlight recent methodological developments regarding boosting algorithms for statistical modelling especially focusing on topics relevant for biomedical research. We suggest a unified framework for gradient boosting and likelihood-based boosting (statistical boosting) which have been addressed strictly separated in the literature up to now. Statistical boosting algorithms have been adapted to carry out unbiased variable selection and automated model choice during the fitting process and can nowadays be applied in almost any possible type of regression setting in combination with a large amount of different types of predictor effects. The methodological developments on statistical boosting during the last ten years can be grouped into three different lines of research: (i) efforts to ensure variable selection leading to sparser models, (ii) developments regarding different types of predictor effects and their selection (model choice), (iii) approaches to extend the statistical boosting framework to new regression settings.

preprint2014arXiv

The Evolution of Boosting Algorithms - From Machine Learning to Statistical Modelling

The concept of boosting emerged from the field of machine learning. The basic idea is to boost the accuracy of a weak classifying tool by combining various instances into a more accurate prediction. This general concept was later adapted to the field of statistical modelling. This review article attempts to highlight this evolution of boosting algorithms from machine learning to statistical modelling. We describe the AdaBoost algorithm for classification as well as the two most prominent statistical boosting approaches, gradient boosting and likelihood-based boosting. Although both appraoches are typically treated separately in the literature, they share the same methodological roots and follow the same fundamental concepts. Compared to the initial machine learning algorithms, which must be seen as black-box prediction schemes, statistical boosting result in statistical models which offer a straight-forward interpretation. We highlight the methodological background and present the most common software implementations. Worked out examples and corresponding R code can be found in the Appendix.

preprint2010arXiv

A coordinate-wise optimization algorithm for the Fused Lasso

L1 -penalized regression methods such as the Lasso (Tibshirani 1996) that achieve both variable selection and shrinkage have been very popular. An extension of this method is the Fused Lasso (Tibshirani and Wang 2007), which allows for the incorporation of external information into the model. In this article, we develop new and fast algorithms for solving the Fused Lasso which are based on coordinate-wise optimization. This class of algorithms has recently been applied very successfully to solve L1 -penalized problems very quickly (Friedman et al. 2007). As a straightforward coordinate-wise procedure does not converge to the global optimum in general, we adapt it in two ways, using maximum-flow algorithms and a Huber penalty based approximation to the loss function. In a simulation study, we evaluate the speed of these algorithms and compare them to other standard methods. As the Huber-penalty based method is only approximate, we also evaluate its accuracy. Apart from this, we also extend the Fused Lasso to logistic as well as proportional hazards models and allow for a more flexible penalty structure.

Harald Binder

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Eliciting associations between clinical variables from LLMs via comparison questions across populations

Deep generative models in DataSHIELD

Is there a role for statistics in artificial intelligence?

Transfer learning of regression models from a sequence of datasets by penalized estimation

Netboost: Boosting-supported network analysis improves high-dimensional omics prediction in acute myeloid leukemia and Huntington's disease

Extending Statistical Boosting - An Overview of Recent Methodological Developments

The Evolution of Boosting Algorithms - From Machine Learning to Statistical Modelling

A coordinate-wise optimization algorithm for the Fused Lasso