Source author record

Andreas Mayr

Andreas Mayr appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Machine Learning; Methodology; Applications; Biomolecules; Computation; Computer Science and Game Theory; Artificial Intelligence; cond-mat.mes-hall; Genomics; Neural and Evolutionary Computing; physics.comp-ph; Populations and Evolution; Quantitative Methods

Catalog footprint

17 works, 13 topics, 4 close collaborators

preprint, 2026, arXiv

LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .

preprint, 2022, arXiv

Boosting Distributional Copula Regression

Capturing complex dependence structures between outcome variables (e.g., study endpoints) is of high relevance in contemporary biomedical data problems and medical research. Distributional copula regression provides a flexible tool to model the joint distribution of multiple outcome variables by disentangling the marginal response distributions and their dependence structure. In a regression setup each parameter of the copula model, i.e. the marginal distribution parameters and the copula dependence parameters, can be related to covariates via structured additive predictors. We propose a framework to fit distributional copula regression models via a model-based boosting algorithm. Model-based boosting is a modern estimation technique that incorporates useful features like an intrinsic variable selection mechanism, parameter shrinkage and the capability to fit regression models in high dimensional data setting, i.e. situations with more covariates than observations. Thus, model-based boosting does not only complement existing Bayesian and maximum-likelihood based estimation frameworks for this model class but rather enables unique intrinsic mechanisms that can be helpful in many applied problems. The performance of our boosting algorithm in the context of copula regression models with continuous margins is evaluated in simulation studies that cover low- and high-dimensional data settings and situations with and without dependence between the responses. Moreover, distributional copula boosting is used to jointly analyze and predict the length and the weight of newborns conditional on sonographic measurements of the fetus before delivery together with other clinical variables.

preprint, 2022, arXiv

Boosting Multivariate Structured Additive Distributional Regression Models

We develop a model-based boosting approach for multivariate distributional regression within the framework of generalized additive models for location, scale, and shape. Our approach enables the simultaneous modeling of all distribution parameters of an arbitrary parametric distribution of a multivariate response conditional on explanatory variables, while being applicable to potentially high-dimensional data. Moreover, the boosting algorithm incorporates data-driven variable selection, taking various different types of effects into account. As a special merit of our approach, it allows for modelling the association between multiple continuous or discrete outcomes through the relevant covariates. After a detailed simulation study investigating estimation and prediction performance, we demonstrate the full flexibility of our approach in three diverse biomedical applications. The first is based on high-dimensional genomic cohort data from the UK Biobank, considering a bivariate binary response (chronic ischemic heart disease and high cholesterol). Here, we are able to identify genetic variants that are informative for the association between cholesterol and heart disease. The second application considers the demand for health care in Australia with the number of consultations and the number of prescribed medications as a bivariate count response. The third application analyses two dimensions of childhood undernutrition in Nigeria as a bivariate response and we find that the correlation between the two undernutrition scores is considerably different depending on the child's age and the region the child lives in.

preprint, 2022, arXiv

Deselection of Base-Learners for Statistical Boosting -- with an Application to Distributional Regression

We present a new procedure for enhanced variable selection for component-wise gradient boosting. Statistical boosting is a computational approach that emerged from machine learning, which allows fitting regression models in the presence of high-dimensional data. Furthermore, the algorithm can lead to data-driven variable selection. In practice, however, the final models typically tend to include too many variables in some situations. This occurs particularly for low-dimensional data (p<n), where we observe a slow overfitting behavior of boosting. As a result, more variables are included in the final model without altering the prediction accuracy. Many of these false positives are incorporated with a small coefficient and therefore have a small impact, but lead to a larger model. We try to overcome this issue by giving the algorithm the chance to deselect base-learners with minor importance. We analyze the impact of the new approach on variable selection and prediction performance in comparison to alternative methods including boosting with earlier stopping as well as twin boosting. We illustrate our approach with data of an ongoing cohort study for chronic kidney disease patients, where the most influential predictors for the health-related quality of life measure are selected in a distributional regression approach based on beta regression.

preprint, 2022, arXiv

Redundancy-aware unsupervised ranking based on game theory -- application to gene enrichment analysis

Gene set collections are a common ground to study the enrichment of genes for specific phenotypic traits. Gene set enrichment analysis aims to identify genes that are over-represented in gene set collections and might be associated with a specific phenotypic trait. However, as this involves a massive number of hypothesis tests, it is often questionable whether a pre-processing step to reduce the size of gene set collections is helpful. Moreover, the often highly overlapping gene sets and the consequent low interpretability of gene set collections call for a reduction of the included gene sets. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain sets' importance scores by computing Shapley values without incurring the usual exponential number of evaluations of the value function. Moreover, we address the challenge of including a redundancy awareness in the rankings obtained where, in our case, sets are redundant if they show prominent intersections. We finally evaluate our approach for gene set collections; the rankings obtained show low redundancy and high coverage of the genes. The unsupervised nature of the proposed ranking does not allow for an evident increase in the number of significant gene sets for specific phenotypic traits when reducing the size of the collections. However, we believe that the rankings proposed are of use in bioinformatics to increase interpretability of the gene set collections and a step toward including redundancy in Shapley value computations.
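
For readers unfamiliar with Shapley values, the following is a minimal brute-force sketch of how set importance scores can be defined, not the efficient method the abstract refers to. The value function (number of distinct genes covered by a coalition of sets), the toy collection `gene_sets`, and the function names are all assumptions made for illustration. The sketch enumerates every ordering of the sets, which is exactly the exponential cost the abstract says the proposed method avoids.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical toy collection of gene sets (illustrative only).
gene_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3"},
    "C": {"g4"},
}


def coverage(coalition):
    """Assumed value function: number of distinct genes covered by the coalition."""
    covered = set()
    for name in coalition:
        covered |= gene_sets[name]
    return len(covered)


scores = shapley_values(list(gene_sets), coverage)
for name, score in sorted(scores.items()):
    print(name, score)
```

In this toy game, set B overlaps entirely with set A, so the credit for the shared genes is split between them and B scores lower than its size alone would suggest. The scores always sum to the coverage of the full collection.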

preprint, 2022, arXiv

Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data

Not all real-world data are labeled, and when labels are not available, it is often costly to obtain them. Moreover, as many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevancy of single features to the task at hand. These scores can be based only on the distribution of variables and the quantification of their interactions. The previous literature, mainly investigating anomaly detection and clusters, fails to address the redundancy-elimination issue. We propose an evaluation of correlations among features to compute feature importance scores representing the contribution of single features in explaining the dataset's structure. Based on Coalitional Game Theory, our feature importance scores include a notion of redundancy awareness making them a tool to achieve redundancy-free feature selection. We show that the deriving features' selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximated version of the algorithm to reduce the complexity of Shapley values' computations.

preprint, 2021, arXiv

Cross-Domain Few-Shot Learning by Representation Fusion

In order to quickly adapt to new data, few-shot learning aims at learning from few examples, often by using already acquired knowledge. The new data often differs from the previously seen data due to a domain shift, that is, a change of the input-target distribution. While several methods perform well on small domain shifts like new target classes with similar inputs, larger domain shifts are still challenging. Large domain shifts may result in high-level concepts that are not shared between the original and the new domain, whereas low-level concepts like edges in images might still be shared and useful. For cross-domain few-shot learning, we suggest representation fusion to unify different abstraction levels of a deep neural network into one representation. We propose Cross-domain Hebbian Ensemble Few-shot learning (CHEF), which achieves representation fusion by an ensemble of Hebbian learners acting on different layers of a deep neural network. Ablation studies show that representation fusion is a decisive factor to boost cross-domain few-shot learning. On the few-shot datasets miniImagenet and tieredImagenet with small domain shifts, CHEF is competitive with state-of-the-art methods. On cross-domain few-shot benchmark challenges with larger domain shifts, CHEF establishes novel state-of-the-art results in all categories. We further apply CHEF on a real-world cross-domain application in drug discovery. We consider a domain shift from bioactive molecules to environmental chemicals and drugs with twelve associated toxicity prediction tasks. On these tasks, that are highly relevant for computational drug discovery, CHEF significantly outperforms all its competitors. Github: https://github.com/ml-jku/chef

preprint, 2021, arXiv

Estimating effective infection fatality rates during the course of the COVID-19 pandemic in Germany

The infection fatality rate (IFR) of the Coronavirus Disease 2019 (COVID-19) is one of the most discussed figures in the context of this pandemic. Using German COVID-19 surveillance data and age-group specific IFR estimates from multiple international studies, this work investigates time-dependent variations in effective IFR over the course of the pandemic. Three different methods for estimating (effective) IFRs are presented: (a) population-averaged IFRs based on the assumption that the infection risk is independent of age and time, (b) effective IFRs based on the assumption that the age distribution of confirmed cases approximately reflects the age distribution of infected individuals, and (c) effective IFRs accounting for age- and time-dependent dark figures of infections. Results show that effective IFRs in Germany are estimated to vary over time, as the age distributions of confirmed cases and estimated infections are changing during the course of the pandemic. In particular during the first and second waves of infections in spring and autumn/winter 2020, there has been a pronounced shift in the age distribution of confirmed cases towards older age groups, resulting in larger effective IFR estimates. The temporary increase in effective IFR during the first wave is estimated to be smaller but still remains when adjusting for age- and time-dependent dark figures. A comparison of effective IFRs with observed CFRs indicates that a substantial fraction of the time-dependent variability in observed mortality can be explained by changes in the age distribution of infections. Furthermore, a vanishing gap between effective IFRs and observed CFRs is apparent after the first infection wave, while a moderately increasing gap can be observed during the second wave. Further research is warranted to obtain timely age-stratified IFR estimates.
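
Method (b) in the abstract above can be read as a weighted average: each age group's IFR is weighted by that group's share of infections in a given period. The sketch below shows this arithmetic under that reading. The age groups, IFR values and case counts are hypothetical numbers chosen for illustration, not figures from the paper, and `effective_ifr` is an illustrative name.

```python
def effective_ifr(age_specific_ifr, infections_by_age):
    """Weighted average of age-specific IFRs.

    Weights are each age group's share of all infections in the period.
    """
    total = sum(infections_by_age.values())
    if total <= 0:
        raise ValueError("need at least one infection")
    return sum(
        age_specific_ifr[group] * count / total
        for group, count in infections_by_age.items()
    )


# Hypothetical age-specific IFRs (as proportions) and case counts.
ifr = {"0-34": 0.0001, "35-59": 0.002, "60-79": 0.03, "80+": 0.15}
summer_cases = {"0-34": 600, "35-59": 300, "60-79": 80, "80+": 20}
winter_cases = {"0-34": 400, "35-59": 300, "60-79": 200, "80+": 100}

print("summer:", effective_ifr(ifr, summer_cases))
print("winter:", effective_ifr(ifr, winter_cases))
```

With the age-specific IFRs held fixed, the effective IFR rises in the second period only because the case mix shifts toward older age groups, which is the mechanism the abstract describes.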

preprint, 2020, arXiv

Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks

Due to the current severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, there is an urgent need for novel therapies and drugs. We conducted a large-scale virtual screening for small molecules that are potential CoV-2 inhibitors. To this end, we utilized "ChemAI", a deep neural network trained on more than 220M data points across 3.6M molecules from three public drug-discovery databases. With ChemAI, we screened and ranked one billion molecules from the ZINC database for favourable effects against CoV-2. We then reduced the result to the 30,000 top-ranked compounds, which are readily accessible and purchasable via the ZINC database. Additionally, we screened the DrugBank using ChemAI to allow for drug repurposing, which would be a fast way towards a therapy. We provide these top-ranked compounds of ZINC and DrugBank as a library for further screening with bioassays at https://github.com/ml-jku/sars-cov-inhibitors-chemai.

preprint, 2016, arXiv

Boosting Joint Models for Longitudinal and Time-to-Event Data

Joint Models for longitudinal and time-to-event data have gained a lot of attention in the last few years as they are a helpful technique to approach a common data structure in clinical studies where longitudinal outcomes are recorded alongside event times. Those two processes are often linked and the two outcomes should thus be modeled jointly in order to prevent the potential bias introduced by independent modelling. Commonly, joint models are estimated in likelihood-based expectation maximization or Bayesian approaches using frameworks where variable selection is problematic and which do not immediately work for high-dimensional data. In this paper, we propose a boosting algorithm tackling these challenges by being able to simultaneously estimate predictors for joint models and automatically select the most influential variables even in high-dimensional data situations. We analyse the performance of the new algorithm in a simulation study and apply it to the Danish cystic fibrosis registry which collects longitudinal lung function data on patients with cystic fibrosis together with data regarding the onset of pulmonary infections. This is the first approach to combine state-of-the-art algorithms from the field of machine-learning with the model class of joint models, providing a fully data-driven mechanism to select variables and predictor effects in a unified framework of boosting joint models.

preprint, 2016, arXiv

Signal Regression Models for Location, Scale and Shape with an Application to Stock Returns

We discuss scalar-on-function regression models where all parameters of the assumed response distribution can be modeled depending on covariates. We thus combine signal regression models with generalized additive models for location, scale and shape (GAMLSS). We compare two fundamentally different methods for estimation, a gradient boosting and a penalized likelihood based approach, and address practically important points like identifiability and model choice. Estimation by a component-wise gradient boosting algorithm allows for high dimensional data settings and variable selection. Estimation by a penalized likelihood based approach has the advantage of directly provided statistical inference. The motivating application is a time series of stock returns where it is of interest to model both the expectation and the variance depending on lagged response values and functional liquidity curves.

preprint, 2015, arXiv

Toxicity Prediction using Deep Learning

Every day we are exposed to various chemicals via food additives, cleaning and cosmetic products and medicines -- and some of them might be toxic. However, testing the toxicity of all existing compounds by biological experiments is neither financially nor logistically feasible. Therefore, the government agencies NIH, EPA and FDA launched the Tox21 Data Challenge within the "Toxicology in the 21st Century" (Tox21) initiative. The goal of this challenge was to assess the performance of computational methods in predicting the toxicity of chemical compounds. State-of-the-art toxicity prediction methods build upon specifically-designed chemical descriptors developed over decades. Though Deep Learning is new to the field and was never applied to toxicity prediction before, it clearly outperformed all other participating methods. In this application paper we show that deep nets automatically learn features resembling well-established toxicophores. In total, our Deep Learning approach won both of the panel-challenges (nuclear receptors and stress response) as well as the overall Grand Challenge, and thereby sets a new standard in tox prediction.

preprint, 2014, arXiv

Extending Statistical Boosting - An Overview of Recent Methodological Developments

Boosting algorithms to simultaneously estimate and select predictor effects in statistical models have gained substantial interest during the last decade. This review article aims to highlight recent methodological developments regarding boosting algorithms for statistical modelling especially focusing on topics relevant for biomedical research. We suggest a unified framework for gradient boosting and likelihood-based boosting (statistical boosting) which have been addressed strictly separated in the literature up to now. Statistical boosting algorithms have been adapted to carry out unbiased variable selection and automated model choice during the fitting process and can nowadays be applied in almost any possible type of regression setting in combination with a large amount of different types of predictor effects. The methodological developments on statistical boosting during the last ten years can be grouped into three different lines of research: (i) efforts to ensure variable selection leading to sparser models, (ii) developments regarding different types of predictor effects and their selection (model choice), (iii) approaches to extend the statistical boosting framework to new regression settings.

preprint, 2014, arXiv

gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework

Generalized additive models for location, scale and shape (GAMLSS) are a flexible class of regression models that allow modeling multiple parameters of a distribution function, such as the mean and the standard deviation, simultaneously. With the R package gamboostLSS, we provide a boosting method to fit these models. Variable selection and model choice are naturally available within this regularized regression framework. To introduce and illustrate the R package gamboostLSS and its infrastructure, we use a data set on stunted growth in India. In addition to the specification and application of the model itself, we present a variety of convenience functions, including methods for tuning parameter selection, prediction and visualization of results. The package gamboostLSS is available from CRAN (http://cran.r-project.org/package=gamboostLSS).

preprint, 2014, arXiv

The Evolution of Boosting Algorithms - From Machine Learning to Statistical Modelling

The concept of boosting emerged from the field of machine learning. The basic idea is to boost the accuracy of a weak classifying tool by combining various instances into a more accurate prediction. This general concept was later adapted to the field of statistical modelling. This review article attempts to highlight this evolution of boosting algorithms from machine learning to statistical modelling. We describe the AdaBoost algorithm for classification as well as the two most prominent statistical boosting approaches, gradient boosting and likelihood-based boosting. Although both approaches are typically treated separately in the literature, they share the same methodological roots and follow the same fundamental concepts. Compared to the initial machine learning algorithms, which must be seen as black-box prediction schemes, statistical boosting results in statistical models which offer a straightforward interpretation. We highlight the methodological background and present the most common software implementations. Worked-out examples and corresponding R code can be found in the Appendix.
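
To make the statistical flavour of gradient boosting concrete, here is a minimal sketch of component-wise gradient boosting with squared-error loss and one simple linear base-learner per covariate. It is written in plain Python for illustration and is not the R code from the paper's appendix or any package implementation. The function name, the step length `nu`, the iteration count `n_iter` and the toy data are assumptions.

```python
def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting with simple linear base-learners.

    Returns (intercept, coefficients) of an interpretable linear model.
    """
    n, p = len(X), len(X[0])
    offset = sum(y) / n
    # Center each covariate so base-learners need no intercept of their own.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]
    coef = [0.0] * p
    fit = [offset] * n
    for _ in range(n_iter):
        # Negative gradient of the squared-error loss: the current residuals.
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best_j, best_beta, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(cols):
            sxx = sum(c * c for c in col)
            if sxx == 0.0:
                continue  # constant covariate carries no information
            beta = sum(c * r for c, r in zip(col, resid)) / sxx
            rss = sum((r - beta * c) ** 2 for c, r in zip(col, resid))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        if best_j is None:
            break
        # Update only the best-fitting component, shrunken by nu.
        coef[best_j] += nu * best_beta
        fit = [f + nu * best_beta * c for f, c in zip(fit, cols[best_j])]
    intercept = offset - sum(b * m for b, m in zip(coef, means))
    return intercept, coef


# Toy data (illustrative): y depends on the first covariate only.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
y = [1 + 2 * row[0] for row in X]
intercept, coef = componentwise_l2_boost(X, y)
print(intercept, coef)
```

Because only one base-learner is updated per iteration, covariates that are never selected keep a coefficient of exactly zero. This is the source of the variable selection behaviour, and the result remains an ordinary, interpretable linear model.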

preprint, 2013, arXiv

Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations

The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are only suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a component-wise boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.
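
As a rough illustration of what the concordance index measures, the sketch below computes a simple, unsmoothed Harrell-type estimate for right-censored data. This is an assumption made for illustration: the abstract's method optimizes a smoothed version, and the exact estimator used in the paper may differ. The function name and toy data are hypothetical.

```python
def concordance_index(times, events, scores):
    """Harrell-type concordance index for right-censored data.

    A pair (i, j) is comparable if subject i had an observed event strictly
    before subject j's observed time. It is concordant if subject i also has
    the higher risk score; tied scores count one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot be the earlier event in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (illustrative): observed times, event indicators (1 = event,
# 0 = censored) and risk scores from some marker combination.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [3.0, 2.0, 2.5, 0.5]
print(concordance_index(times, events, scores))
```

A value of 1 means the scores rank all comparable pairs correctly, and 0.5 corresponds to random ranking. Because this pairwise count is a step function of the scores, it cannot be optimized directly by gradient methods, which is presumably why the abstract works with a smoothed version.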

preprint, 2012, arXiv

Design and Simulation of Molecular Nonvolatile Single-Electron Resistive Switches

We have carried out a preliminary design and simulation of a single-electron resistive switch based on a system of two linear, parallel, electrostatically-coupled molecules: one implementing a single-electron transistor and another serving as a single-electron trap. To verify our design, we have performed a theoretical analysis of this "memristive" device, based on a combination of ab-initio calculations of the electronic structures of the molecules and the general theory of single-electron tunneling in systems with discrete energy spectra. Our results show that such molecular assemblies, with a length below 10 nm and a footprint area of about 5 nm$^2$, may combine sub-second switching times with multi-year retention times and high ($> 10^3$) ON/OFF current ratios, at room temperature. Moreover, Monte Carlo simulations of self-assembled monolayers (SAM) based on such molecular assemblies have shown that such monolayers may also be used as resistive switches, with comparable characteristics and, in addition, be highly tolerant to defects and stray offset charges.
