Researcher profile

Olaf Wolkenhauer

Olaf Wolkenhauer contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - Baseline
4works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2022arXiv

ConvGeN: Convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets

Data is commonly stored in tabular format. Several fields of research are prone to small imbalanced tabular data. Supervised Machine Learning on such data is often difficult due to class imbalance. Synthetic data generation, i.e., oversampling, is a common remedy used to improve classifier performance. State-of-the-art linear interpolation approaches, such as LoRAS and ProWRAS can be used to generate synthetic samples from the convex space of the minority class to improve classifier performance in such cases. Deep generative networks are common deep learning approaches for synthetic sample generation, widely used for synthetic image generation. However, their scope on synthetic tabular data generation in the context of imbalanced classification is not adequately explored. In this article, we show that existing deep generative models perform poorly compared to linear interpolation based approaches for imbalanced classification problems on smaller tabular datasets. To overcome this, we propose a deep generative model, ConvGeN that combines the idea of convex space learning with deep generative models. ConvGeN learns the coefficients for the convex combinations of the minority class samples, such that the synthetic data is distinct enough from the majority class. Our benchmarking experiments demonstrate that our proposed model ConvGeN improves imbalanced classification on such small datasets, as compared to existing deep generative models, while being at-par with the existing linear interpolation approaches. Moreover, we discuss how our model can be used for synthetic tabular data generation in general, even outside the scope of data imbalance and thus, improves the overall applicability of convex space learning.

preprint2022arXiv

On the role of data, statistics and decisions in a pandemic

A pandemic poses particular challenges to decision-making because of the need to continuously adapt decisions to rapidly changing evidence and available data. For example, which countermeasures are appropriate at a particular stage of the pandemic? How can the severity of the pandemic be measured? What is the effect of vaccination in the population and which groups should be vaccinated first? The process of decision-making starts with data collection and modeling and continues to the dissemination of results and the subsequent decisions taken. The goal of this paper is to give an overview of this process and to provide recommendations for the different steps from a statistical perspective. In particular, we discuss a range of modeling techniques including mathematical, statistical and decision-analytic models along with their applications in the COVID-19 context. With this overview, we aim to foster the understanding of the goals of these modeling approaches and the specific data requirements that are essential for the interpretation of results and for successful interdisciplinary collaborations. A special focus is on the role played by data in these different models, and we incorporate into the discussion the importance of statistical literacy, and of effective dissemination and communication of findings.

preprint2020arXiv

LoRAS: An oversampling approach for imbalanced datasets

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely-used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and effecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm with 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE and several SMOTE extensions that share the concept of using convex combinations of minority class data points for oversampling with LoRAS. We observed that LoRAS, on average generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested, improve the F1-Score with respect to SMOTE on an average, they compromise on the Balanced accuracy of a classification model. LoRAS on the contrary, improves both F1 Score and the Balanced accuracy thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.

preprint2014arXiv

Accounting for Randomness in Measurement and Sampling in Study of Cancer Cell Population Dynamics

Studying the development of malignant tumours, it is important to know and predict the proportions of different cell types in tissue samples. Knowing the expected temporal evolution of the proportion of normal tissue cells, compared to stem-like and non-stem like cancer cells, gives an indication about the progression of the disease and indicates the expected response to interventions with drugs. Such processes have been modeled using Markov processes. An essential step for the simulation of such models is then the determination of state transition probabilities. We here consider the experimentally more realistic scenario in which the measurement of cell population sizes is noisy, leading to a particular hidden Markov model. In this context, randomness in measurement is related to noisy measurements, which are used for the estimation of the transition probability matrix. Randomness in sampling, on the other hand, is here related to the error in estimating the state probability from small cell populations. Using aggregated data of fluorescence-activated cell sorting (FACS) measurement, we develop a minimum mean square error estimator (MMSE) and maximum likelihood (ML) estimator and formulate two problems to find the minimum number of required samples and measurements to guarantee the accuracy of predicted population sizes using a transition probability matrix estimated from noisy data. We analyze the properties of two estimators for different noise distributions and prove an optimal solution for Gaussian distributions with the MMSE. Our numerical results show, that for noisy measurements the convergence mechanism of transition probabilities and steady states differ widely from the real values if one uses the standard deterministic approach in which measurements are assumed to be noise free.