Source author record

Fabian Scheipl

Fabian Scheipl appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Applications Machine Learning Computation

Catalog footprint

What is connected

12works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A geometric framework for outlier detection in high-dimensional data

Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework that exploits the metric structure of a data set. Our approach rests on the manifold assumption, i.e., that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high-dimensional data. We also suggest a novel, mathematically precise, and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.

preprint2022arXiv

Enhancing cluster analysis via topological manifold learning

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: theoretical arguments and empirical evidence show that clustering embedding vectors, representing the structure of a data manifold instead of the observed feature vectors themselves, is highly beneficial. To demonstrate, we combine manifold learning method UMAP for inferring the topological structure with density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how \textit{separable} the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.

preprint2021arXiv

Benchmarking time series classification -- Functional data vs machine learning approaches

Time series classification problems have drawn increasing attention in the machine learning and statistical community. Closely related is the field of functional data analysis (FDA): it refers to the range of problems that deal with the analysis of data that is continuously indexed over some domain. While often employing different methods, both fields strive to answer similar questions, a common example being classification or regression problems with functional covariates. We study methods from functional data analysis, such as functional generalized additive models, as well as functionality to concatenate (functional-) feature extraction or basis representations with traditional machine learning algorithms like support vector machines or classification trees. In order to assess the methods and implementations, we run a benchmark on a wide variety of representative (time series) data sets, with in-depth analysis of empirical results, and strive to provide a reference ranking for which method(s) to use for non-expert practitioners. Additionally, we provide a software framework in R for functional data analysis for supervised learning, including machine learning and more linear approaches from statistics. This allows convenient access, and in connection with the machine-learning toolbox mlr, those methods can now also be tuned and benchmarked.

preprint2021arXiv

Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia

Discrete but ordered covariates are quite common in applied statistics, and some regularized fitting procedures have been proposed for proper handling of ordinal predictors in statistical modeling. In this study, we show how quadratic penalties on adjacent dummy coefficients of ordinal predictors proposed in the literature can be incorporated in the framework of generalized additive models, making tools for statistical inference developed there available for ordinal predictors as well. Motivated by an application from neonatal medicine, we discuss whether results obtained when constructing confidence intervals and testing significance of smooth terms in generalized additive models are useful with ordinal predictors/penalties as well.

preprint2018arXiv

A General Framework for Multivariate Functional Principal Component Analysis of Amplitude and Phase Variation

Functional data typically contains amplitude and phase variation. In many data situations, phase variation is treated as a nuisance effect and is removed during preprocessing, although it may contain valuable information. In this note, we focus on joint principal component analysis (PCA) of amplitude and phase variation. As the space of warping functions has a complex geometric structure, one key element of the analysis is transforming the warping functions to $L^2(\mathcal{T})$. We present different transformation approaches and show how they fit into a general class of transformations. This allows us to compare their strengths and limitations. In the context of PCA, our results offer arguments in favour of the centered log-ratio transformation. We further embed existing approaches from Hadjipantelis et al. (2015) and Lee and Jung (2017) for joint PCA of amplitude and phase variation into the framework of multivariate functional PCA, where we study the properties of the estimators based on an appropriate metric. The approach is illustrated through an application from seismology.

preprint2016arXiv

Fast symmetric additive covariance smoothing

We propose a fast bivariate smoothing approach for symmetric surfaces that has a wide range of applications. We show how it can be applied to estimate the covariance function in longitudinal data as well as multiple additive covariances in functional data with complex correlation structures. Our symmetric smoother can handle (possibly noisy) data sampled on a common, dense grid as well as irregularly or sparsely sampled data. Estimation is based on bivariate penalized spline smoothing using a mixed model representation and the symmetry is used to reduce computation time compared to the usual non-symmetric smoothers. We outline the application of our approach in functional principal component analysis and demonstrate its practical value in two applications. The approach is evaluated in extensive simulations. We provide documented open source software implementing our fast symmetric bivariate smoother building on established algorithms for additive models.

preprint2016arXiv

Generalized Functional Additive Mixed Models

We propose a comprehensive framework for additive regression models for non-Gaussian functional responses, allowing for multiple (partially) nested or crossed functional random effects with flexible correlation structures for, e.g., spatial, temporal, or longitudinal functional data as well as linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the index of the functional response. Our implementation handles functional responses from any exponential family distribution as well as many others like Beta- or scaled non-central $t$-distributions. Development is motivated by and evaluated on an application to large-scale longitudinal feeding records of pigs. Results in extensive simulation studies as well as replications of two previously published simulation studies for generalized functional mixed models demonstrate the good performance of our proposal. The approach is implemented in well-documented open source software in the "pffr()" function in R-package "refund".

preprint2016arXiv

Identifiability in penalized function-on-function regression models

Regression models with functional responses and covariates constitute a powerful and increasingly important model class. However, regression with functional data poses well known and challenging problems of non-identifiability. This non-identifiability can manifest itself in arbitrarily large errors for coefficient surface estimates despite accurate predictions of the responses, thus invalidating substantial interpretations of the fitted models. We offer an accessible rephrasing of these identifiability issues in realistic applications of penalized linear function-on-function-regression and delimit the set of circumstances under which they are likely to occur in practice. Specifically, non-identifiability that persists under smoothness assumptions on the coefficient surface can occur if the functional covariate's empirical covariance has a kernel which overlaps that of the roughness penalty of the spline estimator. Extensive simulation studies validate the theoretical insights, explore the extent of the problem and allow us to evaluate their practical consequences under varying assumptions about the data generating processes. A case study illustrates the practical significance of the problem. Based on theoretical considerations and our empirical evaluation, we provide immediately applicable diagnostics for lack of identifiability and give recommendations for avoiding estimation artifacts in practice.

preprint2013arXiv

Functional Additive Mixed Models

We propose an extensive framework for additive regression models for correlated functional responses, allowing for multiple partially nested or crossed functional random effects with flexible correlation structures for, e.g., spatial, temporal, or longitudinal functional data. Additionally, our framework includes linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the index of the functional response. It accommodates densely or sparsely observed functional responses and predictors which may be observed with additional error and includes both spline-based and functional principal component-based terms. Estimation and inference in this framework is based on standard additive mixed models, allowing us to take advantage of established methods and robust, flexible algorithms. We provide easy-to-use open source software in the pffr() function for the R-package refund. Simulations show that the proposed method recovers relevant effects reliably, handles small sample sizes well and also scales to larger data sets. Applications with spatially and longitudinally observed functional data demonstrate the flexibility in modeling and interpretability of results of our approach.

preprint2013arXiv

Penalized Likelihood and Bayesian Function Selection in Regression Models

Challenging research in various fields has driven a wide range of methodological advances in variable selection for regression models with high-dimensional predictors. In comparison, selection of nonlinear functions in models with additive predictors has been considered only more recently. Several competing suggestions have been developed at about the same time and often do not refer to each other. This article provides a state-of-the-art review on function selection, focusing on penalized likelihood and Bayesian concepts, relating various approaches to each other in a unified framework. In an empirical comparison, also including boosting, we evaluate several methods through applications to simulated and real data, thereby providing some guidance on their performance in practice.

preprint2011arXiv

Spike-and-Slab Priors for Function Selection in Structured Additive Regression Models

Structured additive regression provides a general framework for complex Gaussian and non-Gaussian regression models, with predictors comprising arbitrary combinations of nonlinear functions and surfaces, spatial effects, varying coefficients, random effects and further regression terms. The large flexibility of structured additive regression makes function selection a challenging and important task, aiming at (1) selecting the relevant covariates, (2) choosing an appropriate and parsimonious representation of the impact of covariates on the predictor and (3) determining the required interactions. We propose a spike-and-slab prior structure for function selection that allows to include or exclude single coefficients as well as blocks of coefficients representing specific model terms. A novel multiplicative parameter expansion is required to obtain good mixing and convergence properties in a Markov chain Monte Carlo simulation approach and is shown to induce desirable shrinkage properties. In simulation studies and with (real) benchmark classification data, we investigate sensitivity to hyperparameter settings and compare performance to competitors. The flexibility and applicability of our approach are demonstrated in an additive piecewise exponential model with time-varying effects for right-censored survival times of intensive care patients with sepsis. Geoadditive and additive mixed logit model applications are discussed in an extensive appendix.

preprint2011arXiv

spikeSlabGAM: Bayesian Variable Selection, Model Choice and Regularization for Generalized Additive Mixed Models in R

The R package spikeSlabGAM implements Bayesian variable selection, model choice, and regularized estimation in (geo-)additive mixed models for Gaussian, binomial, and Poisson responses. Its purpose is to (1) choose an appropriate subset of potential covariates and their interactions, (2) to determine whether linear or more flexible functional forms are required to model the effects of the respective covariates, and (3) to estimate their shapes. Selection and regularization of the model terms is based on a novel spike-and-slab-type prior on coefficient groups associated with parametric and semi-parametric effects.

Fabian Scheipl

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

A geometric framework for outlier detection in high-dimensional data

Enhancing cluster analysis via topological manifold learning

Benchmarking time series classification -- Functional data vs machine learning approaches

Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia

A General Framework for Multivariate Functional Principal Component Analysis of Amplitude and Phase Variation

Fast symmetric additive covariance smoothing

Generalized Functional Additive Mixed Models

Identifiability in penalized function-on-function regression models

Functional Additive Mixed Models

Penalized Likelihood and Bayesian Function Selection in Regression Models

Spike-and-Slab Priors for Function Selection in Structured Additive Regression Models

spikeSlabGAM: Bayesian Variable Selection, Model Choice and Regularization for Generalized Additive Mixed Models in R