Researcher profile

Christian Hennig

Christian Hennig contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2022arXiv

Clustering of football players based on performance data and aggregated clustering validity indexes

We analyse football (soccer) player performance data with mixed type variables from the 2014-15 season of eight European major leagues. We cluster these data based on a tailor-made dissimilarity measure. In order to decide between the many available clustering methods and to choose an appropriate number of clusters, we use the approach by Akhanli and Hennig (2020). This is based on several validation criteria that refer to different desirable characteristics of a clustering. These characteristics are chosen based on the aim of clustering, and this allows to define a suitable validation index as weighted average of calibrated individual indexes measuring the desirable features. We derive two different clusterings. The first one is a partition of the data set into major groups of essentially different players, which can be used for the analysis of a team's composition. The second one divides the data set into many small clusters (with 10 players on average), which can be used for finding players with a very similar profile to a given player. It is discussed in depth what characteristics are desirable for these clusterings. Weighting the criteria for the second clustering is informed by a survey of football experts.

preprint2022arXiv

Validation of cluster analysis results on validation data: A systematic framework

Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To judge the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic structured framework for validating clustering results on validation data that includes most existing validation approaches. In particular, we review classical validation techniques such as internal and external validation, stability analysis, hypothesis testing, and visual validation, and show how they can be interpreted in terms of our framework. We precisely define and formalise different types of validation of clustering results on a validation dataset and explain how each type can be implemented in practice. Furthermore, we give examples of how clustering studies from the applied literature that used a validation dataset can be classified into the framework.

preprint2021arXiv

An empirical comparison and characterisation of nine popular clustering methods

Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets come with a "true" clustering. On these data sets the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given "true" clustering.

preprint2020arXiv

Cluster validation by measurement of clustering characteristics relevant to the user

There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method ran with different parameters such as different numbers of clusters. There are many cluster validation indexes in the literature. Most of them attempt to measure the overall quality of a clustering by a single number, but this can be inappropriate. There are various different characteristics of a clustering that can be relevant in practice, depending on the aim of clustering, such as low within-cluster distances and high between-cluster separation. In this paper, a number of validation criteria will be introduced that refer to different desirable characteristics of a clustering, and that characterise a clustering in a multidimensional way. In specific applications the user may be interested in some of these criteria rather than others. A focus of the paper is on methodology to standardise the different characteristics so that users can aggregate them in a suitable way specifying weights for the various criteria that are relevant in the clustering application at hand.

preprint2020arXiv

Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings on the same data. Two specific aggregated indexes are proposed and compared with existing indexes on simulated and real data.

preprint2020arXiv

Minkowski distances and standardisation for clustering and classification of high dimensional data

There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, $L_1$ (city block)-, $L_2$ (Euclidean)-, $L_3$-, $L_4$-, and maximum distances are combined with different schemes of standardisation of the variables before aggregating them. Boxplot transformation is proposed, a new transformation method for a single variable that standardises the majority of observations but brings outliers closer to the main bulk of the data. Distances are compared in simulations for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours, of data with a low number of observations but high dimensionality. The $L_1$-distance and the boxplot transformation show good results.

preprint2015arXiv

Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

There are two notoriously hard problems in cluster analysis, estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed type data, temporal and spatial autocorrelation.

preprint2015arXiv

What are the true clusters?

Constructivist philosophy and Hasok Chang's active scientific realism are used to argue that the idea of "truth" in cluster analysis depends on the context and the clustering aims. Different characteristics of clusterings are required in different situations. Researchers should be explicit about on what requirements and what idea of "true clusters" their research is based, because clustering becomes scientific not through uniqueness but through transparent and open communication. The idea of "natural kinds" is a human construct, but it highlights the human experience that the reality outside the observer's control seems to make certain distinctions between categories inevitable. Various desirable characteristics of clusterings and various approaches to define a context-dependent truth are listed, and I discuss what impact these ideas can have on the comparison of clustering methods, and the choice of a clustering methods and related decisions in practice.

preprint2013arXiv

Quantile-based classifiers

Quantile classifiers for potentially high-dimensional data are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this is consistent, for $n \to \infty$, for the classification rule with asymptotically optimal quantile, and that, under some assumptions, for $p\to\infty$ the probability of correct classification converges to one. The role of skewness of the involved variables is discussed, which leads to an improved classifier. The optimal quantile classifier performs very well in a comprehensive simulation study and a real data set from chemistry (classification of bioaerosols) compared to nine other classifiers, including the support vector machine and the recently proposed median-based classifier (Hall et al., 2009), which inspired the quantile classifier.