Researcher profile

Bernhard C. Geiger

Bernhard C. Geiger contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2026arXiv

Information Plane Analysis of Binary Neural Networks

Information plane (IP) analysis has been suggested to study the training dynamics of deep neural networks through mutual information (MI) between inputs, representations, and targets. However, its statistical validity is often compromised by the difficulty of estimating MI from samples of high-dimensional, deterministic representations. In this work, we perform IP analyses on binary neural networks (BNNs) where activations are discrete and MI is finite. We characterise the finite-sample behaviour of the plug-in entropy estimator and identify regimes for sample size $N$ and representation dimensionality $D$ under which MI estimates are reliable. Outside these regimes, we show that empirical MI estimates saturate to $\log_2 N$, rendering IP trajectories uninformative. Restricting attention to the reliable regime, we train 375 BNNs to investigate the existence of late-stage compression phases and the relationship between compressed representations and generalisation performance. Our results show that while late-stage compression is frequently observed, compressed latent representations do not consistently correlate with improved generalization performance. Instead, the relationship between compression and generalisation is highly dependent on task, architecture, and regularisation.

preprint2026arXiv

Universal Outlier Hypothesis Testing via Mean- and Median-Based Tests

Universal outlier hypothesis testing refers to a hypothesis testing problem where one observes a large number of length-$n$ sequences -- the majority of which are distributed according to the typical distribution $π$ and a small number are distributed according to the outlier distribution $μ$ -- and one wishes to decide, which of these sequences are outliers without having knowledge of $π$ and $μ$. In contrast to previous works, in this paper it is assumed that both the number of observation sequences and the number of outlier sequences grow with the sequence length. In this case, the typical distribution $π$ can be estimated by computing the mean over all observation sequences, provided that the number of outlier sequences is sublinear in the total number of sequences. It is demonstrated that, in this case, one can achieve the error exponent of the maximum likelihood test that has access to both $π$ and $μ$. However, this mean-based test performs poorly when the number of outlier sequences is proportional to the total number of sequences. For this case, a median-based test is proposed that estimates $π$ as the median of all observation sequences. It is demonstrated that the median-based test achieves again the error exponent of the maximum likelihood test that has access to both $π$ and $μ$, but only with probability approaching one. To formalize this case, the typical error exponent -- similar to the typical random coding exponent introduced in the context of random coding for channel coding -- is proposed.

preprint2022arXiv

Generating Simple Directed Social Network Graphs for Information Spreading

Online social networks are a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can connect with other users by following them, who in turn can follow back. In recent years, researchers studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between reciprocal (i.e., two directed edges of opposite direction between two nodes) and directed edges. We propose an approach to generate directed social network graphs that creates reciprocal and directed edges and considers the correlation between the respective degree sequences. Our model relies on crawled directed graphs in Twitter, on which information w.r.t. a topic is exchanged or disseminated. While these graphs exhibit a high clustering coefficient and small average distances between random node pairs (which is typical in real-world networks), their degree sequences seem to follow a $χ^2$-distribution rather than power law. To achieve high clustering coefficients, we apply an edge rewiring procedure that preserves the node degrees. We compare the crawled and the created graphs, and simulate certain algorithms for information dissemination and epidemic spreading on them. The results show that the created graphs exhibit very similar topological and algorithmic properties as the real-world graphs, providing evidence that they can be used as surrogates in social network analysis. Furthermore, our model is highly scalable, which enables us to create graphs of arbitrary size with almost the same properties as the corresponding real-world networks.

preprint2022arXiv

Information-Theoretic Reduction of Markov Chains

We survey information-theoretic approaches to the reduction of Markov chains. Our survey is structured in two parts: The first part considers Markov chain coarse graining, which focuses on projecting the Markov chain to a process on a smaller state space that is informative}about certain quantities of interest. The second part considers Markov chain model reduction, which focuses on replacing the original Markov model by a simplified one that yields similar behavior as the original Markov model. We discuss the practical relevance of both approaches in the field of knowledge discovery and data mining by formulating problems of unsupervised machine learning as reduction problems of Markov chains. Finally, we briefly discuss the concept of lumpability, the phenomenon when a coarse graining yields a reduced Markov model.

preprint2022arXiv

Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation

We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art.

preprint2021arXiv

Importance of feature engineering and database selection in a machine learning model: A case study on carbon crystal structures

Drive towards improved performance of machine learning models has led to the creation of complex features representing a database of condensed matter systems. The complex features, however, do not offer an intuitive explanation on which physical attributes do improve the performance. The effect of the database on the performance of the trained model is often neglected. In this work we seek to understand in depth the effect that the choice of features and the properties of the database have on a machine learning application. In our experiments, we consider the complex phase space of carbon as a test case, for which we use a set of simple, human understandable and cheaply computable features for the aim of predicting the total energy of the crystal structure. Our study shows that (i) the performance of the machine learning model varies depending on the set of features and the database, (ii) is not transferable to every structure in the phase space and (iii) depends on how well structures are represented in the database.

preprint2021arXiv

Synwalk -- Community Detection via Random Walk Modelling

Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, respectively, and compare Synwalk's performance with the performance of Infomap and Walktrap. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has a potential to inspire further development of community detection via synthesis of random walks and we provide concrete ideas for future research.

preprint2020arXiv

A Formally Robust Time Series Distance Metric

Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various different distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: Robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.

preprint2020arXiv

SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder

We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and which is implemented in a typical Wasserstein auto-encoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: (a) interpolation between any pair of points in the latent space produces realistically-looking samples; (b) combining the interpolation property with disentangled class and style variables, SeGMA is able to perform a continuous style transfer from one class to another; (c) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.

preprint2019arXiv

Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle

In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that rather than trying to repair the inherent problems in the IB functional, a better approach may be to design regularizers on latent representation enforcing the desired properties directly.

preprint2012arXiv

Relative Information Loss - An Introduction

We introduce a relative variant of information loss to characterize the behavior of deterministic input-output systems. We show that the relative loss is closely related to Renyi's information dimension. We provide an upper bound for continuous input random variables and an exact result for a class of functions (comprising quantizers) with infinite absolute information loss. A connection between relative information loss and reconstruction error is investigated.

preprint2011arXiv

Information Loss in Static Nonlinearities

In this work, conditional entropy is used to quantify the information loss induced by passing a continuous random variable through a memoryless nonlinear input-output system. We derive an expression for the information loss depending on the input density and the nonlinearity and show that the result is strongly related to the non-injectivity of the considered system. Tight upper bounds are presented, which can be evaluated with less difficulty than a direct evaluation of the information loss, which involves the logarithm of a sum. Application of our results is illustrated on a set of examples.

preprint2011arXiv

On the Information Loss in Memoryless Systems: The Multivariate Case

In this work we give a concise definition of information loss from a system-theoretic point of view. Based on this definition, we analyze the information loss in static input-output systems subject to a continuous-valued input. For a certain class of multiple-input, multiple-output systems the information loss is quantified. An interpretation of this loss is accompanied by upper bounds which are simple to evaluate. Finally, a class of systems is identified for which the information loss is necessarily infinite. Quantizers and limiters are shown to belong to this class.

preprint2011arXiv

Some Results on the Information Loss in Dynamical Systems

In this work we investigate the information loss in (nonlinear) dynamical input-output systems and provide some general results. In particular, we present an upper bound on the information loss rate, defined as the (non-negative) difference between the entropy rates of the jointly stationary stochastic processes at the input and output of the system. We further introduce a family of systems with vanishing information loss rate. It is shown that not only linear filters belong to that family, but - under certain circumstances - also finite-precision implementations of the latter, which typically consist of nonlinear elements.