Source author record

Bernhard C. Geiger

Bernhard C. Geiger appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Theory math.IT Machine Learning math.PR nlin.SI Social and Information Networks Artificial Intelligence Computer Vision cond-mat.mtrl-sci physics.comp-ph physics.soc-ph

Catalog footprint

What is connected

25works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Information Plane Analysis of Binary Neural Networks

Information plane (IP) analysis has been suggested to study the training dynamics of deep neural networks through mutual information (MI) between inputs, representations, and targets. However, its statistical validity is often compromised by the difficulty of estimating MI from samples of high-dimensional, deterministic representations. In this work, we perform IP analyses on binary neural networks (BNNs) where activations are discrete and MI is finite. We characterise the finite-sample behaviour of the plug-in entropy estimator and identify regimes for sample size $N$ and representation dimensionality $D$ under which MI estimates are reliable. Outside these regimes, we show that empirical MI estimates saturate to $\log_2 N$, rendering IP trajectories uninformative. Restricting attention to the reliable regime, we train 375 BNNs to investigate the existence of late-stage compression phases and the relationship between compressed representations and generalisation performance. Our results show that while late-stage compression is frequently observed, compressed latent representations do not consistently correlate with improved generalization performance. Instead, the relationship between compression and generalisation is highly dependent on task, architecture, and regularisation.

preprint2026arXiv

Universal Outlier Hypothesis Testing via Mean- and Median-Based Tests

Universal outlier hypothesis testing refers to a hypothesis testing problem where one observes a large number of length-$n$ sequences -- the majority of which are distributed according to the typical distribution $π$ and a small number are distributed according to the outlier distribution $μ$ -- and one wishes to decide, which of these sequences are outliers without having knowledge of $π$ and $μ$. In contrast to previous works, in this paper it is assumed that both the number of observation sequences and the number of outlier sequences grow with the sequence length. In this case, the typical distribution $π$ can be estimated by computing the mean over all observation sequences, provided that the number of outlier sequences is sublinear in the total number of sequences. It is demonstrated that, in this case, one can achieve the error exponent of the maximum likelihood test that has access to both $π$ and $μ$. However, this mean-based test performs poorly when the number of outlier sequences is proportional to the total number of sequences. For this case, a median-based test is proposed that estimates $π$ as the median of all observation sequences. It is demonstrated that the median-based test achieves again the error exponent of the maximum likelihood test that has access to both $π$ and $μ$, but only with probability approaching one. To formalize this case, the typical error exponent -- similar to the typical random coding exponent introduced in the context of random coding for channel coding -- is proposed.

preprint2022arXiv

Generating Simple Directed Social Network Graphs for Information Spreading

Online social networks are a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can connect with other users by following them, who in turn can follow back. In recent years, researchers studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between reciprocal (i.e., two directed edges of opposite direction between two nodes) and directed edges. We propose an approach to generate directed social network graphs that creates reciprocal and directed edges and considers the correlation between the respective degree sequences. Our model relies on crawled directed graphs in Twitter, on which information w.r.t. a topic is exchanged or disseminated. While these graphs exhibit a high clustering coefficient and small average distances between random node pairs (which is typical in real-world networks), their degree sequences seem to follow a $χ^2$-distribution rather than power law. To achieve high clustering coefficients, we apply an edge rewiring procedure that preserves the node degrees. We compare the crawled and the created graphs, and simulate certain algorithms for information dissemination and epidemic spreading on them. The results show that the created graphs exhibit very similar topological and algorithmic properties as the real-world graphs, providing evidence that they can be used as surrogates in social network analysis. Furthermore, our model is highly scalable, which enables us to create graphs of arbitrary size with almost the same properties as the corresponding real-world networks.

preprint2022arXiv

Information-Theoretic Reduction of Markov Chains

We survey information-theoretic approaches to the reduction of Markov chains. Our survey is structured in two parts: The first part considers Markov chain coarse graining, which focuses on projecting the Markov chain to a process on a smaller state space that is informative}about certain quantities of interest. The second part considers Markov chain model reduction, which focuses on replacing the original Markov model by a simplified one that yields similar behavior as the original Markov model. We discuss the practical relevance of both approaches in the field of knowledge discovery and data mining by formulating problems of unsupervised machine learning as reduction problems of Markov chains. Finally, we briefly discuss the concept of lumpability, the phenomenon when a coarse graining yields a reduced Markov model.

preprint2022arXiv

Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation

We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art.

preprint2021arXiv

Importance of feature engineering and database selection in a machine learning model: A case study on carbon crystal structures

Drive towards improved performance of machine learning models has led to the creation of complex features representing a database of condensed matter systems. The complex features, however, do not offer an intuitive explanation on which physical attributes do improve the performance. The effect of the database on the performance of the trained model is often neglected. In this work we seek to understand in depth the effect that the choice of features and the properties of the database have on a machine learning application. In our experiments, we consider the complex phase space of carbon as a test case, for which we use a set of simple, human understandable and cheaply computable features for the aim of predicting the total energy of the crystal structure. Our study shows that (i) the performance of the machine learning model varies depending on the set of features and the database, (ii) is not transferable to every structure in the phase space and (iii) depends on how well structures are represented in the database.

preprint2021arXiv

Synwalk -- Community Detection via Random Walk Modelling

Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, respectively, and compare Synwalk's performance with the performance of Infomap and Walktrap. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has a potential to inspire further development of community detection via synthesis of random walks and we provide concrete ideas for future research.

preprint2020arXiv

A Formally Robust Time Series Distance Metric

Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various different distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: Robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.

preprint2020arXiv

SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder

We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and which is implemented in a typical Wasserstein auto-encoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: (a) interpolation between any pair of points in the latent space produces realistically-looking samples; (b) combining the interpolation property with disentangled class and style variables, SeGMA is able to perform a continuous style transfer from one class to another; (c) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.

preprint2019arXiv

Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle

In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that rather than trying to repair the inherent problems in the IB functional, a better approach may be to design regularizers on latent representation enforcing the desired properties directly.

preprint2016arXiv

A Rate-Distortion Approach to Caching

This paper takes a rate-distortion approach to understanding the information-theoretic laws governing cache-aided communications systems. Specifically, we characterise the optimal tradeoffs between the delivery rate, cache capacity and reconstruction distortions for a single-user problem and some special cases of a two-user problem. Our analysis considers discrete memoryless sources, expected- and excess-distortion constraints, and separable and f-separable distortion functions. We also establish a strong converse for separable-distortion functions, and we show that lossy versions of common information (Gács-Körner and Wyner) play an important role in caching. Finally, we illustrate and explicitly evaluate these laws for multivariate Gaussian sources and binary symmetric sources.

preprint2016arXiv

Graph-Based Lossless Markov Lumpings

We use results from zero-error information theory to determine the set of non-injective functions through which a Markov chain can be projected without losing information. These lumping functions can be found by clique partitioning of a graph related to the Markov chain. Lossless lumping is made possible by exploiting the (sufficiently sparse) temporal structure of the Markov chain. Eliminating edges in the transition graph of the Markov chain trades the required output alphabet size versus information loss, for which we present bounds.

preprint2016arXiv

Greedy Algorithms for Optimal Distribution Approximation

The approximation of a discrete probability distribution $\mathbf{t}$ by an $M$-type distribution $\mathbf{p}$ is considered. The approximation error is measured by the informational divergence $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. It is shown that $M$-type approximations that minimize either $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, or $\mathbb{D}(\mathbf{p}\Vert\mathbf{t})$, or the variational distance $\Vert\mathbf{p}-\mathbf{t}\Vert_1$ can all be found by using specific instances of the same general greedy algorithm.

preprint2015arXiv

Lumpings of Markov chains, entropy rate preservation, and higher-order lumpability

A lumping of a Markov chain is a coordinate-wise projection of the chain. We characterise the entropy rate preservation of a lumping of an aperiodic and irreducible Markov chain on a finite state space by the random growth rate of the cardinality of the realisable preimage of a finite-length trajectory of the lumped chain and by the information needed to reconstruct original trajectories from their lumped images. Both are purely combinatorial criteria, depending only on the transition graph of the Markov chain and the lumping function. A lumping is strongly k-lumpable, iff the lumped process is a k-th order Markov chain for each starting distribution of the original Markov chain. We characterise strong k-lumpability via tightness of stationary entropic bounds. In the sparse setting, we give sufficient conditions on the lumping to both preserve the entropy rate and be strongly k-lumpable.

preprint2014arXiv

Information Loss and Anti-Aliasing Filters in Multirate Systems

This work investigates the information loss in a decimation system, i.e., in a downsampler preceded by an anti-aliasing filter. It is shown that, without a specific signal model in mind, the anti-aliasing filter cannot reduce information loss, while, e.g., for a simple signal-plus-noise model it can. For the Gaussian case, the optimal anti-aliasing filter is shown to coincide with the one obtained from energetic considerations. For a non-Gaussian signal corrupted by Gaussian noise, the Gaussian assumption yields an upper bound on the information loss, justifying filter design principles based on second-order statistics from an information-theoretic point-of-view.

preprint2013arXiv

Influence of Doppler Bin Width on GPS Acquisition Probabilities

Acquisition is a search in two continuous dimensions, where the digital algorithms require a partitioning of the search space into cells. Depending on the partitioning of the Doppler frequency domain, more than one cell might contain significant signal energy. We present an expression for the expected values of the cells' energies to analyze the impact of the Doppler bin width on detection and false alarm probabilities.

preprint2013arXiv

Information Measures for Deterministic Input-Output Systems

In this work the information loss in deterministic, memoryless systems is investigated by evaluating the conditional entropy of the input random variable given the output random variable. It is shown that for a large class of systems the information loss is finite, even if the input is continuously distributed. Based on this finiteness, the problem of perfectly reconstructing the input is addressed and Fano-type bounds between the information loss and the reconstruction error probability are derived. For systems with infinite information loss a relative measure is defined and shown to be tightly related to Rényi information dimension. Employing another Fano-type argument, the reconstruction error probability is bounded by the relative information loss from below. In view of developing a system theory from an information-theoretic point-of-view, the theoretical results are illustrated by a few example systems, among them a multi-channel autocorrelation receiver.

preprint2013arXiv

Information-Preserving Markov Aggregation

We present a sufficient condition for a non-injective function of a Markov chain to be a second-order Markov chain with the same entropy rate as the original chain. This permits an information-preserving state space reduction by merging states or, equivalently, lossless compression of a Markov source on a sample-by-sample basis. The cardinality of the reduced state space is bounded from below by the node degrees of the transition graph associated with the original Markov chain. We also present an algorithm listing all possible information-preserving state space reductions, for a given transition graph. We illustrate our results by applying the algorithm to a bi-gram letter model of an English text.

preprint2013arXiv

On the Rate of Information Loss in Memoryless Systems

In this work we present results about the rate of (relative) information loss induced by passing a real-valued, stationary stochastic process through a memoryless system. We show that for a special class of systems the information loss rate is closely related to the difference of differential entropy rates of the input and output processes. It is further shown that the rate of (relative) information loss is bounded from above by the (relative) information loss the system induces on a random variable distributed according to the process's marginal distribution. As a side result, in this work we present sufficient conditions such that for a continuous-valued Markovian input process also the output process possesses the Markov property.

preprint2013arXiv

Signal Enhancement as Minimization of Relevant Information Loss

We introduce the notion of relevant information loss for the purpose of casting the signal enhancement problem in information-theoretic terms. We show that many algorithms from machine learning can be reformulated using relevant information loss, which allows their application to the aforementioned problem. As a particular example we analyze principle component analysis for dimensionality reduction, discuss its optimality, and show that the relevant information loss can indeed vanish if the relevant information is concentrated on a lower-dimensional subspace of the input space.

preprint2012arXiv

Relative Information Loss - An Introduction

We introduce a relative variant of information loss to characterize the behavior of deterministic input-output systems. We show that the relative loss is closely related to Renyi's information dimension. We provide an upper bound for continuous input random variables and an exact result for a class of functions (comprising quantizers) with infinite absolute information loss. A connection between relative information loss and reconstruction error is investigated.

preprint2012arXiv

Relative Information Loss in the PCA

In this work we analyze principle component analysis (PCA) as a deterministic input-output system. We show that the relative information loss induced by reducing the dimensionality of the data after performing the PCA is the same as in dimensionality reduction without PCA. Finally, we analyze the case where the PCA uses the sample covariance matrix to compute the rotation. If the rotation matrix is not available at the output, we show that an infinite amount of information is lost. The relative information loss is shown to decrease with increasing sample size.

preprint2011arXiv

Information Loss in Static Nonlinearities

In this work, conditional entropy is used to quantify the information loss induced by passing a continuous random variable through a memoryless nonlinear input-output system. We derive an expression for the information loss depending on the input density and the nonlinearity and show that the result is strongly related to the non-injectivity of the considered system. Tight upper bounds are presented, which can be evaluated with less difficulty than a direct evaluation of the information loss, which involves the logarithm of a sum. Application of our results is illustrated on a set of examples.

preprint2011arXiv

On the Information Loss in Memoryless Systems: The Multivariate Case

In this work we give a concise definition of information loss from a system-theoretic point of view. Based on this definition, we analyze the information loss in static input-output systems subject to a continuous-valued input. For a certain class of multiple-input, multiple-output systems the information loss is quantified. An interpretation of this loss is accompanied by upper bounds which are simple to evaluate. Finally, a class of systems is identified for which the information loss is necessarily infinite. Quantizers and limiters are shown to belong to this class.

preprint2011arXiv

Some Results on the Information Loss in Dynamical Systems

In this work we investigate the information loss in (nonlinear) dynamical input-output systems and provide some general results. In particular, we present an upper bound on the information loss rate, defined as the (non-negative) difference between the entropy rates of the jointly stationary stochastic processes at the input and output of the system. We further introduce a family of systems with vanishing information loss rate. It is shown that not only linear filters belong to that family, but - under certain circumstances - also finite-precision implementations of the latter, which typically consist of nonlinear elements.

Bernhard C. Geiger

What is connected

Connect this record

See the researcher in context

Building this map preview

25 published item(s)

Information Plane Analysis of Binary Neural Networks

Universal Outlier Hypothesis Testing via Mean- and Median-Based Tests

Generating Simple Directed Social Network Graphs for Information Spreading

Information-Theoretic Reduction of Markov Chains

Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation

Importance of feature engineering and database selection in a machine learning model: A case study on carbon crystal structures

Synwalk -- Community Detection via Random Walk Modelling

A Formally Robust Time Series Distance Metric

SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder

Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle

A Rate-Distortion Approach to Caching

Graph-Based Lossless Markov Lumpings

Greedy Algorithms for Optimal Distribution Approximation

Lumpings of Markov chains, entropy rate preservation, and higher-order lumpability

Information Loss and Anti-Aliasing Filters in Multirate Systems

Influence of Doppler Bin Width on GPS Acquisition Probabilities

Information Measures for Deterministic Input-Output Systems

Information-Preserving Markov Aggregation

On the Rate of Information Loss in Memoryless Systems

Signal Enhancement as Minimization of Relevant Information Loss

Relative Information Loss - An Introduction

Relative Information Loss in the PCA

Information Loss in Static Nonlinearities

On the Information Loss in Memoryless Systems: The Multivariate Case

Some Results on the Information Loss in Dynamical Systems