Source author record

Gesine Reinert

Gesine Reinert appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

math.PR, Machine Learning, math.ST, Statistics Theory, Methodology, Social and Information Networks, q-fin.RM, Applications, Artificial Intelligence, Computation, Genomics, math.AT, math.OC, Neurons and Cognition, physics.soc-ph, q-fin.ST

Catalog footprint

What is connected

24 works

16 topics

4 close collaborators


preprint, 2022, arXiv

A Kernelised Stein Statistic for Assessing Implicit Generative Models

Synthetic data generation has become a key ingredient for training machine learning procedures, addressing tasks such as data augmentation, analysing privacy-sensitive data, or visualising representative samples. Assessing the quality of such synthetic data generators hence has to be addressed. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure to assess the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test which is based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples which are obtained from the synthetic data generator and hence can be applied even when the model is only implicit. In contrast to classical testing, the sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions and trained generative models on synthetic and real datasets illustrate that the method shows improved power performance compared to existing approaches.

preprint, 2022, arXiv

Bounds for the chi-square approximation of Friedman's statistic by Stein's method

Friedman's chi-square test is a non-parametric statistical test for $r\geq2$ treatments across $n\ge1$ trials to assess the null hypothesis that there is no treatment effect. We use Stein's method with an exchangeable pair coupling to derive an explicit bound on the distance between the distribution of Friedman's statistic and its limiting chi-square distribution, measured using smooth test functions. Our bound is of the optimal order $n^{-1}$, and also has an optimal dependence on the parameter $r$, in that the bound tends to zero if and only if $r/n\rightarrow0$. From this bound, we deduce a Kolmogorov distance bound that decays to zero under the weaker condition $r^{1/2}/n\rightarrow0$.
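For readers unfamiliar with the statistic being approximated, the following is a minimal sketch of the textbook form of Friedman's statistic, assuming no ties within a trial. The function name `friedman_statistic` and the data layout are illustrative choices and do not come from the paper.

```python
def friedman_statistic(data):
    """Textbook Friedman statistic, assuming no ties within a trial.

    data[i][j] is the response of treatment j in trial i, so there are
    n = len(data) trials and r = len(data[0]) treatments.
    """
    n = len(data)
    r = len(data[0])
    rank_sums = [0] * r
    for row in data:
        # Rank the treatments within this trial (1 = smallest response).
        order = sorted(range(r), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # Q = 12 / (n r (r + 1)) * sum_j R_j^2 - 3 n (r + 1)
    sum_sq = sum(R * R for R in rank_sums)
    return 12.0 * sum_sq / (n * r * (r + 1)) - 3.0 * n * (r + 1)


if __name__ == "__main__":
    # Three trials that all order the three treatments identically.
    print(friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
```

Under the null hypothesis of no treatment effect, this statistic is compared with a chi-square distribution with $r-1$ degrees of freedom; the bound in the abstract quantifies the error of that comparison. In the example, identical orderings in every trial give the largest possible value, $n(r-1) = 6$.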

preprint, 2022, arXiv

GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks

Recovering global rankings from pairwise comparisons has wide applications from time synchronization to sports team ranking. Pairwise comparisons corresponding to matches in a competition can be construed as edges in a directed graph (digraph), whose nodes represent e.g. competitors with an unknown rank. In this paper, we introduce neural networks into the ranking recovery problem by proposing the so-called GNNRank, a trainable GNN-based framework with digraph embedding. Moreover, new objectives are devised to encode ranking upsets/violations. The framework involves a ranking score estimation approach, and adds an inductive bias by unfolding the Fiedler vector computation of the graph constructed from a learnable similarity matrix. Experimental results on extensive data sets show that our methods attain competitive and often superior performance against baselines, as well as showing promising transfer ability. Code and preprocessed data are available at https://github.com/SherylHYX/GNNRank.
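To make the notion of a ranking upset concrete, here is a minimal sketch that counts the fraction of matches in which the winner has the lower score. This naive count is an illustration only; it is not the differentiable objective used by GNNRank, and the names `upset_fraction`, `matches` and `scores` are hypothetical.

```python
def upset_fraction(matches, scores):
    """Fraction of directed edges (winner, loser) that contradict the scores.

    matches: list of (winner, loser) pairs, i.e. edges of the digraph.
    scores:  dict mapping each competitor to a ranking score (higher = better).
    """
    upsets = sum(1 for winner, loser in matches if scores[winner] < scores[loser])
    return upsets / len(matches)


if __name__ == "__main__":
    # A 3-cycle: a beats b, b beats c, c beats a. No ranking can avoid an upset.
    matches = [("a", "b"), ("b", "c"), ("c", "a")]
    scores = {"a": 3.0, "b": 2.0, "c": 1.0}
    print(upset_fraction(matches, scores))
```

With the scores shown, only the match won by `c` over `a` is an upset, so the fraction is one third.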

preprint, 2022, arXiv

Lead-lag detection and network clustering for multivariate time series with an application to the US equity market

In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.

preprint, 2022, arXiv

Multivariate Central Limit Theorems for Random Clique Complexes

Motivated by open problems in applied and computational algebraic topology, we establish multivariate normal approximation theorems for three random vectors which arise organically in the study of random clique complexes. These are: (1) the vector of critical simplex counts attained by a lexicographical Morse matching, (2) the vector of simplex counts in the link of a fixed simplex, and (3) the vector of total simplex counts. The first of these random vectors forms a cornerstone of modern homology algorithms, while the second one provides a natural generalisation for the notion of vertex degree, and the third one may be viewed from the perspective of U-statistics. To obtain distributional approximations for these random vectors, we extend the notion of dissociated sums to a multivariate setting and prove a new central limit theorem for such sums using Stein's method.

preprint, 2022, arXiv

Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics

As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.

preprint, 2022, arXiv

Relaxing the Gaussian assumption in Shrinkage and SURE in high dimension

Shrinkage estimation is a fundamental tool of modern statistics, pioneered by Charles Stein upon his discovery of the famous paradox involving the multivariate Gaussian. A large portion of the subsequent literature only considers the efficiency of shrinkage, and that of an associated procedure known as Stein's Unbiased Risk Estimate, or SURE, in the Gaussian setting of that original work. We investigate what extensions to the domain of validity of shrinkage and SURE can be made away from the Gaussian through the use of tools developed in the probabilistic area now known as Stein's method. We show that shrinkage is efficient away from the Gaussian under very mild conditions on the distribution of the noise. SURE is also proved to be adaptive under similar assumptions, and in particular in a way that retains the classical asymptotics of Pinsker's theorem. Notably, shrinkage and SURE are shown to be efficient under mild distributional assumptions, and particularly for general isotropic log-concave measures.

preprint, 2022, arXiv

SSSNET: Semi-Supervised Signed Network Clustering

Node embeddings are a powerful tool in the analysis of networks; yet, their full potential for the important task of node clustering has not been fully exploited. In particular, most state-of-the-art methods generating node embeddings of signed networks focus on link sign prediction, and those that pertain to node clustering are usually not graph neural network (GNN) methods. Here, we introduce a novel probabilistic balanced normalized cut loss for training nodes in a GNN framework for semi-supervised signed network clustering, called SSSNET. The method is end-to-end in combining embedding generation and clustering without an intermediate step; it has node clustering as main focus, with an emphasis on polarization effects arising in networks. The main novelty of our approach is a new take on the role of social balance theory for signed network embeddings. The standard heuristic for justifying the criteria for the embeddings hinges on the assumption that "an enemy's enemy is a friend". Here, instead, a neutral stance is assumed on whether or not the enemy of an enemy is a friend. Experimental results on various data sets, including a synthetic signed stochastic block model, a polarized version of it, and real-world data at different scales, demonstrate that SSSNET can achieve comparable or better results than state-of-the-art spectral clustering methods, for a wide range of noise and sparsity levels. SSSNET complements existing methods through the possibility of including exogenous information, in the form of node-level features or labels.

preprint, 2022, arXiv

Stein's Method Meets Computational Statistics: A Review of Some Recent Developments

Stein's method compares probability distributions through the study of a class of linear operators called Stein operators. While mainly studied in probability and used to underpin theoretical statistics, Stein's method has led to significant advances in computational statistics in recent years. The goal of this survey is to bring together some of these recent developments and, in doing so, to stimulate further research into the successful field of Stein's method and statistics. The topics we discuss include tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, parameter estimation and goodness-of-fit testing.
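As a concrete instance of a Stein operator, the classical example for the standard normal distribution can be written out as follows. This is the standard textbook case, given here for orientation rather than as a summary of the survey's contents.

```latex
% Stein operator for the standard normal distribution
\mathcal{A}f(x) = f'(x) - x f(x),
\qquad
\mathbb{E}\big[\mathcal{A}f(Z)\big] = \mathbb{E}\big[f'(Z) - Z f(Z)\big] = 0
\quad \text{for } Z \sim N(0,1).

% For a test function h, let f_h solve the Stein equation
f_h'(x) - x f_h(x) = h(x) - \mathbb{E}[h(Z)],

% so that, for any random variable W,
\mathbb{E}[h(W)] - \mathbb{E}[h(Z)] = \mathbb{E}\big[f_h'(W) - W f_h(W)\big].
```

The identity holds for all sufficiently regular $f$, and the last line is what turns a comparison of two distributions into a bound on the expectation of the operator applied to $f_h$.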

preprint, 2021, arXiv

A Stein Goodness of fit Test for Exponential Random Graph Models

We propose and analyse a novel nonparametric goodness of fit testing procedure for exchangeable exponential random graph models (ERGMs) when a single network realisation is observed. The test determines how likely it is that the observation is generated from a target unnormalised ERGM density. Our test statistics are derived from a kernel Stein discrepancy, a divergence constructed via Stein's method using functions in a reproducing kernel Hilbert space, combined with a discrete Stein operator for ERGMs. The test is a Monte Carlo test based on simulated networks from the target ERGM. We show theoretical properties for the testing procedure for a class of ERGMs. Simulation studies and real network applications are presented.

preprint, 2020, arXiv

Ruin probabilities for risk processes in a bipartite network

This paper studies risk balancing features in an insurance market by evaluating ruin probabilities for single and multiple components of a multivariate compound Poisson risk process. The dependence of the components of the process is induced by a random bipartite network. In analogy with the non-network scenario, a network ruin parameter is introduced. This random parameter, which depends on the bipartite network, is crucial for the ruin probabilities. Under certain conditions on the network and for light-tailed claim size distributions we obtain Lundberg bounds and, for exponential claim size distributions, exact results for the ruin probabilities. For large sparse networks, the network ruin parameter is approximated by a function of independent Poisson variables.

preprint, 2016, arXiv

Bounds for the normal approximation of the maximum likelihood estimator

While the asymptotic normality of the maximum likelihood estimator under regularity conditions is long established, this paper derives explicit bounds for the bounded Wasserstein distance between the distribution of the maximum likelihood estimator (MLE) and the normal distribution. For this task, we employ Stein's method. We focus on independent and identically distributed random variables, covering both discrete and continuous distributions as well as exponential and non-exponential families. In particular, a closed form expression of the MLE is not required. We also use a perturbation method to treat cases where the MLE has positive probability of being on the boundary of the parameter space.

preprint, 2016, arXiv

Estimating the number of communities in a network

Community detection, the division of a network into dense subnetworks with only sparse connections between them, has been a topic of vigorous study in recent years. However, while there exist a range of powerful and flexible methods for dividing a network into a specified number of communities, it is an open question how to determine exactly how many communities one should use. Here we describe a mathematically principled approach for finding the number of communities in a network using a maximum-likelihood method. We demonstrate the approach on a range of real-world examples with known community structure, finding that it is able to determine the number of communities correctly in every case.

preprint, 2016, arXiv

Stein's method for comparison of univariate distributions

We propose a new general version of Stein's method for univariate distributions. In particular we propose a canonical definition of the Stein operator of a probability distribution which is based on a linear difference or differential-type operator. The resulting Stein identity highlights the unifying theme behind the literature on Stein's method (both for continuous and discrete distributions). Viewing the Stein operator as an operator acting on pairs of functions, we provide an extensive toolkit for distributional comparisons. Several abstract approximation theorems are provided. Our approach is illustrated for comparison of several pairs of distributions: normal vs normal, sums of independent Rademacher vs normal, normal vs Student, and maximum of random variables vs exponential, Fréchet and Gumbel.

preprint, 2015, arXiv

Conditional risk measures in a bipartite market structure

In this paper we study the effect of network structure between agents and objects on measures for systemic risk. We model the influence of sharing large exogenous losses to the financial or (re)insurance market by a bipartite graph. Using Pareto-tailed losses and multivariate regular variation we obtain asymptotic results for systemic conditional risk measures based on the Value-at-Risk and the Conditional Tail Expectation. These results allow us to assess the influence of an individual institution on the systemic or market risk and vice versa through a collection of conditional systemic risk measures. For large markets Poisson approximations of the relevant constants are provided in the example of an insurance market. The example of an underlying homogeneous random graph is analysed in detail, and the results are illustrated through simulations.

preprint, 2015, arXiv

Distances between nested densities and a measure of the impact of the prior in Bayesian statistics

In this paper we propose tight upper and lower bounds for the Wasserstein distance between any two univariate continuous distributions with probability densities $p_1$ and $p_2$ having nested supports. These explicit bounds are expressed in terms of the derivative of the likelihood ratio $p_1/p_2$ as well as the Stein kernel $\tau_1$ of $p_1$. The method of proof relies on a new variant of Stein's method which manipulates Stein operators. We give several applications of these bounds. Our main application is in Bayesian statistics: we derive explicit data-driven bounds on the Wasserstein distance between the posterior distribution based on a given prior and the no-prior posterior based uniquely on the sampling distribution. This is the first finite sample result confirming the well-known fact that with well-identified parameters and large sample sizes, reasonable choices of prior distributions will have only minor effects on posterior inferences if the data are benign.

preprint, 2015, arXiv

Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use a MC of the estimated order give a plausible clustering of the species.

preprint, 2015, arXiv

Risk in a large claims insurance market with bipartite graph structure

We model the influence of sharing large exogenous losses to the reinsurance market by a bipartite graph. Using Pareto-tailed claims and multivariate regular variation we obtain asymptotic results for the Value-at-Risk and the Conditional Tail Expectation. We show that the dependence on the network structure plays a fundamental role in their asymptotic behaviour. As is well-known in a non-network setting, if the Pareto exponent is larger than 1, then for the individual agent (reinsurance company) diversification is beneficial, whereas when it is less than 1, concentration on a few objects is the better strategy. An additional aspect of this paper is the amount of uninsured losses which have to be covered by society. In the situation of networks of agents, in our setting diversification is never detrimental concerning the amount of uninsured losses. If the Pareto-tailed claims have finite mean, diversification turns out to be never detrimental, both for society and for individual agents. In contrast, if the Pareto-tailed claims have infinite mean, a conflicting situation may arise between the incentives of individual agents and the interest of some regulator to keep risk for society small. We explain the influence of the network structure on diversification effects in different network scenarios.

preprint, 2014, arXiv

A two-component copula with links to insurance

This paper presents a new copula to model dependencies between insurance entities, by considering how insurance entities are affected by both macro and micro factors. The model used to build the copula assumes that the insurance losses of two companies or lines of business are related through a random common loss factor which is then multiplied by an individual random company factor to get the total loss amounts. The new two-component copula is not Archimedean and it extends the toolkit of copulas for the insurance industry.
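The multiplicative construction described above can be sketched as a small simulation: each company's loss is a shared random factor times its own random factor. The lognormal and gamma distributions, their parameters, and all function names below are assumptions chosen for illustration; the abstract does not specify them.

```python
import random


def simulate_losses(n, seed=0):
    """Simulate n pairs of losses X_k = common factor * individual factor."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n):
        common = rng.lognormvariate(0.0, 1.0)      # macro factor (assumed lognormal)
        x1 = common * rng.gammavariate(2.0, 1.0)   # company 1 factor (assumed gamma)
        x2 = common * rng.gammavariate(2.0, 1.0)   # company 2 factor (assumed gamma)
        losses.append((x1, x2))
    return losses


def pseudo_observations(values):
    """Map a sample to (0, 1) via ranks; this strips off the marginal distribution."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    u = [0.0] * n
    for rank, k in enumerate(order, start=1):
        u[k] = rank / (n + 1)
    return u


def spearman_rho(pairs):
    """Rank correlation, a summary of the dependence captured by the copula."""
    u = pseudo_observations([x for x, _ in pairs])
    v = pseudo_observations([y for _, y in pairs])
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


if __name__ == "__main__":
    print(spearman_rho(simulate_losses(2000)))
```

The rank transform is used because a copula describes dependence separately from the marginals; the shared factor is what induces positive dependence between the two simulated losses.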

preprint, 2013, arXiv

Approximating the epidemic curve

Many models of epidemic spread have a common qualitative structure. The numbers of infected individuals during the initial stages of an epidemic can be well approximated by a branching process, after which the proportion of individuals that are susceptible follows a more or less deterministic course. In this paper, we show that both of these features are consequences of assuming a locally branching structure in the models, and that the deterministic course can itself be determined from the distribution of the limiting random variable associated with the backward, susceptibility branching process. Examples considered include a stochastic version of the Kermack & McKendrick model, the Reed-Frost model, and the Volz configuration model.
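Of the examples named above, the Reed-Frost model is the simplest to simulate. The sketch below uses the standard chain-binomial formulation, in which each susceptible independently escapes infection from each current infective with probability $1-p$. The function name and parameter values are illustrative and are not taken from the paper.

```python
import random


def reed_frost(n_susceptible, n_infective, p, rng):
    """Simulate one Reed-Frost epidemic.

    Returns a list of (susceptibles, infectives) per generation.
    Infectives are removed after one generation.
    """
    s, i = n_susceptible, n_infective
    history = [(s, i)]
    while i > 0:
        # Probability that a given susceptible is infected by at least one infective.
        p_inf = 1.0 - (1.0 - p) ** i
        # Binomial(s, p_inf) draw, written as a sum of Bernoulli trials.
        new_i = sum(1 for _ in range(s) if rng.random() < p_inf)
        s, i = s - new_i, new_i
        history.append((s, i))
    return history


if __name__ == "__main__":
    for generation, (s, i) in enumerate(reed_frost(200, 1, 0.01, random.Random(42))):
        print(generation, s, i)
```

While few individuals have been infected, each infective produces new cases almost independently of the others, which is the branching-process regime the abstract refers to.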

preprint, 2013, arXiv

Stein's method for the Beta distribution and the Pólya-Eggenberger Urn

Using a characterizing equation for the Beta distribution, Stein's method is applied to obtain bounds of the optimal order for the Wasserstein distance between the distribution of the scaled number of white balls drawn from a Pólya-Eggenberger urn and its limiting Beta distribution. The bound is computed by making a direct comparison between characterizing operators of the target and the Beta distribution, the former derived by extending Stein's density approach to discrete distributions. In addition, refinements are given to Döbler's result [12] for the Arcsine approximation for the fraction of time a simple random walk of even length spends positive, and so also to the distributions of its last return time to zero and its first visit to its terminal point, by supplying explicit constants to the present Wasserstein bound and also demonstrating that its rate is of the optimal order.
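The urn scheme behind this result can be simulated directly. The sketch below assumes the simplest version, in which the drawn ball is returned together with one extra ball of the same colour; in that case the classical limit of the fraction of white draws is a Beta distribution with parameters equal to the initial numbers of white and black balls. The function name and parameters are hypothetical.

```python
import random


def polya_white_fraction(white, black, draws, rng):
    """Fraction of white balls drawn from a Polya-Eggenberger urn.

    Assumes one extra ball of the drawn colour is added after every draw.
    """
    white_drawn = 0
    for _ in range(draws):
        if rng.random() < white / (white + black):
            white += 1
            white_drawn += 1
        else:
            black += 1
    return white_drawn / draws


if __name__ == "__main__":
    rng = random.Random(0)
    # Starting from one white and one black ball, the limit is Beta(1, 1), i.e. uniform.
    sample = [polya_white_fraction(1, 1, 200, rng) for _ in range(500)]
    print(sum(sample) / len(sample))
```

The paper's contribution is an explicit Wasserstein bound on the distance between the distribution of this scaled count and its Beta limit; the simulation only illustrates the object being approximated.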

preprint, 2010, arXiv

Invariance principles for homogeneous sums: Universality of Gaussian Wiener chaos

We compute explicit bounds in the normal and chi-square approximations of multilinear homogeneous sums (of arbitrary order) of general centered independent random variables with unit variance. In particular, we show that chaotic random variables enjoy the following form of universality: (a) the normal and chi-square approximations of any homogeneous sum can be completely characterized and assessed by first switching to its Wiener chaos counterpart, and (b) the simple upper bounds and convergence criteria available on the Wiener chaos extend almost verbatim to the class of homogeneous sums.

preprint, 2010, arXiv

Second order Poincaré inequalities and CLTs on Wiener space

We prove infinite-dimensional second order Poincaré inequalities on Wiener space, thus closing a circle of ideas linking limit theorems for functionals of Gaussian fields, Stein's method and Malliavin calculus. We provide two applications: (i) to a new "second order" characterization of CLTs on a fixed Wiener chaos, and (ii) to linear functionals of Gaussian-subordinated fields.

preprint, 2009, arXiv

Multivariate normal approximation with Stein's method of exchangeable pairs under a general linearity condition

In this paper we establish a multivariate exchangeable pairs approach within the framework of Stein's method to assess distributional distances to potentially singular multivariate normal distributions. By extending the statistics into a higher-dimensional space, we also propose an embedding method which allows for a normal approximation even when the corresponding statistics of interest do not lend themselves easily to Stein's exchangeable pairs approach. To illustrate the method, we provide the examples of runs on the line as well as double-indexed permutation statistics.
