Source author record

Frank Nielsen

Frank Nielsen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: Unclaimed source record

Machine Learning; Information Theory; math.IT; Computer Vision; Computational Geometry; q-fin.ST; Computational Engineering, Finance, and Science; Human-Computer Interaction; math.ST; Methodology; Statistics Theory; Computational Complexity; Cryptography and Security; Graphics; Information Retrieval

Catalog footprint

What is connected

43 works

15 topics

4 close collaborators


preprint, 2022, arXiv

A note on Onicescu's informational energy and correlation coefficient in exponential families

The informational energy of Onicescu is a positive quantity that measures the amount of uncertainty of a random variable. But contrary to Shannon's entropy, the informational energy increases when randomness decreases. We report a closed-form formula for Onicescu's informational energy and its associated correlation coefficient when the probability distributions belong to an exponential family. We show how to instantiate the generic formula for several common exponential families.
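To make the contrast with Shannon's entropy concrete, here is a minimal sketch for the discrete case, where the informational energy is the sum of squared probabilities. The function names and the two example distributions are illustrative assumptions and do not come from the paper, which treats exponential families in closed form.

```python
import math

def informational_energy(p):
    """Onicescu's informational energy of a discrete distribution: sum of p_i^2."""
    return sum(pi * pi for pi in p)

def shannon_entropy(p):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical examples: a maximally random and a nearly deterministic distribution.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

for name, p in (("uniform", uniform), ("peaked", peaked)):
    print(name, informational_energy(p), shannon_entropy(p))
```

The peaked distribution has the larger informational energy and the smaller entropy, which is the opposite ordering the abstract describes.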

preprint, 2022, arXiv

A note on some information-theoretic divergences between Zeta distributions

We consider the zeta distributions which are discrete power law distributions that can be interpreted as the counterparts of the continuous Pareto distributions with unit scale. The family of zeta distributions forms a discrete exponential family with normalizing constants expressed using the Riemann zeta function. We report several information-theoretic measures between zeta distributions and study their underlying information geometry.
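As a rough illustration of the exponential-family structure, the sketch below evaluates the Kullback-Leibler divergence between two zeta distributions $p_s(x) = x^{-s}/\zeta(s)$ in two ways: directly from the definition, and through the identity $(s_2 - s_1)\,E_{s_1}[\log x] + \log\zeta(s_2) - \log\zeta(s_1)$, which follows from writing $p_s$ with sufficient statistic $\log x$. The truncation level `N_TERMS` and all function names are assumptions made here; the infinite sums are only approximated, so the values are approximations of the true divergence.

```python
import math

N_TERMS = 50_000  # assumption: truncation level for the infinite support {1, 2, ...}

def truncated_zeta(s, n_terms=N_TERMS):
    """Partial sum approximating the Riemann zeta normalizer for s > 1."""
    return sum(x ** -s for x in range(1, n_terms + 1))

def mean_log(s, n_terms=N_TERMS):
    """Approximate E_s[log x], the expected sufficient statistic."""
    z = truncated_zeta(s, n_terms)
    return sum(math.log(x) * x ** -s for x in range(1, n_terms + 1)) / z

def kl_zeta(s1, s2, n_terms=N_TERMS):
    """KL(p_s1 : p_s2) via the exponential-family identity."""
    return ((s2 - s1) * mean_log(s1, n_terms)
            + math.log(truncated_zeta(s2, n_terms))
            - math.log(truncated_zeta(s1, n_terms)))

def kl_zeta_direct(s1, s2, n_terms=N_TERMS):
    """KL(p_s1 : p_s2) summed term by term from the definition."""
    z1 = truncated_zeta(s1, n_terms)
    z2 = truncated_zeta(s2, n_terms)
    total = 0.0
    for x in range(1, n_terms + 1):
        p = x ** -s1 / z1
        q = x ** -s2 / z2
        total += p * math.log(p / q)
    return total

print(kl_zeta(2.5, 3.5), kl_zeta_direct(2.5, 3.5))
```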

preprint, 2022, arXiv

On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models

A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and an improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.

preprint, 2022, arXiv

The analytic dually flat space of the mixture family of two prescribed distinct Cauchy distributions

A smooth and strictly convex function on an open convex domain induces both (1) a Hessian manifold with respect to the standard flat Euclidean connection, and (2) a dually flat space of information geometry. We first review these constructions and illustrate how to instantiate them for (a) full regular exponential families from their partition functions, (b) regular homogeneous cones from their characteristic functions, and (c) mixture families from their Shannon negentropy functions. Although these structures can be explicitly built for many common examples of the first two classes, the differential entropy of a continuous statistical mixture with distinct prescribed density components sharing the same support is hitherto not known in closed form, hence forcing implementations of mixture family manifolds in practice using Monte Carlo sampling. In this work, we report a notable exception: the family of mixtures defined as the convex combination of two prescribed and distinct Cauchy distributions. As a byproduct, we report a closed-form formula for the Jensen-Shannon divergence between two mixtures of two prescribed Cauchy components.

preprint, 2022, arXiv

Tractable structured natural gradient descent using local parameterizations

Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using local-parameter coordinates to obtain a flexible and efficient NGD method that works well for a wide variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.

preprint, 2021, arXiv

Likelihood Ratio Exponential Families

The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling. Linking these two ideas, recent work has interpreted the geometric mixture path as an exponential family of distributions to analyze the thermodynamic variational objective (TVO). We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization, the information bottleneck (IB) method, and recent rate-distortion-classification approaches which combine RD and IB. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results to provide a variational representation of intermediate RD or TVO distributions as minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff using the likelihood ratio test and the Neyman-Pearson lemma. In thermodynamic integration bounds such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function.

preprint, 2021, arXiv

On information projections between multivariate elliptical and location-scale families

We study information projections with respect to statistical $f$-divergences between any two location-scale families. We consider a multivariate generalization of the location-scale families which includes the elliptical and the spherical subfamilies. By using the action of the multivariate location-scale group, we show how to reduce the calculation of $f$-divergences between any two location-scale densities to canonical settings involving standard densities, and derive from this fast Monte Carlo estimators of $f$-divergences with good properties. Finally, we prove that the minimum $f$-divergence between a prescribed density of a location-scale family and another location-scale family is independent of the prescribed location-scale parameter. We interpret this property geometrically.

preprint, 2021, arXiv

On the Kullback-Leibler divergence between discrete normal distributions

Discrete normal distributions are defined as the distributions with prescribed means and covariance matrices which maximize entropy on the integer lattice support. The set of discrete normal distributions forms an exponential family with cumulant function related to the Riemann theta function. In this paper, we present several formulas for common statistical divergences between discrete normal distributions including the Kullback-Leibler divergence. In particular, we describe an efficient approximation technique for calculating the Kullback-Leibler divergence between discrete normal distributions via the Rényi $α$-divergences or the projective $γ$-divergences.

preprint, 2021, arXiv

On the Kullback-Leibler divergence between location-scale densities

We show that the $f$-divergence between any two densities of potentially different location-scale families can be reduced to the calculation of the $f$-divergence between one standard density and another location-scale density. It follows that the $f$-divergence between two scale densities depends only on the scale ratio. We then report conditions on the standard distribution to get symmetric $f$-divergences: First, we prove that all $f$-divergences between densities of a location family are symmetric whenever the standard density is even, and second, we illustrate a generic symmetric property with the calculation of the Kullback-Leibler divergence between scale Cauchy distributions. Finally, we show that the minimum $f$-divergence of any query density of a location-scale family to another location-scale family is independent of the query location-scale parameters.
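The scale-ratio and symmetry properties can be checked numerically on the Cauchy case. The sketch below uses the known closed form $\log\frac{(s_1+s_2)^2 + (l_1-l_2)^2}{4 s_1 s_2}$ for the Kullback-Leibler divergence between Cauchy densities with locations $l_i$ and scales $s_i$; this expression is not quoted in the abstract and is brought in here as an assumption, so the code also compares it against a direct numerical integration. All function names are illustrative.

```python
import math

def kl_cauchy(l1, s1, l2, s2):
    """Closed-form KL divergence between Cauchy(l1, s1) and Cauchy(l2, s2)."""
    return math.log(((s1 + s2) ** 2 + (l1 - l2) ** 2) / (4.0 * s1 * s2))

def kl_cauchy_numeric(l1, s1, l2, s2, n=20000):
    """Midpoint-rule check. With x = l1 + s1*tan(theta), p1(x) dx = dtheta / pi."""
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = -math.pi / 2 + (i + 0.5) * h
        x = l1 + s1 * math.tan(theta)
        log_p1 = math.log(s1 / (math.pi * (s1 ** 2 + (x - l1) ** 2)))
        log_p2 = math.log(s2 / (math.pi * (s2 ** 2 + (x - l2) ** 2)))
        total += (log_p1 - log_p2) * h / math.pi
    return total

# Scale densities: only the ratio s2/s1 matters.
print(kl_cauchy(0.0, 1.0, 0.0, 2.0), kl_cauchy(0.0, 3.0, 0.0, 6.0))
# Symmetry in the two arguments.
print(kl_cauchy(1.0, 2.0, -0.5, 0.7), kl_cauchy(-0.5, 0.7, 1.0, 2.0))
# Agreement with numerical integration.
print(kl_cauchy_numeric(1.0, 2.0, -0.5, 0.7))
```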

preprint, 2020, arXiv

Cumulant-free closed-form formulas for some common (dis)similarities between densities of an exponential family

It is well-known that the Bhattacharyya, Hellinger, Kullback-Leibler, $α$-divergences, and Jeffreys' divergences between densities belonging to the same exponential family have generic closed-form formulas relying on the strictly convex and real-analytic cumulant function characterizing the exponential family. In this work, we report (dis)similarity formulas which bypass the explicit use of the cumulant function and highlight the role of quasi-arithmetic means and their multivariate mean operator extensions. In practice, these cumulant-free formulas are handy when implementing these (dis)similarities using legacy Application Programming Interfaces (APIs) since our method requires only partially factorizing the densities of the considered exponential family canonically.

preprint, 2020, arXiv

k-medoids and p-median clustering are solvable in polynomial time for a 2d Pareto front

This paper examines a common extension of k-medoids and k-median clustering in the case of a two-dimensional Pareto front, as generated by bi-objective optimization approaches. A characterization of optimal clusters is provided, which allows solving the optimization problems to optimality in polynomial time using a common dynamic programming algorithm. More precisely, having $N$ points to cluster in $K$ subsets, the algorithm is proven to run in $O(N^3)$ time and $O(KN)$ memory space when $K\geqslant 3$, while the case $K=2$ has a time complexity of $O(N^2)$. Furthermore, the dynamic programming algorithm can be sped up in practice by avoiding useless computations, without improving the complexity. Parallelization issues are also discussed, to speed up the algorithm in practice.

preprint, 2020, arXiv

On Voronoi diagrams and dual Delaunay complexes on the information-geometric Cauchy manifolds

We study the Voronoi diagrams of a finite set of Cauchy distributions and their dual complexes from the viewpoint of information geometry by considering the Fisher-Rao distance, the Kullback-Leibler divergence, the chi square divergence, and a flat divergence derived from Tsallis' quadratic entropy related to the conformal flattening of the Fisher-Rao curved geometry. We prove that the Voronoi diagrams of the Fisher-Rao distance, the chi square divergence, and the Kullback-Leibler divergences all coincide with a hyperbolic Voronoi diagram on the corresponding Cauchy location-scale parameters, and that the dual Cauchy hyperbolic Delaunay complexes are Fisher orthogonal to the Cauchy hyperbolic Voronoi diagrams. The dual Voronoi diagrams with respect to the dual forward/reverse flat divergences amount to dual Bregman Voronoi diagrams, and their dual complexes are regular triangulations. The primal Bregman-Tsallis Voronoi diagram corresponds to the hyperbolic Voronoi diagram and the dual Bregman-Tsallis Voronoi diagram coincides with the ordinary Euclidean Voronoi diagram. Besides, we prove that the square root of the Kullback-Leibler divergence between Cauchy distributions yields a metric distance which is Hilbertian for the Cauchy scale families.

preprint, 2020, arXiv

Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances

Distances between probability distributions that take into account the geometry of their sample space, like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances, have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that allows one to consider a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions. These distances, derived from the concave Rao's quadratic entropy, enjoy nice theoretical properties and possess interpretable hyperparameters which can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances and we illustrate its efficiency on a broad range of machine learning tasks such as density estimation, generative modeling and mixture simplification.

preprint, 2016, arXiv

A series of maximum entropy upper bounds of the differential entropy

We present a series of closed-form maximum entropy upper bounds for the differential entropy of a continuous univariate random variable and study the properties of that series. We then show how to use those generic bounds for upper bounding the differential entropy of Gaussian mixture models. This requires calculating the raw moments and raw absolute moments of Gaussian mixtures in closed form, which may also be handy in statistical machine learning and information theory. We report on our experiments and discuss the tightness of those bounds.
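The simplest maximum entropy bound of this kind is the classical one: among all densities with a given variance, the Gaussian has the largest differential entropy, so $h(X) \le \frac{1}{2}\log(2\pi e\,\mathrm{Var}[X])$. The sketch below applies it to a univariate Gaussian mixture, whose variance follows from its first two raw moments, and compares it with a numerical estimate of the entropy. It illustrates only this one well-known bound, not the series studied in the paper; the function names, the integration range and the example mixture are assumptions.

```python
import math

def gmm_variance(weights, means, stds):
    """Variance of a univariate Gaussian mixture from its first two raw moments."""
    first = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, stds))
    return second - first * first

def gaussian_entropy_upper_bound(weights, means, stds):
    """Entropy (nats) of the Gaussian with the same variance as the mixture."""
    return 0.5 * math.log(2 * math.pi * math.e * gmm_variance(weights, means, stds))

def gmm_pdf(x, weights, means, stds):
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, stds))

def gmm_entropy_numeric(weights, means, stds, lo=-30.0, hi=30.0, n=60000):
    """Midpoint-rule estimate of the differential entropy on [lo, hi] (assumed wide enough)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = gmm_pdf(lo + (i + 0.5) * h, weights, means, stds)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total

# Hypothetical two-component mixture.
w, mu, sigma = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
print(gmm_entropy_numeric(w, mu, sigma), gaussian_entropy_upper_bound(w, mu, sigma))
```

For a single component the bound is attained exactly; for a well-separated mixture it can be loose, which is why a series of sharper bounds is of interest.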

preprint, 2016, arXiv

Clustering Financial Time Series: How Long is Enough?

Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out.

preprint, 2016, arXiv

Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering

We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the cluster centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research.

preprint, 2016, arXiv

Fast $(1+ε)$-approximation of the Löwner extremal matrices of high-dimensional symmetric matrices

Matrix data sets are common nowadays, as in biomedical imaging, where the Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) modality produces data sets of 3D symmetric positive definite matrices anchored at voxel positions capturing the anisotropic diffusion properties of water molecules in biological tissues. The space of symmetric matrices can be partially ordered using the Löwner ordering, and computing extremal matrices dominating a given set of matrices is a basic primitive used in matrix-valued signal processing. In this letter, we design a fast and easy-to-implement iterative algorithm to approximate these extremal matrices arbitrarily finely. Finally, we discuss extensions to matrix clustering.

preprint, 2016, arXiv

HCMapper: An interactive visualization tool to compare partition-based flat clustering extracted from pairs of dendrograms

We describe a new visualization tool, dubbed HCMapper, that visually helps to compare a pair of dendrograms computed on the same dataset by displaying multiscale partition-based layered structures. The dendrograms are obtained by hierarchical clustering techniques whose output reflects some hypothesis on the data and HCMapper is specifically designed to grasp at first glance both whether the two compared hypotheses broadly agree and the data points on which they do not concur. Leveraging juxtaposition and explicit encodings, HCMapper focuses on two selected partitions while displaying coarser ones in context areas for understanding multiscale structure and eventually switching the selected partitions. HCMapper's utility is shown through the example of testing whether the prices of credit default swap financial time series only undergo correlation. This use case is detailed in the supplementary material as well as experiments with code on toy-datasets for reproducible research. HCMapper is currently released as a visualization tool on the DataGrapple time series and clustering analysis platform at www.datagrapple.com.

preprint, 2016, arXiv

Image and Information

A well-known old adage says that "A picture is worth a thousand words!" (attributed to the Chinese philosopher Confucius ca. 500 BC). But more precisely, what do we mean by information in images? And how can it be retrieved effectively by machines? We briefly highlight these puzzling questions in this column. But first of all, let us start by defining more precisely what is meant by an "Image."

preprint, 2016, arXiv

k-variates++: more pluses in the k-means++

k-means++ seeding has become a de facto standard for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, k-variates++, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, and a generalisation of the well known Arthur-Vassilvitskii (AV) approximation guarantee, in the form of a bias+variance approximation bound of the global optimum. This approximation exhibits a reduced dependency on the "noise" component with respect to the optimal potential, actually approaching the statistical lower bound. We show that k-variates++ reduces to efficient (biased seeding) clustering algorithms tailored to specific frameworks; these include distributed, streaming and on-line clustering, with direct approximation results for these algorithms. Finally, we present a novel application of k-variates++ to differential privacy. For either the specific frameworks considered here, or for the differential privacy setting, there are few to no prior results on the direct application of k-means++ and its approximation bounds; state-of-the-art contenders appear to be significantly more complex and/or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is no closed-form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performance vs. the state of the art.
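For reference, the baseline that k-variates++ generalises can be sketched as follows. This is the plain k-means++ seeding (each new seed is drawn with probability proportional to its squared distance to the nearest seed chosen so far), not the k-variates++ procedure itself, which according to the abstract samples from general densities instead of picking data points. The function names and the toy data are assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Plain k-means++ seeding by D^2 sampling over the data points."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]  # first seed: uniform at random
    # d2[i] = squared distance from points[i] to its nearest seed so far
    d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # every point coincides with a seed: fall back to uniform sampling
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(d, squared_distance(p, seeds[-1])) for d, p in zip(d2, points)]
    return seeds

# Hypothetical toy data: two tight groups far apart.
data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
print(kmeanspp_seeds(data, 2, random.Random(0)))
```

Because far-away points receive most of the sampling mass, the second seed is very likely to land in the group that the first seed missed.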

preprint, 2016, arXiv

Large Margin Nearest Neighbor Classification using Curved Mahalanobis Distances

We consider the supervised classification problem of machine learning in Cayley-Klein projective geometries: We show how to learn a curved Mahalanobis metric distance corresponding to either the hyperbolic geometry or the elliptic geometry using the Large Margin Nearest Neighbor (LMNN) framework. We report on our experimental results, and further consider the case of learning a mixed curved Mahalanobis distance. Besides, we show that the Cayley-Klein Voronoi diagrams are affine, and can be built from equivalent (clipped) power diagrams, and that Cayley-Klein balls have Mahalanobis shapes with displaced centers.

preprint, 2016, arXiv

Loss factorization, weakly supervised learning and label noise robustness

We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation. Factorization has a direct application on weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (by the mean operator) form of noise robustness, in contrast with known negative results.

preprint, 2016, arXiv

On clustering financial time series: a need for distances between dependent random variables

The following working document summarizes our work on the clustering of financial time series. It was written for a workshop on information geometry and its application for image and signal processing. This workshop brought several experts in pure and applied mathematics together with applied researchers from medical imaging, radar signal processing and finance. The authors belong to the latter group. This document was written as a long introduction to further development of geometric tools in financial applications such as risk or portfolio analysis. Indeed, risk and portfolio analysis essentially rely on covariance matrices. Besides the fact that the Gaussian assumption is known to be inaccurate, covariance matrices are difficult to estimate from empirical data. To filter noise from the empirical estimate, Mantegna proposed using hierarchical clustering. In this work, we first show that this procedure is statistically consistent. Then, we propose to use clustering with a much broader application than the filtering of empirical covariance matrices from the estimated correlation coefficients. To be able to do that, we need to obtain distances between the financial time series that incorporate all the available information in these cross-dependent random processes.

preprint, 2016, arXiv

Optimal Copula Transport for Clustering Multivariate Time Series

This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.

preprint, 2016, arXiv

Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables. To take fully into account the dependence information while clustering, we need a distance between copulas. In this work, we compare renowned distances between distributions: the Fisher-Rao geodesic distance, related divergences and optimal transport, and discuss their advantages and disadvantages. Applications of such methodology can be found in the clustering of financial assets. A tutorial, experiments and implementation for reproducible research can be found at www.datagrapple.com/Tech.

preprint, 2016, arXiv

Relative Natural Gradient for Learning Large Complex Models

Fisher information and natural gradient provided deep insights and powerful tools to artificial neural networks. However, related analysis becomes more and more difficult as the learner's structure turns large and complex. This paper makes a preliminary step towards a new direction. We extract a local component of a large neuron system, and define its relative Fisher information metric that describes accurately this small component, and is invariant to the other parts of the system. This concept is important because the geometric structure is much simplified and it can be easily applied to guide the learning of neural networks. We provide an analysis on a list of commonly used components, and demonstrate how to use this concept to further improve optimization.

preprint, 2016, arXiv

Tsallis Regularized Optimal Transport and Ecological Inference

Optimal transport is a powerful framework for computing distances between probability distributions. We unify the two main approaches to optimal transport, namely Monge-Kantorovitch and Sinkhorn-Cuturi, into what we define as Tsallis regularized optimal transport (TROT). TROT interpolates a rich family of distortions from Wasserstein to Kullback-Leibler, encompassing as well Pearson, Neyman and Hellinger divergences, to name a few. We show that metric properties known for Sinkhorn-Cuturi generalize to TROT, and provide efficient algorithms for finding the optimal transportation plan with formal convergence proofs. We also present the first application of optimal transport to the problem of ecological inference, that is, the reconstruction of joint distributions from their marginals, a problem of large interest in the social sciences. TROT provides a convenient framework for ecological inference by allowing one to compute the joint distribution (that is, the optimal transportation plan itself) when side information is available, which is typically what, e.g., a census represents in political science. Experiments on data from the 2012 US presidential elections display the potential of TROT in delivering a faithful reconstruction of the joint distribution of ethnic groups and voter preferences.

preprint, 2015, arXiv

A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series

We present in this paper an empirical framework motivated by the practitioner's point of view on stability. The goal is both to assess clustering validity and to yield market insights by providing, through the data perturbations we propose, a multi-view of the assets' clustering behaviour. The perturbation framework is illustrated on an extensive credit default swap time series database available online at www.datagrapple.com.

preprint, 2015, arXiv

Comment partitionner automatiquement des marches aléatoires ? Avec application à la finance quantitative (How to automatically cluster random walks? With application to quantitative finance)

We present in this paper a novel non-parametric approach useful for clustering Markov processes. We introduce a pre-processing step consisting in mapping multivariate independent and identically distributed samples from random variables to a generic non-parametric representation which factorizes dependency and marginal distribution apart without losing any information. An associated metric is defined where the balance between random variables dependency and distribution information is controlled by a single parameter. This mixing parameter can be learned or tuned by a practitioner; such use is illustrated on the case of clustering financial time series. Experiments, implementation and results obtained on public financial time series are online on a web portal at http://www.datagrapple.com.

preprint, 2015, arXiv

On conformal divergences and their population minimizers

Total Bregman divergences are a recent tweak of ordinary Bregman divergences originally motivated by applications that required invariance by rotations. They have displayed superior results compared to ordinary Bregman divergences on several clustering, computer vision, medical imaging and machine learning tasks. These preliminary results raise two important problems: First, report a complete characterization of the left and right population minimizers for this class of total Bregman divergences. Second, characterize a principled superset of total and ordinary Bregman divergences with good clustering properties, from which one could tailor the choice of a divergence to a particular application. In this paper, we provide and study one such superset with interesting geometric features, that we call conformal divergences, and focus on their left and right population minimizers. Our results are obtained in a recently coined $(u, v)$-geometric structure that is a generalization of the dually flat affine connections in information geometry. We characterize both analytically and geometrically the population minimizers. We prove that conformal divergences (resp. total Bregman divergences) are essentially exhaustive for their left (resp. right) population minimizers. We further report new results and extend previous results on the robustness to outliers of the left and right population minimizers, and discuss the role of the $(u, v)$-geometric structure in clustering. Additional results are also given.

preprint, 2014, arXiv

Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to occur increasingly often when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow partially re-seeding some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over Hartigan's heuristic, which is the de facto method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.

preprint2014arXiv

Further results on the hyperbolic Voronoi diagrams

In Euclidean geometry, it is well known that the $k$-order Voronoi diagram in $\mathbb{R}^d$ can be computed from the vertical projection of the $k$-level of an arrangement of hyperplanes tangent to a convex potential function in $\mathbb{R}^{d+1}$: the paraboloid. Similarly, we report for the Klein ball model of hyperbolic geometry such a concave potential function: the northern hemisphere. Furthermore, we also show how to build the hyperbolic $k$-order diagrams as equivalent clipped power diagrams in $\mathbb{R}^d$. We investigate the hyperbolic Voronoi diagram in the hyperboloid model and show how it reduces to a Klein-type model using central projections.

preprint2014arXiv

On the symmetrical Kullback-Leibler Jeffreys centroids

Due to the success of the bag-of-words modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. In applications, it is usually required to deal with symmetric distances. In this letter, we consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We first prove that the Jeffreys centroid can be expressed analytically using the Lambert $W$ function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms. Finally, we conclude with some remarks on the $k$-means histogram clustering.
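The Lambert $W$ expression can be sketched as follows. Assuming the Jeffreys divergence $J(p,q)=\sum_i (p^i-q^i)\log(p^i/q^i)$ and positive (unnormalized) histograms, setting the derivative of the weighted average divergence to zero in each bin gives $c^i = a^i / W(e\,a^i/g^i)$, where $a^i$ and $g^i$ are the weighted arithmetic and geometric means of bin $i$. The helper names below are hypothetical, and the Newton solver for $W$ is a stand-in written for this sketch, not the paper's implementation.

```python
import math
from typing import List, Optional, Sequence


def lambert_w0(x: float, tol: float = 1e-12, max_iter: int = 100) -> float:
    """Principal branch of Lambert W for x >= 0, solving w * exp(w) = x by Newton."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    if x == 0:
        return 0.0
    w = math.log1p(x)  # starts at or above the root, so Newton decreases monotonically
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * max(1.0, abs(w)):
            break
    return w


def jeffreys_positive_centroid(
    histograms: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Bin-wise Jeffreys centroid of positive histograms: a / W(e * a / g)."""
    n = len(histograms)
    if weights is None:
        weights = [1.0 / n] * n
    centroid = []
    for i in range(len(histograms[0])):
        # Weighted arithmetic mean and weighted geometric mean of bin i.
        a = sum(w * h[i] for w, h in zip(weights, histograms))
        g = math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, histograms)))
        centroid.append(a / lambert_w0(math.e * a / g))
    return centroid
```

Each centroid bin lies between the geometric and arithmetic means of that bin. The result is generally not normalized, which is consistent with the abstract treating frequency histograms separately through an approximation.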

preprint2014arXiv

Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

We present a generic dynamic programming method to compute the optimal clustering of $n$ scalar elements into $k$ pairwise disjoint intervals. This case includes 1D Euclidean $k$-means, $k$-medoids, $k$-medians, $k$-centers, etc. We extend the method to incorporate cluster size constraints and show how to choose the appropriate $k$ by model selection. Finally, we illustrate and refine the method on two case studies: Bregman clustering and statistical mixture learning maximizing the complete likelihood.
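A minimal sketch of such a dynamic program is given below for the 1D Euclidean $k$-means case only. It assumes a plain $O(n^2 k)$ table over sorted values with prefix sums for the interval cost; the function name `optimal_interval_kmeans` is hypothetical, and the size constraints and model selection mentioned in the abstract are not covered.

```python
from typing import List, Sequence, Tuple


def optimal_interval_kmeans(values: Sequence[float], k: int) -> Tuple[float, List[List[float]]]:
    """Optimal 1D k-means by dynamic programming over contiguous intervals."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of values")

    # Prefix sums of x and x^2 give each interval cost in constant time.
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for xs[i:j], with j > i."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[m][j]: optimal cost of splitting the first j values into m intervals.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    cut[m][j] = i

    # Backtrack the interval boundaries from the last interval to the first.
    clusters: List[List[float]] = []
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return best[k][n], clusters
```

Swapping `cost` for another interval cost (for instance a $k$-medians or Bregman cost) leaves the table recurrence unchanged, which is the sense in which the method is generic.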

preprint2013arXiv

Logging safely in public spaces using color PINs

Nowadays, we increasingly log in to many different Internet sites to access private data, like emails or photos, stored remotely in the cloud. This makes us all the more concerned with digital identity theft and passwords being stolen either by key loggers or shoulder-surfing attacks. Quite surprisingly, the current bottleneck of computer security when logging in for authentication is the User Interface (UI): How can we safely enter secret passwords when concealed spy cameras or key loggers may be recording the login session? Logging in safely requires designing a secure Human Computer Interface (HCI) robust to those attacks. We describe a novel method and system based on entering secret ID passwords by means of associative secret UI passwords that provide zero knowledge to observers. We demonstrate the principles using a color Personal Identification Number (PIN) login system and describe its various extensions.

preprint2013arXiv

On the Chi square and higher-order Chi distances for approximating f-divergences

We report closed-form formulas for calculating the Chi square and higher-order Chi distances between statistical distributions belonging to the same exponential family with affine natural space, and instantiate those formulas for the Poisson and isotropic Gaussian families. We then describe an analytic formula for the $f$-divergences based on Taylor expansions and relying on an extended class of Chi-type distances.
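The Poisson case can be checked numerically with a short sketch. It assumes the convention $\chi^2(p_1:p_2)=\sum_x (p_1(x)-p_2(x))^2/p_2(x)$, which may differ in orientation from the paper's notation; under that convention, summing the series gives $\exp((\lambda_1-\lambda_2)^2/\lambda_2)-1$. The function names are hypothetical.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability mass, evaluated in log space for stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def chi2_poisson_closed_form(lam1: float, lam2: float) -> float:
    """Closed form of sum_x (p1 - p2)^2 / p2 for two Poisson distributions."""
    return math.exp((lam1 - lam2) ** 2 / lam2) - 1.0


def chi2_poisson_series(lam1: float, lam2: float, terms: int = 100) -> float:
    """Truncated series for the same quantity, used only as a numerical check."""
    total = 0.0
    for x in range(terms):
        p1 = poisson_pmf(x, lam1)
        p2 = poisson_pmf(x, lam2)
        total += (p1 - p2) ** 2 / p2
    return total
```

With small rates such as `lam1=2.0` and `lam2=3.0`, the truncated series and the closed form agree to many digits; the fixed truncation at 100 terms is only suitable for rates of that order.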

preprint2012arXiv

$k$-MLE: A fast algorithm for learning statistical mixture models

We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data to the most likely weighted component and updates the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of $k$-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be implemented using any $k$-means heuristic like the celebrated Lloyd's batched or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion, which amounts to setting the weights to the relative proportions of cluster points, and reiterate the mixture parameter update and mixture weight update processes until convergence. Hard EM is interpreted as a special case of $k$-MLE when both the component update and the weight update are performed successively in the inner loop. To initialize $k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE guaranteeing probabilistically a global bound on the best possible complete likelihood.
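The two nested loops described above can be sketched for a univariate Gaussian mixture. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses Lloyd-style batched assignment in the inner loop, keeps the previous parameters of a component that becomes empty, and clamps variances with a hypothetical `var_floor` to avoid degenerate single-point components. The $k$-MLE++ initialization is not included, and all names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def log_weighted_density(x: float, w: float, mu: float, var: float) -> float:
    """log(w * N(x; mu, var)) for a univariate Gaussian component."""
    return math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)


def k_mle_1d(
    data: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
    weights: Sequence[float],
    max_outer: int = 50,
    max_inner: int = 50,
    var_floor: float = 1e-6,
) -> Tuple[List[float], List[float], List[float], List[int]]:
    """Hard-clustering sketch: inner loop with fixed weights, outer weight update."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    labels: Optional[List[int]] = None
    for _ in range(max_outer):
        # Inner loop: assign each point to its most likely weighted component,
        # then refit each component by maximum likelihood, with weights fixed.
        for _ in range(max_inner):
            new_labels = []
            for x in data:
                scores = [log_weighted_density(x, weights[j], means[j], variances[j]) for j in range(k)]
                new_labels.append(scores.index(max(scores)))
            if new_labels == labels:
                break
            labels = new_labels
            for j in range(k):
                cluster = [x for x, lab in zip(data, labels) if lab == j]
                if not cluster:
                    continue  # assumption: an empty component keeps its old parameters
                mu = sum(cluster) / len(cluster)
                means[j] = mu
                variances[j] = max(sum((x - mu) ** 2 for x in cluster) / len(cluster), var_floor)
        # Outer step: weights become the relative proportions of cluster points.
        counts = [labels.count(j) for j in range(k)]
        new_weights = [counts[j] / n if counts[j] > 0 else weights[j] for j in range(k)]
        if new_weights == weights:
            break
        weights = new_weights
    return means, variances, weights, labels
```

Moving the weight update inside the inner loop would give the hard EM variant that the abstract describes as a special case.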

preprint2012arXiv

The Burbea-Rao and Bhattacharyya centroids

We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metrics since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences called Jensen-Bregman distances yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to computing Bregman divergences. We then prove that Burbea-Rao centroids are unique, and can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with guaranteed convergence property. In the second part of the paper, we consider the Bhattacharyya distance that is commonly used to measure the degree of overlap of probability distributions. We show that Bhattacharyya distances on members of the same statistical exponential family amount to calculating a Burbea-Rao divergence in disguise. Thus we get an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for $k$-means and hierarchical clustering methods of Gaussian mixture models.
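The link between the two distances can be illustrated on the Poisson family, chosen here only as a simple example. For an exponential family with log-normalizer $F$ and natural parameter $\theta$, the Bhattacharyya distance equals the Burbea-Rao divergence $\frac{F(\theta_1)+F(\theta_2)}{2}-F\left(\frac{\theta_1+\theta_2}{2}\right)$; for Poisson distributions, $\theta=\log\lambda$ and $F(\theta)=e^{\theta}$. The function names below are hypothetical, and the direct series is included only as a numerical check.

```python
import math
from typing import Callable


def burbea_rao(F: Callable[[float], float], theta1: float, theta2: float) -> float:
    """Jensen difference of a convex generator F at two scalar parameters."""
    return 0.5 * (F(theta1) + F(theta2)) - F(0.5 * (theta1 + theta2))


def bhattacharyya_poisson(lam1: float, lam2: float) -> float:
    """Bhattacharyya distance between Poisson laws via their natural parameters."""
    return burbea_rao(math.exp, math.log(lam1), math.log(lam2))


def bhattacharyya_poisson_series(lam1: float, lam2: float, terms: int = 100) -> float:
    """Direct evaluation of -log sum_x sqrt(p1(x) * p2(x)), truncated."""
    coefficient = 0.0
    for x in range(terms):
        log_p1 = -lam1 + x * math.log(lam1) - math.lgamma(x + 1)
        log_p2 = -lam2 + x * math.log(lam2) - math.lgamma(x + 1)
        coefficient += math.exp(0.5 * (log_p1 + log_p2))
    return -math.log(coefficient)
```

For Poisson rates the expression simplifies to $(\lambda_1+\lambda_2)/2-\sqrt{\lambda_1\lambda_2}$, the gap between the arithmetic and geometric means of the rates.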

preprint2011arXiv

A family of statistical symmetric divergences based on Jensen's inequality

We introduce a novel parametric family of symmetric information-theoretic distances based on Jensen's inequality for a convex functional generator. In particular, this family unifies the celebrated Jeffreys divergence with the Jensen-Shannon divergence when the Shannon entropy generator is chosen. We then design a generic algorithm to compute the unique centroid defined as the minimum average divergence. This yields a smooth family of centroids linking the Jeffreys to the Jensen-Shannon centroid. Finally, we report on our experimental results.

preprint2011arXiv

On Rényi and Tsallis entropies and divergences for exponential families

Many common probability distributions in statistics, like the Gaussian, multinomial, Beta or Gamma distributions, can be studied under the unified framework of exponential families. In this paper, we prove that both Rényi and Tsallis divergences of distributions belonging to the same exponential family admit a generic closed-form expression. Furthermore, we show that Rényi and Tsallis entropies can also be calculated in closed form for sub-families including the Gaussian or exponential distributions, among others.

preprint2011arXiv

Statistical exponential families: A digest with flash cards

This document concisely describes the ubiquitous class of exponential family distributions met in statistics. The first part recalls definitions and summarizes the main properties and the duality with Bregman divergences (all proofs are skipped). The second part lists decompositions and related formulas of common exponential family distributions. We recall the Fisher-Rao-Riemannian geometries and the dual affine connection information geometries of statistical manifolds. We intend to maintain and update this document and catalog by adding new distribution items.

preprint2010arXiv

Boosting k-NN for categorization of natural scenes

The k-nearest neighbors (k-NN) classification rule has proven extremely successful in countless computer vision applications. For example, image categorization often relies on uniform voting among the nearest prototypes in the space of descriptors. In spite of its good properties, the classic k-NN rule suffers from high variance when dealing with sparse prototype datasets in high dimensions. A few techniques have been proposed to improve k-NN classification, which rely on either deforming the nearest neighborhood relationship or modifying the input space. In this paper, we propose a novel boosting algorithm, called UNN (Universal Nearest Neighbors), which induces leveraged k-NN, thus generalizing the classic k-NN rule. We redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Weak classifiers are learned by UNN so as to minimize a surrogate risk. A major feature of UNN is the ability to learn which prototypes are the most relevant for a given class, thus allowing for effective data reduction. Experimental results on the synthetic two-class dataset of Ripley show that such a filtering strategy is able to reject "noisy" prototypes. We carried out image categorization experiments on a database containing eight classes of natural scenes. We show that our method significantly outperforms the classic k-NN classification, while enabling a significant reduction of the computational cost by means of data filtering.

preprint2010arXiv

Video Stippling

In this paper, we consider rendering color videos using a non-photo-realistic art form technique commonly called stippling. Stippling is the art of rendering images using point sets, possibly with various attributes like sizes, elementary shapes, and colors. Producing nice stippling is attractive not only for the sake of image depiction but also because it yields a compact vectorial format for storing the semantic information of media. Moreover, stippling is by construction easily tunable to various device resolutions without suffering from bitmap sampling artifacts when resizing. The underlying core technique for stippling images is to compute a centroidal Voronoi tessellation on a well-designed underlying density. This density relates to the image content, and is used to compute a weighted Voronoi diagram. By considering videos as image sequences and properly initializing the stippling of one image by the result of its predecessor, one avoids undesirable point flickering artifacts and can produce stippled videos that nevertheless still exhibit noticeable artifacts. To overcome this, our method improves over the naive scheme by considering dynamic point creation and deletion according to the current scene semantic complexity, and we show how to effectively vectorize video while adjusting for both color and contrast characteristics. Furthermore, we explain how to produce high quality stippled "videos" (e.g., fully dynamic spatio-temporal point sets) for media containing various fading effects, like quick motions of objects or progressive shot changes. We report on the practical performance of our implementation, and present several stippled video results rendered on-the-fly using our viewer, which allows spatio-temporal dynamic rescaling (e.g., vectorial upscaling of the frame rate).

43 published item(s)

A note on Onicescu's informational energy and correlation coefficient in exponential families

A note on some information-theoretic divergences between Zeta distributions

On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models

The analytic dually flat space of the mixture family of two prescribed distinct Cauchy distributions

Tractable structured natural gradient descent using local parameterizations

Likelihood Ratio Exponential Families

On information projections between multivariate elliptical and location-scale families

On the Kullback-Leibler divergence between discrete normal distributions

On the Kullback-Leibler divergence between location-scale densities

Cumulant-free closed-form formulas for some common (dis)similarities between densities of an exponential family

k-medoids and p-median clustering are solvable in polynomial time for a 2d Pareto front

On Voronoi diagrams and dual Delaunay complexes on the information-geometric Cauchy manifolds

Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances

A series of maximum entropy upper bounds of the differential entropy

Clustering Financial Time Series: How Long is Enough?

Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering

Fast $(1+ε)$-approximation of the Löwner extremal matrices of high-dimensional symmetric matrices

HCMapper: An interactive visualization tool to compare partition-based flat clustering extracted from pairs of dendrograms

Image and Information

k-variates++: more pluses in the k-means++

Large Margin Nearest Neighbor Classification using Curved Mahalanobis Distances

Loss factorization, weakly supervised learning and label noise robustness

On clustering financial time series: a need for distances between dependent random variables

Optimal Copula Transport for Clustering Multivariate Time Series

Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

Relative Natural Gradient for Learning Large Complex Models

Tsallis Regularized Optimal Transport and Ecological Inference

A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series

Comment partitionner automatiquement des marches aléatoires ? Avec application à la finance quantitative

On conformal divergences and their population minimizers

Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

Further results on the hyperbolic Voronoi diagrams

On the symmetrical Kullback-Leibler Jeffreys centroids

Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Logging safely in public spaces using color PINs

On the Chi square and higher-order Chi distances for approximating f-divergences

$k$-MLE: A fast algorithm for learning statistical mixture models

The Burbea-Rao and Bhattacharyya centroids

A family of statistical symmetric divergences based on Jensen's inequality

On Rényi and Tsallis entropies and divergences for exponential families

Statistical exponential families: A digest with flash cards

Boosting k-NN for categorization of natural scenes

Video Stippling