Source author record

Vittorio Erba

Vittorio Erba appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

cond-mat.dis-nn cond-mat.stat-mech Machine Learning

Catalog footprint

What is connected

3works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Intrinsic dimension estimation for locally undersampled data

High-dimensional data are ubiquitous in contemporary science and finding methods to compress them is one of the primary goals of machine learning. Given a dataset lying in a high-dimensional space (in principle hundreds to several thousands of dimensions), it is often useful to project it onto a lower-dimensional manifold, without loss of information. Identifying the minimal dimension of such manifold is a challenging problem known in the literature as intrinsic dimension estimation (IDE). Traditionally, most IDE algorithms are either based on multiscale principal component analysis (PCA) or on the notion of correlation dimension (and more in general on k-nearest-neighbors distances). These methods are affected, in different ways, by a severe curse of dimensionality. In particular, none of the existing algorithms can provide accurate ID estimates in the extreme locally undersampled regime, i.e. in the limit where the number of samples in any local patch of the manifold is less than (or of the same order of) the ID of the dataset. Here we introduce a new ID estimator that leverages on simple properties of the tangent space of a manifold to overcome these shortcomings. The method is based on the full correlation integral, going beyond the limit of small radius used for the estimation of the correlation dimension. Our estimator alleviates the extreme undersampling problem, intractable with other methods. Based on this insight, we explore a multiscale generalization of the algorithm. We show that it is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the ID of extremely curved manifolds. In particular, we test the method on manifolds generated from global transformations of high-contrast images, relevant for invariant object recognition and considered a challenge for state-of-the-art ID estimators.

preprint2020arXiv

Random geometric graphs in high dimension

Many machine learning algorithms used for dimensional reduction and manifold learning leverage on the computation of the nearest neighbours to each point of a dataset to perform their tasks. These proximity relations define a so-called geometric graph, where two nodes are linked if they are sufficiently close to each other. Random geometric graphs, where the positions of nodes are randomly generated in a subset of $\mathbb{R}^{d}$, offer a null model to study typical properties of datasets and of machine learning algorithms. Up to now, most of the literature focused on the characterization of low-dimensional random geometric graphs whereas typical datasets of interest in machine learning live in high-dimensional spaces ($d \gg 10^{2}$). In this work, we consider the infinite dimensions limit of hard and soft random geometric graphs and we show how to compute the average number of subgraphs of given finite size $k$, e.g. the average number of $k$-cliques. This analysis highlights that local observables display different behaviors depending on the chosen ensemble: soft random geometric graphs with continuous activation functions converge to the naive infinite dimensional limit provided by Erdös-Rényi graphs, whereas hard random geometric graphs can show systematic deviations from it. We present numerical evidence that our analytical insights, exact in infinite dimensions, provide a good approximation also for dimension $d\gtrsim10$.

preprint2020arXiv

The Dyck bound in the concave 1-dimensional random assignment model

We consider models of assignment for random $N$ blue points and $N$ red points on an interval of length $2N$, in which the cost for connecting a blue point in $x$ to a red point in $y$ is the concave function $|x-y|^p$, for $0<p<1$. Contrarily to the convex case $p>1$, where the optimal matching is trivially determined, here the optimization is non-trivial. The purpose of this paper is to introduce a special configuration, that we call the \emph{Dyck matching}, and to study its statistical properties. We compute exactly the average cost, in the asymptotic limit of large $N$, together with the first subleading correction. The scaling is remarkable: it is of order $N$ for $p<\frac{1}{2}$, order $N \ln N$ for $p=\frac{1}{2}$, and $N^{\frac{1}{2}+p}$ for $p>\frac{1}{2}$, and it is universal for a wide class of models. We conjecture that the average cost of the Dyck matching has the same scaling in $N$ as the cost of the optimal matching, and we produce numerical data in support of this conjecture. We hope to produce a proof of this claim in future work.

Vittorio Erba

What is connected

Connect this record

See the researcher in context

Building this map preview

3 published item(s)

Intrinsic dimension estimation for locally undersampled data

Random geometric graphs in high dimension

The Dyck bound in the concave 1-dimensional random assignment model