Source author record

Alberto Bernacchia

Alberto Bernacchia appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Machine Learning, Applications, Artificial Intelligence, cond-mat.dis-nn, math-ph, math.MP, math.ST, Methodology, Neurons and Cognition, physics.data-an, physics.geo-ph, Statistics Theory

Catalog footprint

What is connected

8 works

12 topics

4 close collaborators


Preprint, 2022, arXiv

How to distribute data across tasks for meta-learning?

Meta-learning models transfer the knowledge acquired from previous tasks to quickly learn new ones. They are trained on benchmarks with a fixed number of data points per task. This number is usually arbitrary and it is unknown how it affects performance at testing. Since labelling of data is expensive, finding the optimal allocation of labels across training tasks may reduce costs. Given a fixed budget of labels, should we use a small number of highly labelled tasks, or many tasks with few labels each? Should we allocate more labels to some tasks and fewer to others? We show that: 1) If tasks are homogeneous, there is a uniform optimal allocation, whereby all tasks get the same amount of data; 2) At fixed budget, there is a trade-off between number of tasks and number of data points per task, with a unique solution for the optimum; 3) When trained separately, harder tasks should get more data, at the cost of a smaller number of tasks; 4) When training on a mixture of easy and hard tasks, more data should be allocated to easy tasks. Interestingly, neuroscience experiments have shown that human visual skills also transfer better from easy tasks. We prove these results mathematically on mixed linear regression, and we show empirically that the same results hold for few-shot image classification on CIFAR-FS and mini-ImageNet. Our results provide guidance for allocating labels across tasks when collecting data for meta-learning.

Preprint, 2022, arXiv

Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning

Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications for regression and optimization. It is well known that a major downside for kernel-based models is the high computational cost; given a dataset of $n$ samples, the cost grows as $\mathcal{O}(n^3)$. Existing sparse approximation methods can yield a significant reduction in the computational cost, effectively reducing the actual cost down to as low as $\mathcal{O}(n)$ in certain cases. Despite this remarkable empirical success, significant gaps remain in the existing results for the analytical bounds on the error due to approximation. In this work, we provide novel confidence intervals for the Nyström method and the sparse variational Gaussian process approximation method, which we establish using novel interpretations of the approximate (surrogate) posterior variance of the models. Our confidence intervals lead to improved performance bounds in both regression and optimization problems.
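For readers unfamiliar with the Nyström method named in the abstract above, the following is a minimal sketch of the standard Nyström approximation applied to kernel ridge regression. It illustrates only the cost reduction from solving an $n \times n$ system to solving an $m \times m$ system; it does not implement the paper's confidence intervals. The RBF kernel, the one-dimensional inputs, and all function and parameter names (`rbf_kernel`, `krr_exact`, `krr_nystrom`, `lam`, `length_scale`, `x_inducing`) are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """RBF kernel matrix between 1-D input arrays a and b (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)


def krr_exact(x_train, y_train, x_test, lam, length_scale):
    """Exact kernel ridge regression: solves an n x n linear system."""
    K = rbf_kernel(x_train, x_train, length_scale)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    return rbf_kernel(x_test, x_train, length_scale) @ alpha


def krr_nystrom(x_train, y_train, x_test, x_inducing, lam, length_scale):
    """Nystrom kernel ridge regression with m inducing points.

    Replaces K by K_nm K_mm^{-1} K_mn, which reduces the prediction to
    k_m(x)^T (K_mn K_nm + lam * K_mm)^{-1} K_mn y, an m x m solve.
    """
    K_mm = rbf_kernel(x_inducing, x_inducing, length_scale)
    K_mn = rbf_kernel(x_inducing, x_train, length_scale)
    A = K_mn @ K_mn.T + lam * K_mm
    beta = np.linalg.solve(A, K_mn @ y_train)
    return rbf_kernel(x_test, x_inducing, length_scale) @ beta


# Illustrative data: 30 training points, every third one used as inducing point.
x_train = np.linspace(0.0, 3.0, 30)
y_train = np.sin(x_train)
x_test = np.linspace(0.1, 2.9, 7)
pred_exact = krr_exact(x_train, y_train, x_test, lam=0.1, length_scale=0.1)
pred_sparse = krr_nystrom(x_train, y_train, x_test, x_train[::3],
                          lam=0.1, length_scale=0.1)
```

When the inducing points coincide with the full training set, the Nyström prediction reduces to the exact one; with fewer inducing points, forming and solving the system costs on the order of $n m^2$ operations rather than $n^3$.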

Preprint, 2021, arXiv

Meta-Learning with MAML on Trees

In meta-learning, the knowledge learned from previous tasks is transferred to new ones, but this transfer only works if tasks are related. Sharing information between unrelated tasks might hurt performance, and it is unclear how to transfer knowledge across tasks with a hierarchical structure. We extend model-agnostic meta-learning (MAML) by exploiting hierarchical task relationships. Our algorithm, TreeMAML, adapts the model to each task with a few gradient steps, but the adaptation follows the hierarchical tree structure: in each step, gradients are pooled across task clusters, and subsequent steps follow down the tree. We also implement a clustering algorithm that generates the task tree without previous knowledge of the task structure, allowing us to make use of implicit relationships between the tasks. We show that TreeMAML performs better than MAML on synthetic experiments when the task structure is hierarchical. To study the method on real-world data, we apply it to Natural Language Understanding, using the algorithm to fine-tune language models along the language phylogenetic tree. We show that TreeMAML improves on state-of-the-art results for cross-lingual Natural Language Inference. This result is useful, since most languages in the world are under-resourced and improved cross-lingual transfer allows the internationalization of NLP models. These results open the door to using this algorithm on other real-world hierarchical datasets.
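The tree-structured adaptation described in the abstract above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: it assumes scalar parameters, a quadratic loss per task, a task tree supplied by hand, and it omits the outer meta-update and the clustering step entirely. The names `tree_adapt`, `targets`, and `levels` are hypothetical.

```python
def tree_adapt(theta0, targets, levels, lr):
    """Adapt one scalar parameter per task, pooling gradients down a tree.

    targets: per-task optimum c_t of the assumed loss 0.5 * (theta - c_t)**2.
    levels:  list of tree levels from root to leaves; each level is a list
             of clusters, and each cluster is a list of task indices.
    """
    thetas = [theta0] * len(targets)
    for clusters in levels:
        # Gradient of the assumed quadratic loss for every task.
        grads = [theta - c for theta, c in zip(thetas, targets)]
        updated = list(thetas)
        for cluster in clusters:
            # Pool (average) gradients across the tasks in this cluster.
            pooled = sum(grads[t] for t in cluster) / len(cluster)
            for t in cluster:
                updated[t] = thetas[t] - lr * pooled
        thetas = updated
    return thetas


# Four tasks forming two pairs of related tasks.
targets = [0.0, 2.0, 10.0, 12.0]
levels = [
    [[0, 1, 2, 3]],        # root: all tasks share one pooled gradient
    [[0, 1], [2, 3]],      # middle: gradients pooled within each pair
    [[0], [1], [2], [3]],  # leaves: each task follows its own gradient
]
adapted = tree_adapt(theta0=0.0, targets=targets, levels=levels, lr=1.0)
print(adapted)  # [0.0, 2.0, 10.0, 12.0]
```

With these settings, the first step moves every task to the global mean (6.0), the second moves each pair to its own mean (1.0 and 11.0), and the last step reaches each task's optimum, so information is shared only between tasks that sit in the same cluster at a given depth.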

Preprint, 2012, arXiv

Decorrelation by recurrent inhibition in heterogeneous neural circuits

The activity of neurons is correlated, and this correlation affects how the brain processes information. We study the neural circuit mechanisms of correlations by analyzing a network model characterized by strong and heterogeneous interactions: excitatory input drives the fluctuations of neural activity, which are counterbalanced by inhibitory feedback. In particular, excitatory input tends to correlate neurons, while inhibitory feedback reduces correlations. We demonstrate that heterogeneity of synaptic connections is necessary for this inhibition of correlations. We calculate statistical averages over the disordered synaptic interactions, and we apply our findings to both a simple linear model and to a more realistic spiking network model. We find that correlations at zero time-lag are positive and of magnitude K^{-1/2}, where K is the number of connections to a neuron. Correlations at longer timescales are of smaller magnitude, of order K^{-1}, implying that inhibition of correlations occurs quickly, on a timescale of K^{-1/2}. The small magnitude of correlations agrees qualitatively with physiological measurements in the Cerebral Cortex and Basal Ganglia. The model could be used to study correlations in brain regions dominated by recurrent inhibition, such as the Striatum and Globus Pallidus.

Preprint, 2012, arXiv

On the equivalence of Hopfield Networks and Boltzmann Machines

A specific type of neural network, the Restricted Boltzmann Machine (RBM), is used for classification and feature detection in machine learning. An RBM is characterized by separate layers of visible and hidden units, which are able to efficiently learn a generative model of the observed data. We study a "hybrid" version of RBMs, in which hidden units are analog and visible units are binary, and we show that the thermodynamics of visible units are equivalent to those of a Hopfield network, in which the N visible units are the neurons and the P hidden units are the learned patterns. We apply the method of stochastic stability to derive the thermodynamics of the model, by considering a formal extension of this technique to the case of multiple sets of stored patterns, which may act as a benchmark for the study of correlated sets. Our results imply that simulating the dynamics of a Hopfield network, requiring the update of N neurons and the storage of N(N-1)/2 synapses, can be accomplished by a hybrid Boltzmann Machine, requiring the update of N+P neurons but the storage of only NP synapses. In addition, the well-known glass transition of the Hopfield network has a counterpart in the Boltzmann Machine: it corresponds to an optimum criterion for selecting the relative sizes of the hidden and visible layers, resolving the trade-off between flexibility and generality of the model. The low storage phase of the Hopfield model corresponds to few hidden units and hence an overly constrained RBM, while the spin-glass phase (too many hidden units) corresponds to an unconstrained RBM prone to overfitting of the observed data.
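The storage comparison stated in the abstract above is simple arithmetic, and a short sketch makes the saving concrete. The function names and the example sizes (N = 1000, P = 50) are illustrative assumptions; only the two counts, N(N-1)/2 and NP, come from the abstract.

```python
def hopfield_synapses(n_visible: int) -> int:
    """Symmetric couplings among N neurons without self-coupling: N(N-1)/2."""
    return n_visible * (n_visible - 1) // 2


def hybrid_rbm_synapses(n_visible: int, n_hidden: int) -> int:
    """One weight per visible-hidden pair: N * P."""
    return n_visible * n_hidden


def rbm_saves_storage(n_visible: int, n_hidden: int) -> bool:
    """True when the hybrid RBM stores fewer weights than the Hopfield net."""
    return hybrid_rbm_synapses(n_visible, n_hidden) < hopfield_synapses(n_visible)


print(hopfield_synapses(1000))        # 499500
print(hybrid_rbm_synapses(1000, 50))  # 50000
print(rbm_saves_storage(1000, 50))    # True
```

The RBM representation is the smaller one whenever P < (N-1)/2, that is, when the number of stored patterns is less than about half the number of neurons.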

Preprint, 2010, arXiv

Self-consistent method for density estimation

The estimation of a density profile from experimental data points is a challenging problem, usually tackled by plotting a histogram. Prior assumptions on the nature of the density, from its smoothness to the specification of its form, allow the design of more accurate estimation procedures, such as Maximum Likelihood. Our aim is to construct a procedure that makes no explicit assumptions, but still provides an accurate estimate of the density. We introduce the self-consistent estimate: the power spectrum of a candidate density is given, and an estimation procedure is constructed on the assumption, to be released a posteriori, that the candidate is correct. The self-consistent estimate is defined as a prior candidate density that precisely reproduces itself. Our main result is to derive the exact expression of the self-consistent estimate for any given dataset, and to study its properties. Applications of the method require neither priors on the form of the density nor the subjective choice of parameters. A cutoff frequency, akin to a bin size or a kernel bandwidth, emerges naturally from the derivation. We apply the self-consistent estimate to artificial data generated from various distributions and show that it reaches the theoretical limit for the scaling of the square error with the dataset size.

Preprint, 2007, arXiv

Detecting spatial patterns with the cumulant function. Part I: The theory

In climate studies, detecting spatial patterns that largely deviate from the sample mean remains a statistical challenge. Although a Principal Component Analysis (PCA), or equivalently an Empirical Orthogonal Function (EOF) decomposition, is often applied for this purpose, it can only provide meaningful results if the underlying multivariate distribution is Gaussian. Indeed, PCA is based on optimizing second-order moment quantities, and the covariance matrix can only capture the full dependence structure for multivariate Gaussian vectors. Whenever the application at hand cannot satisfy this normality hypothesis (e.g. precipitation data), alternatives and/or improvements to PCA have to be developed and studied. To go beyond this second-order statistics constraint that limits the applicability of PCA, we take advantage of the cumulant function, which can produce higher-order moment information. This cumulant function, well-known in the statistical literature, allows us to propose a new, simple and fast procedure to identify spatial patterns for non-Gaussian data. Our algorithm consists of maximizing the cumulant function. To illustrate our approach, we implement it on three families of multivariate random vectors for which explicit computations can be obtained. In addition, we show that our algorithm corresponds to selecting the directions along which projected data display the largest spread over the marginal probability density tails.
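To make the idea of maximizing the cumulant function concrete, here is a minimal sketch on a toy two-dimensional dataset. It assumes the cumulant function is the log of the sample mean of exp(theta . x) and that the search runs over directions of a fixed length `scale`; both the fixed-length constraint and the brute-force grid search over angles are assumptions of this sketch, not details taken from the abstract. All names are hypothetical.

```python
import numpy as np


def empirical_cumulant_function(data, theta):
    """Log of the sample mean of exp(theta . x) over the rows of data."""
    return float(np.log(np.mean(np.exp(data @ theta))))


def best_direction_2d(data, scale, n_angles=360):
    """Grid search over 2-D directions of length `scale` (assumed search)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    values = [
        empirical_cumulant_function(
            data, scale * np.array([np.cos(a), np.sin(a)]))
        for a in angles
    ]
    best = int(np.argmax(values))
    return float(angles[best]), values[best]


# Toy zero-mean data: the first coordinate is skewed (three values at -1,
# one at +3), the second is symmetric (-1 or +1).
x1 = np.array([-1.0, -1.0, -1.0, 3.0])
x2 = np.array([-1.0, 1.0])
data = np.array([[a, b] for a in x1 for b in x2])

angle, value = best_direction_2d(data, scale=1.0)
opposite = empirical_cumulant_function(data, np.array([-1.0, 0.0]))
print(angle, value, opposite)
```

On this toy dataset the maximum lies along the positive first axis, where the long tail sits, and the value in the opposite direction is markedly smaller. A covariance-based method such as PCA cannot make that distinction, because the variance along a direction and its opposite is the same.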

Preprint, 2007, arXiv

Detecting spatial patterns with the cumulant function. Part II: An application to El Nino

The spatial coherence of a measured variable (e.g. temperature or pressure) is often studied to determine the regions where this variable varies the most or to find teleconnections, i.e. correlations between specific regions. While usual methods to find spatial patterns, such as Principal Components Analysis (PCA), are constrained by linear symmetries, the dependence of variables such as temperature or pressure at different locations is generally nonlinear. In particular, large deviations from the sample mean are expected to be strongly affected by such nonlinearities. Here we apply a newly developed nonlinear technique (Maxima of Cumulant Function, MCF) for the detection of typical spatial patterns that largely deviate from the mean. In order to test the technique and to introduce the methodology, we focus on the El Nino/Southern Oscillation and its spatial patterns. We find nonsymmetric temperature patterns corresponding to El Nino and La Nina, and we compare the results of MCF with other techniques, such as the symmetric solutions of PCA, and the nonsymmetric solutions of Nonlinear PCA (NLPCA). We find that MCF solutions are more reliable than the NLPCA fits, and can capture mixtures of principal components. Finally, we apply Extreme Value Theory to the temporal variations extracted from our methodology. We find that the tail of the distribution of extreme temperatures during La Nina episodes is bounded, while the tail during El Nino episodes is less likely to be bounded. This implies that the mean spatial patterns of the two phases are asymmetric, as well as the behaviour of their extremes.
