Researcher profile

Massimiliano Pontil

Massimiliano Pontil contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
22works
0followers
10topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

22 published item(s)

preprint2026arXiv

kooplearn: A Scikit-Learn Compatible Library of Algorithms for Evolution Operator Learning

kooplearn is a machine-learning library that implements linear, kernel, and deep-learning estimators of dynamical operators and their spectral decompositions. kooplearn can model both discrete-time evolution operators (Koopman/Transfer) and continuous-time infinitesimal generators. By learning these operators, users can analyze dynamical systems via spectral methods, derive data-driven reduced-order models, and forecast future states and observables. kooplearn's interface is compliant with the scikit-learn API, facilitating its integration into existing machine learning and data science workflows. Additionally, kooplearn includes curated benchmark datasets to support experimentation, reproducibility, and the fair comparison of learning algorithms. The software is available at https://github.com/Machine-Learning-Dynamical-Systems/kooplearn.

preprint2026arXiv

SpectraFormer: an Attention-Based Raman Unmixing Tool for Accessing the Graphene Buffer-Layer Signature on SiC

Raman spectroscopy is a key tool for graphene characterization, yet its application to graphene grown on silicon carbide (SiC) is strongly limited by the intense and variable second-order Raman response of the substrate. This limitation is critical for buffer layer graphene, a semiconducting interfacial phase, whose vibrational signatures are overlapped with the SiC background and challenging to be reliably accessed using conventional reference-based subtraction, due to strong spatial and experimental variability of the substrate signal. Here we present SpectraFormer, a transformer-based deep learning model that reconstructs the SiC Raman substrate contribution directly from post-growth partially masked spectroscopic data without relying on explicit reference measurements. By learning global correlations across the entire Raman shift range, the model captures the statistical structure of the SiC background and enables accurate reconstruction of its contribution in mixed spectra. Subtraction of the reconstructed substrate signal reveals weak vibrational features associated with ZLG that are inaccessible through conventional analysis methods. The extracted spectra are validated by ab initio vibrational calculations, allowing assignment of the resolved features to specific modes and confirming their physical consistency. By leveraging a state-of-the-art attention-based deep learning architecture, this approach establishes a robust, reference-free framework for Raman analysis of graphene on SiC and provides a foundation, compatible with real-time data acquisition, to its integration into automated, closed-loop AI-assisted growth optimization.

preprint2023arXiv

A randomized algorithm to solve reduced rank operator regression

We present and analyze an algorithm designed for addressing vector-valued regression problems involving possibly infinite-dimensional input and output spaces. The algorithm is a randomized adaptation of reduced rank regression, a technique to optimally learn a low-rank vector-valued function (i.e. an operator) between sampled data via regularized empirical risk minimization with rank constraints. We propose Gaussian sketching techniques both for the primal and dual optimization objectives, yielding Randomized Reduced Rank Regression (R4) estimators that are efficient and accurate. For each of our R4 algorithms we prove that the resulting regularized empirical risk is, in expectation w.r.t. randomness of a sketch, arbitrarily close to the optimal value when hyper-parameteres are properly tuned. Numerical expreriments illustrate the tightness of our bounds and show advantages in two distinct scenarios: (i) solving a vector-valued regression problem using synthetic and large-scale neuroscience datasets, and (ii) regressing the Koopman operator of a nonlinear stochastic dynamical system.

preprint2022arXiv

Distribution Regression with Sliced Wasserstein Kernels

The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lifts kernel-induced similarity on the input domain at the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics, are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and compare it with MMD-based estimators.

preprint2022arXiv

Meta Representation Learning with Contextual Linear Bandits

Meta-learning seeks to build algorithms that rapidly learn how to solve new learning problems based on previous experience. In this paper we investigate meta-learning in the setting of stochastic linear bandit tasks. We assume that the tasks share a low dimensional representation, which has been partially acquired from previous learning tasks. We aim to leverage this information in order to learn a new downstream bandit task, which shares the same representation. Our principal contribution is to show that if the learned representation estimates well the unknown one, then the downstream task can be efficiently learned by a greedy policy that we propose in this work. We derive an upper bound on the regret of this policy, which is, up to logarithmic factors, of order $r\sqrt{N}(1\vee \sqrt{d/T})$, where $N$ is the horizon of the downstream task, $T$ is the number of training tasks, $d$ the ambient dimension and $r \ll d$ the dimension of the representation. We highlight that our strategy does not need to know $r$. We note that if $T> d$ our bound achieves the same rate of optimal minimax bandit algorithms using the true underlying representation. Our analysis is inspired and builds in part upon previous work on meta-learning in the i.i.d. full information setting \citep{tripuraneni2021provable,boursier2022trace}. As a separate contribution we show how to relax certain assumptions in those works, thereby improving their representation learning and risk analysis.

preprint2022arXiv

Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport

The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.

preprint2021arXiv

Distance-Based Regularisation of Deep Networks for Fine-Tuning

We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state of the art fine-tuning competitors, and penalty-based alternatives that we show do not directly constrain the radius of the search space.

preprint2021arXiv

Robust Unsupervised Learning via L-Statistic Minimization

Designing learning algorithms that are resistant to perturbations of the underlying data distribution is a problem of wide practical and theoretical importance. We present a general approach to this problem focusing on unsupervised learning. The key assumption is that the perturbing distribution is characterized by larger losses relative to a given class of admissible models. This is exploited by a general descent algorithm which minimizes an $L$-statistic criterion over the model class, weighting small losses more. Our analysis characterizes the robustness of the method in terms of bounds on the reconstruction error relative to the underlying unperturbed distribution. As a byproduct, we prove uniform convergence bounds with respect to the proposed criterion for several popular models in unsupervised learning, a result which may be of independent interest.Numerical experiments with kmeans clustering and principal subspace analysis demonstrate the effectiveness of our approach.

preprint2020arXiv

Approximating Hamiltonian dynamics with the Nyström method

Simulating the time-evolution of quantum mechanical systems is BQP-hard and expected to be one of the foremost applications of quantum computers. We consider classical algorithms for the approximation of Hamiltonian dynamics using subsampling methods from randomized numerical linear algebra. We derive a simulation technique whose runtime scales polynomially in the number of qubits and the Frobenius norm of the Hamiltonian. As an immediate application, we show that sample based quantum simulation, a type of evolution where the Hamiltonian is a density matrix, can be efficiently classically simulated under specific structural conditions. Our main technical contribution is a randomized algorithm for approximating Hermitian matrix exponentials. The proof leverages a low-rank, symmetric approximation via the Nyström method. Our results suggest that under strong sampling assumptions there exist classical poly-logarithmic time simulations of quantum computations.

preprint2020arXiv

Efficient Tensor Kernel methods for sparse regression

Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an lp-norm regularization problem, with p=m/(m-1) and m even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement, by intriducing a new and more efficient layout for storing the data. Second, we use a Nystrom-type subsampling approach, which allows for a training phase with a smaller number of data points, so to reduce the computational cost. Experiments, both on synthetic and read datasets, show the effectiveness of the proposed improvements. Finally, we take case of implementing the cose in C++ so to further speed-up the computation.

preprint2020arXiv

Empirical Risk Minimization under Fairness Constraints

We address the problem of algorithmic fairness: ensuring that sensitive variables do not unfairly influence the outcome of a classifier. We present an approach based on empirical risk minimization, which incorporates a fairness constraint into the learning problem. It encourages the conditional risk of the learned classifier to be approximately constant with respect to the sensitive variable. We derive both risk and fairness bounds that support the statistical consistency of our approach. We specify our approach to kernel methods and observe that the fairness requirement implies an orthogonality constraint which can be easily added to these methods. We further observe that for linear models the constraint translates into a simple data preprocessing step. Experiments indicate that the method is empirically effective and performs favorably against state-of-the-art approaches.

preprint2020arXiv

Fair Regression with Wasserstein Barycenters

We study the problem of learning a real-valued function that satisfies the Demographic Parity constraint. It demands the distribution of the predicted output to be independent of the sensitive attribute. We consider the case that the sensitive attribute is available for prediction. We establish a connection between fair regression and optimal transport theory, based on which we derive a close form expression for the optimal fair predictor. Specifically, we show that the distribution of this optimum is the Wasserstein barycenter of the distributions induced by the standard regression function on the sensitive groups. This result offers an intuitive interpretation of the optimal fair prediction and suggests a simple post-processing algorithm to achieve fairness. We establish risk and distribution-free fairness guarantees for this procedure. Numerical experiments indicate that our method is very effective in learning fair models, with a relative increase in error rate that is inferior to the relative gain in fairness.

preprint2020arXiv

Generalization Properties of Optimal Transport GANs with Latent Distribution Learning

The Generative Adversarial Networks (GAN) framework is a well-established paradigm for probability matching and realistic sample generation. While recent attention has been devoted to studying the theoretical properties of such models, a full theoretical understanding of the main building blocks is still missing. Focusing on generative models with Optimal Transport metrics as discriminators, in this work we study how the interplay between the latent distribution and the complexity of the pushforward map (generator) affects performance, from both statistical and modelling perspectives. Motivated by our analysis, we advocate learning the latent distribution as well as the pushforward map within the GAN paradigm. We prove that this can lead to significant advantages in terms of sample complexity.

preprint2020arXiv

Learning Discrete Structures for Graph Neural Networks

Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.

preprint2020arXiv

Learning Fair and Transferable Representations

Developing learning methods which do not discriminate subgroups in the population is a central goal of algorithmic fairness. One way to reach this goal is by modifying the data representation in order to meet certain fairness constraints. In this work we measure fairness according to demographic parity. This requires the probability of the possible model decisions to be independent of the sensitive information. We argue that the goal of imposing demographic parity can be substantially facilitated within a multitask learning setting. We leverage task similarities by encouraging a shared fair representation across the tasks via low rank matrix factorization. We derive learning bounds establishing that the learned representation transfers well to novel tasks both in terms of prediction performance and fairness metrics. We present experiments on three real world datasets, showing that the proposed method outperforms state-of-the-art approaches by a significant margin.

preprint2020arXiv

Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification

We study the problem of fair binary classification using the notion of Equal Opportunity. It requires the true positive rate to distribute equally across the sensitive groups. Within this setting we show that the fair optimal classifier is obtained by recalibrating the Bayes classifier by a group-dependent threshold. We provide a constructive expression for the threshold. This result motivates us to devise a plug-in classification procedure based on both unlabeled and labeled datasets. While the latter is used to learn the output conditional probability, the former is used for calibration. The overall procedure can be computed in polynomial time and it is shown to be statistically consistent both in terms of the classification error and fairness measure. Finally, we present numerical experiments which indicate that our method is often superior or competitive with the state-of-the-art methods on benchmark datasets.

preprint2020arXiv

MARTHE: Scheduling the Learning Rate Via Online Hypergradients

We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient that uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al. 2018), and is able to produce learning rate schedules that are more stable leading to models that generalize better.

preprint2020arXiv

Meta-learning with Stochastic Linear Bandits

We investigate meta-learning procedures in the setting of stochastic linear bandits tasks. The goal is to select a learning algorithm which works well on average over a class of bandits tasks, that are sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a square euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.

preprint2020arXiv

On the Iteration Complexity of Hypergradient Computation

We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed point equation is defined by a contraction mapping, we present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods which confirm the theoretical findings.

preprint2020arXiv

Online Parameter-Free Learning of Multiple Low Variance Tasks

We propose a method to learn a common bias vector for a growing sequence of low-variance tasks. Unlike state-of-the-art approaches, our method does not require tuning any hyper-parameter. Our approach is presented in the non-statistical setting and can be of two variants. The "aggressive" one updates the bias after each datapoint, the "lazy" one updates the bias only at the end of each task. We derive an across-tasks regret bound for the method. When compared to state-of-the-art approaches, the aggressive variant returns faster rates, the lazy one recovers standard rates, but with no need of tuning hyper-parameters. We then adapt the methods to the statistical setting: the aggressive variant becomes a multi-task learning method, the lazy one a meta-learning method. Experiments confirm the effectiveness of our methods in practice.

preprint2020arXiv

Taking Advantage of Multitask Learning for Fair Classification

A central goal of algorithmic fairness is to reduce bias in automated decision making. An unavoidable tension exists between accuracy gains obtained by using sensitive information (e.g., gender or ethnic group) as part of a statistical model, and any commitment to protect these characteristics. Often, due to biases present in the data, using the sensitive information in the functional form of a classifier improves classification accuracy. In this paper we show how it is possible to get the best of both worlds: optimize model accuracy and fairness without explicitly using the sensitive feature in the functional form of the model, thereby treating different individuals equally. Our method is based on two key ideas. On the one hand, we propose to use Multitask Learning (MTL), enhanced with fairness constraints, to jointly learn group specific classifiers that leverage information between sensitive groups. On the other hand, since learning group specific models might not be permitted, we propose to first predict the sensitive features by any learning method and then to use the predicted sensitive feature to train MTL with fairness constraints. This enables us to tackle fairness with a three-pronged approach, that is, by increasing accuracy on each group, enforcing measures of fairness during training, and protecting sensitive information during testing. Experimental results on two real datasets support our proposal, showing substantial improvements in both accuracy and fairness.

preprint2020arXiv

The Advantage of Conditional Meta-Learning for Biased Regularization and Fine-Tuning

Biased regularization and fine-tuning are two recent meta-learning approaches. They have been shown to be effective to tackle distributions of tasks, in which the tasks' target vectors are all close to a common meta-parameter vector. However, these methods may perform poorly on heterogeneous environments of tasks, where the complexity of the tasks' distribution cannot be captured by a single meta-parameter vector. We address this limitation by conditional meta-learning, inferring a conditioning function mapping task's side information into a meta-parameter vector that is appropriate for that task at hand. We characterize properties of the environment under which the conditional approach brings a substantial advantage over standard meta-learning and we highlight examples of environments, such as those with multiple clusters, satisfying these properties. We then propose a convex meta-algorithm providing a comparable advantage also in practice. Numerical experiments confirm our theoretical findings.