Researcher profile

Stefano Favaro

Stefano Favaro contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2026arXiv

A Gaussian process limit for the self-normalized Ewens-Pitman process

For an integer $n\geq1$, consider a random partition $Π_{n}$ of $\{1,\ldots,n\}$ into $K_{n}$ partition sets with $K_{r,n}$ partition subsets of size $r=1,\ldots,n$, and assume $Π_{n}$ distributed according to the Ewens-Pitman model with parameters $α\in]0,1[$ and $θ>-α$. Although the large-$n$ asymptotic behaviors of $K_{n}$ and $K_{r,n}$ are well understood in terms of almost sure convergence and Gaussian fluctuations, much less is known about the asymptotic behavior of $P_{r,n}=K_{r,n}/K_n$ and of the self-normalized Ewens-Pitman process $(P_{1,n},P_{2,n},\dots)$. Motivated by the almost sure convergence of $(P_{1,n},P_{2,n},\dots)$ to the Sibuya distribution $p_α=(p_α(1),p_α(2),\ldots)$, where $p_α(r)$ is the probability mass at $r=1,2,\ldots$, we establish the $\ell^{2}$ distributional convergence \begin{displaymath} \sqrt{K_{n}}((P_{1,n},\,P_{2,n},\ldots)-p_α)\underset{n\rightarrow+\infty}{\overset{\cL}{\longrightarrow}}\mathcal{G}(Γ_α), \end{displaymath} where $\mathcal{G}(Γ_α)$ stands for a centered Gaussian process with covariance matrix $Γ_α=diag(p_α) - p_α p_α^T$. We apply our result to the estimation of the parameter

preprint2023arXiv

Large-width asymptotics for ReLU neural networks with $α$-Stable initializations

There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $α$-Stable NNs, namely NNs whose weights are initialized as $α$-Stable distributions with $α\in(0,2]$. First, for $α$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $α$-Stable stochastic process. As a difference with respect to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $α$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we study the large-width training dynamics of $α$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $α$-Stable NTK, and showing that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. The randomness of the $α$-Stable NTK is a further difference with respect to the Gaussian setting, that is: within the $α$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.

preprint2022arXiv

A new approach to posterior contraction rates via Wasserstein dynamics

This paper presents a new approach to the classical problem of quantifying posterior contraction rates (PCRs) in Bayesian statistics. Our approach relies on Wasserstein distance, and it leads to two main contributions which improve on the existing literature of PCRs. The first contribution exploits the dynamic formulation of Wasserstein distance, for short referred to as Wasserstein dynamics, in order to establish PCRs under dominated Bayesian statistical models. As a novelty with respect to existing approaches to PCRs, Wasserstein dynamics allows us to circumvent the use of sieves in both stating and proving PCRs, and it sets forth a natural connection between PCRs and three well-known classical problems in statistics and probability theory: the speed of mean Glivenko-Cantelli convergence, the estimation of weighted Poincaré-Wirtinger constants and Sanov large deviation principle for Wasserstein distance. The second contribution combines the use of Wasserstein distance with a suitable sieve construction to establish PCRs under full Bayesian nonparametric models. As a novelty with respect to existing literature of PCRs, our second result provides with the first treatment of PCRs under non-dominated Bayesian models. Applications of our results are presented for some classical Bayesian statistical models, e.g., regular parametric models, infinite-dimensional exponential families, linear regression in infinite dimension and nonparametric models under Dirichlet process priors.

preprint2022arXiv

Deep Stable neural networks: large-width asymptotics and convergence rates

In modern deep learning, there is a recent and growing literature on the interplay between large-width asymptotic properties of deep Gaussian neural networks (NNs), i.e. deep NNs with Gaussian-distributed weights, and Gaussian stochastic processes (SPs). Such an interplay has proved to be critical in Bayesian inference under Gaussian SP priors, kernel regression for infinitely wide deep NNs trained via gradient descent, and information propagation within infinitely wide NNs. Motivated by empirical analyses that show the potential of replacing Gaussian distributions with Stable distributions for the NN's weights, in this paper we present a rigorous analysis of the large-width asymptotic behaviour of (fully connected) feed-forward deep Stable NNs, i.e. deep NNs with Stable-distributed weights. We show that as the width goes to infinity jointly over the NN's layers, i.e. the ``joint growth" setting, a rescaled deep Stable NN converges weakly to a Stable SP whose distribution is characterized recursively through the NN's layers. Because of the non-triangular structure of the NN, this is a non-standard asymptotic problem, to which we propose an inductive approach of independent interest. Then, we establish sup-norm convergence rates of the rescaled deep Stable NN to the Stable SP, under the ``joint growth" and a ``sequential growth" of the width over the NN's layers. Such a result provides the difference between the ``joint growth" and the ``sequential growth" settings, showing that the former leads to a slower rate than the latter, depending on the depth of the layer and the number of inputs of the NN. Our work extends some recent results on infinitely wide limits for deep Gaussian NNs to the more general deep Stable NNs, providing the first result on convergence rates in the ``joint growth" setting.

preprint2022arXiv

Learning-augmented count-min sketches via Bayesian nonparametrics

The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (\textit{NeurIPS} 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being obtained as suitable mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of CMS, it has the major drawback of arising from a ``constructive" proof that builds upon arguments tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a ``Bayesian" proof of the CMS-DP that has the main advantage of building upon arguments that are usable, in principle, within a broad class of nonparametric priors arising from normalized completely random measures. This result leads to develop a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a Pitman-Yor process (PYP) prior. Under this more general framework, we apply the arguments of the ``Bayesian" proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed for low-frequency tokens. An extension of our BNP approach to more general queries is also discussed.

preprint2022arXiv

Near-optimal estimation of the unseen under regularly varying tail populations

Given $n$ samples from a population of individuals belonging to different species, what is the number $U$ of hitherto unseen species that would be observed if $λn$ new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent works introducing non-parametric estimators of $U$ that are minimax near-optimal and consistent all the way up to $λ\asymp\log n$. These works do not rely on any assumption on the underlying unknown distribution $p$ of the population, and therefore, while providing a theory in its greatest generality, worst-case distributions may severely hamper the estimation of $U$ in concrete applications. In this paper, we consider the problem of strengthening the non-parametric framework for estimating $U$. Inspired by the estimation of rare probabilities in extreme value theory, and motivated by the ubiquitous power-law type distributions in many natural and social phenomena, we make use of a semi-parametric assumption regular variation of index $α\in (0,1)$ for the tail behaviour of $p$. Under this assumption, we introduce an estimator of $U$ that is simple, linear in the sampling information, computationally efficient, and scalable to massive datasets. Then, uniformly over our class of regularly varying tail distributions, we show that the proposed estimator has provable guarantees: i) it is minimax near-optimal, up to a power of $\log n$ factor; ii) it is consistent all of the way up to $\logλ\asymp n^{α/2}/\sqrt{\log n}$, and this range is the best possible. This work presents the first study on the estimation of the unseen under regularly varying tail distributions. A numerical illustration of our methodology is presented for synthetic data and real data.

preprint2022arXiv

The power of private likelihood-ratio tests for goodness-of-fit in frequency tables

Privacy-protecting data analysis investigates statistical methods under privacy constraints. This is a rising challenge in modern statistics, as the achievement of confidentiality guarantees, which typically occurs through suitable perturbations of the data, may determine a loss in the statistical utility of the data. In this paper, we consider privacy-protecting tests for goodness-of-fit in frequency tables, this being arguably the most common form of releasing data, and present a rigorous analysis of the large sample behaviour of a private likelihood-ratio (LR) test. Under the framework of $(\varepsilon,δ)$-differential privacy for perturbed data, our main contribution is the power analysis of the private LR test, which characterizes the trade-off between confidentiality, measured via the differential privacy parameters $(\varepsilon,δ)$, and statistical utility, measured via the power of the test. This is obtained through a Bahadur-Rao large deviation expansion for the power of the private LR test, bringing out a critical quantity, as a function of the sample size, the dimension of the table and $(\varepsilon,δ)$, that determines a loss in the power of the test. Such a result is then applied to characterize the impact of the sample size and the dimension of the table, in connection with the parameters $(\varepsilon,δ)$, on the loss of the power of the private LR test. In particular, we determine the (sample) cost of $(\varepsilon,δ)$-differential privacy in the private LR test, namely the additional sample size that is required to recover the power of the Multinomial LR test in the absence of perturbation. Our power analysis rely on a non-standard large deviation analysis for the LR, as well as the development of a novel (sharp) large deviation principle for sum of i.i.d. random vectors, which is of independent interest.

preprint2022arXiv

Wasserstein posterior contraction rates in non-dominated Bayesian nonparametric models

Posterior contractions rates (PCRs) strengthen the notion of Bayesian consistency, quantifying the speed at which the posterior distribution concentrates on arbitrarily small neighborhoods of the true model, with probability tending to 1 or almost surely, as the sample size goes to infinity. Under the Bayesian nonparametric framework, a common assumption in the study of PCRs is that the model is dominated for the observations; that is, it is assumed that the posterior can be written through the Bayes formula. In this paper, we consider the problem of establishing PCRs in Bayesian nonparametric models where the posterior distribution is not available through the Bayes formula, and hence models that are non-dominated for the observations. By means of the Wasserstein distance and a suitable sieve construction, our main result establishes PCRs in Bayesian nonparametric models where the posterior is available through a more general disintegration than the Bayes formula. To the best of our knowledge, this is the first general approach to provide PCRs in non-dominated Bayesian nonparametric models, and it relies on minimal modeling assumptions and on a suitable continuity assumption for the posterior distribution. Some refinements of our result are presented under additional assumptions on the prior distribution, and applications are given with respect to the Dirichlet process prior and the normalized extended Gamma process prior.

preprint2021arXiv

A Bayesian nonparametric approach to count-min sketch under power-law data streams

The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens' frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.

preprint2021arXiv

More for less: Predicting and maximizing genetic variant discovery via Bayesian nonparametrics

While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality; they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals, and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.

preprint2020arXiv

A Berry-Esseen theorem for Pitman's $α$-diversity

This paper is concerned with the study of the random variable $K_n$ denoting the number of distinct elements in a random sample $(X_1, \dots, X_n)$ of exchangeable random variables driven by the two parameter Poisson-Dirichlet distribution, $PD(α,θ)$. For $α\in(0,1)$, Theorem 3.8 in \cite{Pit(06)} shows that $\frac{K_n}{n^α}\stackrel{\text{a.s.}}{\longrightarrow} S_{α,θ}$ as $n\rightarrow+\infty$. Here, $S_{α,θ}$ is a random variable distributed according to the so-called scaled Mittag-Leffler distribution. Our main result states that $$ \sup_{x \geq 0} \Big| \ppsf\Big[\frac{K_n}{n^α} \leq x \Big] - \ppsf[S_{α,θ} \leq x] \Big| \leq \frac{C(α, θ)}{n^α} $$ holds with an explicit constant $C(α, θ)$. The key ingredients of the proof are a novel probabilistic representation of $K_n$ as compound distribution and new, refined versions of certain quantitative bounds for the Poisson approximation and the compound Poisson distribution.

preprint2020arXiv

A Good-Turing estimator for feature allocation models

Feature allocation models generalize species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, given $n$ samples, we study the problem of estimating the missing mass $M_{n}$, namely the expected number hitherto unseen features that would be observed if one additional individual was sampled. This is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources allocated, and further samples can be only motivated by the possibility of recording new unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator $\hat{M}_{n}$ of $M_{n}$. $\hat{M}_{n}$ turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass in species sampling models, with the difference that the two estimators have different ranges. We show that $\hat{M}_{n}$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator, we give provable guarantees for the performance of $\hat{M}_{n}$ in terms of minimax rate optimality, and we provide with an interesting connection between $\hat{M}_{n}$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_{n}$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.

preprint2020arXiv

Approximating predictive probabilities of Gibbs-type priors

Gibbs-type random probability measures, or Gibbs-type priors, are arguably the most "natural" generalization of the celebrated Dirichlet prior. Among them the two parameter Poisson-Dirichlet prior certainly stands out for the mathematical tractability and interpretability of its predictive probabilities, which made it the natural candidate in several applications. Given a sample of size $n$, in this paper we show that the predictive probabilities of any Gibbs-type prior admit a large $n$ approximation, with an error term vanishing as $o(1/n)$, which maintains the same desirable features as the predictive probabilities of the two parameter Poisson-Dirichlet prior.

preprint2020arXiv

De Finetti's theorem: rate of convergence in Kolmogorov distance

This paper provides a quantitative version of de Finetti law of large numbers. Given an infinite sequence $\{X_n\}_{n \geq 1}$ of exchangeable Bernoulli variables, it is well-known that $\frac{1}{n} \sum_{i = 1}^n X_i \stackrel{a.s.}{\longrightarrow} Y$, for a suitable random variable $Y$ taking values in $[0,1]$. Here, we consider the rate of convergence in law of $\frac{1}{n} \sum_{i = 1}^n X_i$ towards $Y$, with respect to the Kolmogorov distance. After showing that any rate of the type of $1/n^α$ can be obtained for any $α\in (0,1]$, we find a sufficient condition on the probability distribution of $Y$ for the achievement of the optimal rate of convergence, that is $1/n$. Our main result improve on existing literature: in particular, with respect to \cite{MPS}, we study a stronger metric while, with respect to \cite{Mna}, we weaken the regularity hypothesis on the probability distribution of $Y$.

preprint2020arXiv

Infinitely deep neural networks as diffusion processes

When the parameters are independently and identically distributed (initialized) neural networks exhibit undesirable properties that emerge as the number of layers increases, e.g. a vanishing dependency on the input and a concentration on restrictive families of functions including constant functions. We consider parameter distributions that shrink as the number of layers increases in order to recover well-behaved stochastic processes in the limit of infinite depth. This leads to set forth a link between infinitely deep residual networks and solutions to stochastic differential equations, i.e. diffusion processes. We show that these limiting processes do not suffer from the aforementioned issues and investigate their properties.

preprint2020arXiv

Nonparametric Bayesian multi-armed bandits for single cell experiment design

The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large scale experiment for the collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on the following tools: i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed by using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-Oracle performance on simulated and scRNA-seq data alike. HPY-TS code is available at https://github.com/fedfer/HPYsinglecell.

preprint2020arXiv

Stable behaviour of infinitely wide deep neural networks

We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. Then, we show that the infinite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite wide limits of NNs (Matthews at al., 2018b). Parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to expand recent lines of research that rely on Gaussian infinite wide limits.