Researcher profile

Emanuele Dolera

Emanuele Dolera contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
7 works
0 followers
5 topics
4 close collaborators

Published work

7 published items

preprint · 2022 · arXiv

A new approach to posterior contraction rates via Wasserstein dynamics

This paper presents a new approach to the classical problem of quantifying posterior contraction rates (PCRs) in Bayesian statistics. Our approach relies on the Wasserstein distance, and it leads to two main contributions that improve on the existing literature on PCRs. The first contribution exploits the dynamic formulation of the Wasserstein distance, referred to for short as Wasserstein dynamics, in order to establish PCRs under dominated Bayesian statistical models. As a novelty with respect to existing approaches to PCRs, Wasserstein dynamics allows us to circumvent the use of sieves in both stating and proving PCRs, and it sets forth a natural connection between PCRs and three well-known classical problems in statistics and probability theory: the speed of mean Glivenko-Cantelli convergence, the estimation of weighted Poincaré-Wirtinger constants, and Sanov's large deviation principle for the Wasserstein distance. The second contribution combines the use of the Wasserstein distance with a suitable sieve construction to establish PCRs under full Bayesian nonparametric models. As a novelty with respect to the existing literature on PCRs, our second result provides the first treatment of PCRs under non-dominated Bayesian models. Applications of our results are presented for some classical Bayesian statistical models, e.g., regular parametric models, infinite-dimensional exponential families, linear regression in infinite dimension, and nonparametric models under Dirichlet process priors.
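For readers unfamiliar with the term, the dynamic formulation mentioned in this abstract is commonly written in the Benamou-Brenier form below. This is a standard textbook statement for the quadratic case, given here only as background; the order $p = 2$, the ambient space $\mathbb{R}^d$ and the symbols $\rho_t$, $v_t$, $\mu_0$, $\mu_1$ are assumptions for illustration, since the abstract does not specify the setting used in the paper.

```latex
% Benamou-Brenier (dynamic) formulation of the quadratic Wasserstein distance
% between two probability measures \mu_0 and \mu_1 on \mathbb{R}^d.
% The infimum runs over curves of measures \rho_t with velocity fields v_t
% that solve the continuity equation and join \mu_0 to \mu_1.
W_2^2(\mu_0, \mu_1)
  = \inf_{(\rho_t, v_t)} \left\{
      \int_0^1 \int_{\mathbb{R}^d} |v_t(x)|^2 \, \rho_t(\mathrm{d}x) \, \mathrm{d}t
      \;:\;
      \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \
      \rho_0 = \mu_0, \ \rho_1 = \mu_1
    \right\}
```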

preprint · 2022 · arXiv

Learning-augmented count-min sketches via Bayesian nonparametrics

The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of tokens' frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being obtained as suitable mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of the CMS, it has the major drawback of arising from a "constructive" proof that builds upon arguments tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a "Bayesian" proof of the CMS-DP that has the main advantage of building upon arguments that are usable, in principle, within a broad class of nonparametric priors arising from normalized completely random measures. This result leads us to develop a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a Pitman-Yor process (PYP) prior. Under this more general framework, we apply the arguments of the "Bayesian" proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed for low-frequency tokens. An extension of our BNP approach to more general queries is also discussed.
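As background for the abstract above, the following is a minimal sketch of the classical count-min sketch only; it is not the CMS-DP or CMS-PYP estimator, which replace the minimum with a posterior functional. The class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2b digests as stand-ins for independent hash functions are all illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch (illustrative, not the learning-augmented variants)."""

    def __init__(self, width, depth):
        self.width = width  # number of buckets per hash function
        self.depth = depth  # number of hash functions (rows)
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, token):
        # Assumption: salting one hash with the row index emulates
        # `depth` independent hash functions.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, token, count=1):
        # Add the token's count to one bucket in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, token)] += count

    def query(self, token):
        # Point query: collisions can only inflate a bucket, so the minimum
        # over rows is an upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(row, token)]
            for row in range(self.depth)
        )


stream = ["the"] * 5 + ["sketch"] * 3 + ["rare"]
sketch = CountMinSketch(width=64, depth=4)
for tok in stream:
    sketch.update(tok)
print(sketch.query("the"), sketch.query("rare"))
```

The point query never underestimates a token's true count, which is why overestimation of low-frequency tokens is the natural failure mode that the learning-augmented variants target.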

preprint · 2022 · arXiv

The power of private likelihood-ratio tests for goodness-of-fit in frequency tables

Privacy-protecting data analysis investigates statistical methods under privacy constraints. This is a rising challenge in modern statistics, as the achievement of confidentiality guarantees, which typically occurs through suitable perturbations of the data, may entail a loss in the statistical utility of the data. In this paper, we consider privacy-protecting tests for goodness-of-fit in frequency tables, this being arguably the most common form of releasing data, and present a rigorous analysis of the large sample behaviour of a private likelihood-ratio (LR) test. Under the framework of $(\varepsilon,\delta)$-differential privacy for perturbed data, our main contribution is the power analysis of the private LR test, which characterizes the trade-off between confidentiality, measured via the differential privacy parameters $(\varepsilon,\delta)$, and statistical utility, measured via the power of the test. This is obtained through a Bahadur-Rao large deviation expansion for the power of the private LR test, bringing out a critical quantity, as a function of the sample size, the dimension of the table and $(\varepsilon,\delta)$, that determines a loss in the power of the test. Such a result is then applied to characterize the impact of the sample size and the dimension of the table, in connection with the parameters $(\varepsilon,\delta)$, on the loss of the power of the private LR test. In particular, we determine the (sample) cost of $(\varepsilon,\delta)$-differential privacy in the private LR test, namely the additional sample size that is required to recover the power of the Multinomial LR test in the absence of perturbation. Our power analysis relies on a non-standard large deviation analysis for the LR, as well as the development of a novel (sharp) large deviation principle for sums of i.i.d. random vectors, which is of independent interest.
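For reference, the usual definition of $(\varepsilon,\delta)$-differential privacy is recalled below. It is the standard textbook statement; the symbols $M$ (randomized mechanism), $D$, $D'$ (neighbouring datasets) and $A$ (measurable output set) are notation assumed here, and the abstract does not say which notion of neighbouring datasets or which perturbation mechanism the paper adopts.

```latex
% Standard definition: a randomized mechanism M is
% (\varepsilon, \delta)-differentially private if, for all neighbouring
% datasets D and D' and all measurable sets A of outputs,
\mathbb{P}\big[ M(D) \in A \big]
  \;\leq\; e^{\varepsilon} \, \mathbb{P}\big[ M(D') \in A \big] + \delta
```

Smaller values of $\varepsilon$ and $\delta$ give stronger confidentiality, which is the side of the trade-off that the power analysis weighs against statistical utility.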

preprint · 2022 · arXiv

Wasserstein posterior contraction rates in non-dominated Bayesian nonparametric models

Posterior contraction rates (PCRs) strengthen the notion of Bayesian consistency, quantifying the speed at which the posterior distribution concentrates on arbitrarily small neighborhoods of the true model, with probability tending to 1 or almost surely, as the sample size goes to infinity. Under the Bayesian nonparametric framework, a common assumption in the study of PCRs is that the model is dominated for the observations; that is, it is assumed that the posterior can be written through the Bayes formula. In this paper, we consider the problem of establishing PCRs in Bayesian nonparametric models where the posterior distribution is not available through the Bayes formula, and hence models that are non-dominated for the observations. By means of the Wasserstein distance and a suitable sieve construction, our main result establishes PCRs in Bayesian nonparametric models where the posterior is available through a more general disintegration than the Bayes formula. To the best of our knowledge, this is the first general approach to provide PCRs in non-dominated Bayesian nonparametric models, and it relies on minimal modeling assumptions and on a suitable continuity assumption for the posterior distribution. Some refinements of our result are presented under additional assumptions on the prior distribution, and applications are given with respect to the Dirichlet process prior and the normalized extended Gamma process prior.

preprint · 2021 · arXiv

A Bayesian nonparametric approach to count-min sketch under power-law data streams

The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens' frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn the corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in natural language processing, where low-frequency tokens are indeed common owing to the power-law behaviour of the data.

preprint · 2020 · arXiv

A Berry-Esseen theorem for Pitman's $α$-diversity

This paper is concerned with the study of the random variable $K_n$ denoting the number of distinct elements in a random sample $(X_1, \dots, X_n)$ of exchangeable random variables driven by the two-parameter Poisson-Dirichlet distribution, $PD(α,θ)$. For $α\in(0,1)$, Theorem 3.8 in \cite{Pit(06)} shows that $\frac{K_n}{n^α}\stackrel{\text{a.s.}}{\longrightarrow} S_{α,θ}$ as $n\rightarrow+\infty$. Here, $S_{α,θ}$ is a random variable distributed according to the so-called scaled Mittag-Leffler distribution. Our main result states that $$ \sup_{x \geq 0} \Big| \mathsf{P}\Big[\frac{K_n}{n^α} \leq x \Big] - \mathsf{P}[S_{α,θ} \leq x] \Big| \leq \frac{C(α, θ)}{n^α} $$ holds with an explicit constant $C(α, θ)$. The key ingredients of the proof are a novel probabilistic representation of $K_n$ as a compound distribution and new, refined versions of certain quantitative bounds for the Poisson approximation and the compound Poisson distribution.
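To make $K_n$ concrete, the sketch below simulates it through the standard sequential sampling scheme for $PD(α,θ)$ (the two-parameter Chinese restaurant process), in which, after $i$ draws showing $k$ distinct values, the next draw is new with probability $(θ + kα)/(θ + i)$. This is a textbook construction, not the compound representation developed in the paper; the function name `simulate_num_distinct` and the parameter values are hypothetical.

```python
import random


def simulate_num_distinct(n, alpha, theta, rng):
    """Simulate K_n, the number of distinct values among n draws,
    under the two-parameter sequential sampling scheme.

    Assumes 0 <= alpha < 1 and theta > -alpha.
    """
    k = 0  # number of distinct values seen so far
    for i in range(n):  # i = number of draws already made
        if i == 0:
            p_new = 1.0  # the first draw is always a new value
        else:
            p_new = (theta + k * alpha) / (theta + i)
        if rng.random() < p_new:
            k += 1
    return k


rng = random.Random(0)
n, alpha, theta = 10_000, 0.5, 1.0
samples = [simulate_num_distinct(n, alpha, theta, rng) for _ in range(20)]
# Ratios K_n / n^alpha, whose almost sure limit is S_{alpha, theta}.
ratios = [k / n**alpha for k in samples]
print(min(ratios), max(ratios))
```

Only the count of distinct values is tracked, since the probability of a new value depends on the past through $k$ and $i$ alone.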

preprint · 2020 · arXiv

De Finetti's theorem: rate of convergence in Kolmogorov distance

This paper provides a quantitative version of de Finetti's law of large numbers. Given an infinite sequence $\{X_n\}_{n \geq 1}$ of exchangeable Bernoulli variables, it is well-known that $\frac{1}{n} \sum_{i = 1}^n X_i \stackrel{a.s.}{\longrightarrow} Y$, for a suitable random variable $Y$ taking values in $[0,1]$. Here, we consider the rate of convergence in law of $\frac{1}{n} \sum_{i = 1}^n X_i$ towards $Y$, with respect to the Kolmogorov distance. After showing that any rate of the type $1/n^α$ can be obtained for any $α\in (0,1]$, we find a sufficient condition on the probability distribution of $Y$ for the achievement of the optimal rate of convergence, that is $1/n$. Our main result improves on the existing literature: in particular, with respect to \cite{MPS}, we study a stronger metric while, with respect to \cite{Mna}, we weaken the regularity hypothesis on the probability distribution of $Y$.
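A simple special case shows what a $1/n$ rate looks like. Assume, purely for illustration, that $Y$ is uniform on $[0,1]$ and that, given $Y$, the $X_i$ are i.i.d. Bernoulli($Y$). Then $S_n = \sum_{i=1}^n X_i$ is uniform on $\{0, \dots, n\}$, and the Kolmogorov distance between the law of $S_n/n$ and the law of $Y$ can be computed exactly. The sketch below does this with rational arithmetic; the function names are hypothetical and the example is not taken from the paper.

```python
from fractions import Fraction
from math import comb, factorial


def mixed_binomial_pmf(n, k):
    """P(S_n = k) when Y ~ Uniform(0, 1): the integral of
    C(n, k) y^k (1 - y)^(n - k) over [0, 1], via the Beta integral."""
    beta_integral = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
    return comb(n, k) * beta_integral


def kolmogorov_distance_uniform(n):
    """Exact sup_x |P(S_n / n <= x) - x| for uniform Y.

    The CDF of S_n / n is a step function, so the supremum is attained
    at a jump point k / n or in the limit just to its left.
    """
    best = Fraction(0)
    cdf_left = Fraction(0)  # P(S_n / n < k / n)
    for k in range(n + 1):
        x = Fraction(k, n)
        cdf_at = cdf_left + mixed_binomial_pmf(n, k)  # P(S_n / n <= k / n)
        best = max(best, abs(cdf_left - x), abs(cdf_at - x))
        cdf_left = cdf_at
    return best


for n in (1, 10, 100):
    print(n, kolmogorov_distance_uniform(n))
```

In this special case the distance equals $1/(n+1)$, so it decays at the order $1/n$ that the abstract calls optimal.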