Researcher profile

Eustasio del Barrio

Eustasio del Barrio contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
10 works
0 followers
8 topics
4 close collaborators

Published work

10 published item(s)

preprint · 2022 · arXiv

An improved central limit theorem and fast convergence rates for entropic transportation costs

We prove a central limit theorem for the entropic transportation cost between subgaussian probability measures, centered at the population cost. This is the first result that allows for asymptotically valid inference for entropic optimal transport between measures which are not necessarily discrete. In the compactly supported case, we complement these results with new, faster convergence rates for the expected entropic transportation cost between empirical measures. Our proof is based on strengthening convergence results for dual solutions to the entropic optimal transport problem.
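
To make the quantity in this abstract concrete, the following is a minimal sketch of an entropic transportation cost between two empirical measures. The abstract does not say how the cost is computed; Sinkhorn iterations are a standard technique swapped in here for illustration. The function name `entropic_ot_cost`, the squared-distance cost, the uniform weights, the one-dimensional samples and the default values of `eps` and `n_iter` are all assumptions, not taken from the paper.

```python
import math


def entropic_ot_cost(xs, ys, eps=1.0, n_iter=500):
    """Sinkhorn sketch for uniform empirical measures on 1-D samples.

    Assumed cost: c(x, y) = (x - y)**2. Returns a pair
    (transport part <plan, C>, transport part + eps * KL(plan | a x b)).
    Very small eps can underflow exp(-c / eps); this sketch does not guard that.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform weights on the first sample
    b = [1.0 / m] * m  # uniform weights on the second sample
    C = [[(x - y) ** 2 for y in ys] for x in xs]
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]

    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport = sum(plan[i][j] * C[i][j] for i in range(n) for j in range(m))
    kl = sum(
        plan[i][j] * math.log(plan[i][j] / (a[i] * b[j]))
        for i in range(n)
        for j in range(m)
        if plan[i][j] > 0.0
    )
    return transport, transport + eps * kl


# Hypothetical toy samples, for illustration only.
transport, regularised = entropic_ot_cost([0.0, 1.0], [0.0, 1.0], eps=1.0)
print(transport, regularised)
```

The second returned value adds the entropy penalty to the transport part; conventions for which of the two is called the entropic cost vary, so both are returned.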

preprint · 2022 · arXiv

Central Limit Theorems for Semidiscrete Wasserstein Distances

We prove a Central Limit Theorem for the empirical optimal transport cost, $\sqrt{\frac{nm}{n+m}}\{\mathcal{T}_c(P_n,Q_m)-\mathcal{T}_c(P,Q)\}$, in the semidiscrete case, i.e., when the distribution $P$ is supported on $N$ points, but without assumptions on $Q$. We show that the asymptotic distribution is the supremum of a centered Gaussian process, which is Gaussian under some additional conditions on the probability $Q$ and on the cost. Such results imply the central limit theorem for the $p$-Wasserstein distance, for $p\geq 1$. This means that, for fixed $N$, the curse of dimensionality is avoided. To better understand the influence of $N$, we provide bounds on $E|\mathcal{W}_1(P,Q_m)-\mathcal{W}_1(P,Q)|$ depending on $m$ and $N$. Finally, the semidiscrete framework provides control of the second derivative of the dual formulation, which yields the first central limit theorem for the optimal transport potentials. The results are supported by simulations that help to visualize the given limits and bounds. We also analyse the cases where the classical bootstrap works.

preprint · 2022 · arXiv

Nonparametric Multiple-Output Center-Outward Quantile Regression

Based on the novel concept of multivariate center-outward quantiles introduced recently in Chernozhukov et al. (2017) and Hallin et al. (2021), we consider the problem of nonparametric multiple-output quantile regression. Our approach defines nested conditional center-outward quantile regression contours and regions with given conditional probability content irrespective of the underlying distribution; their graphs constitute nested center-outward quantile regression tubes. Empirical counterparts of these concepts are constructed, yielding interpretable empirical regions and contours which are shown to consistently reconstruct their population versions in the Pompeiu-Hausdorff topology. Our method is entirely nonparametric and performs well in simulations including heteroskedasticity and nonlinear trends; its power as a data-analytic tool is illustrated on some real datasets.

preprint · 2021 · arXiv

Central Limit Theorems for General Transportation Costs

We consider the problem of optimal transportation with general cost between an empirical measure and a general target probability on $\mathbb{R}^d$, with $d \ge 1$. We extend results in [19] and prove asymptotic stability of both optimal transport maps and potentials for a large class of costs in $\mathbb{R}^d$. We derive a central limit theorem (CLT) towards a Gaussian distribution for the empirical transportation cost under minimal assumptions, with a new proof based on the Efron-Stein inequality and on the sequential compactness of the closed unit ball in $L^2(P)$ for the weak topology. We also provide CLTs for empirical Wasserstein distances in the special case of potential costs $|\cdot|^p$, $p > 1$.

preprint · 2020 · arXiv

A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set

Applications based on Machine Learning models have now become an indispensable part of everyday life and the professional world. A critical question has recently arisen among the population: do algorithmic decisions convey any type of discrimination against specific groups of the population or minorities? In this paper, we show the importance of understanding how a bias can be introduced into automatic decisions. We first present a mathematical framework for the fair learning problem, specifically in the binary classification setting. We then propose to quantify the presence of bias by using the standard Disparate Impact index on the real and well-known Adult income data set. Finally, we check the performance of different approaches aiming to reduce the bias in binary classification outcomes. Importantly, we show that some intuitive methods are ineffective. This sheds light on the fact that trying to make fair machine learning models may be a particularly challenging task, in particular when the training observations contain a bias.
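
As a rough illustration of the Disparate Impact index mentioned in this abstract, the sketch below computes the ratio of positive-decision rates between a protected group and a reference group. The function name `disparate_impact`, the 0/1 encoding of the group variable and the toy data are assumptions made here; this is not the authors' code and it does not use the Adult data set.

```python
def disparate_impact(decisions, group):
    """Ratio P(decision = 1 | group = 0) / P(decision = 1 | group = 1).

    Assumed encoding: decisions are 0/1 outcomes of a binary classifier,
    group is 0 for the protected group and 1 for the reference group.
    Values well below 1 indicate fewer positive decisions for the protected group.
    """
    protected = [d for d, g in zip(decisions, group) if g == 0]
    reference = [d for d, g in zip(decisions, group) if g == 1]
    if not protected or not reference:
        raise ValueError("both groups must contain at least one observation")

    rate_protected = sum(protected) / len(protected)
    rate_reference = sum(reference) / len(reference)
    if rate_reference == 0:
        raise ValueError("reference group has no positive decisions")
    return rate_protected / rate_reference


# Hypothetical toy data: 4 protected and 4 reference individuals.
decisions = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(disparate_impact(decisions, group))
```

In this toy sample the protected group receives a positive decision a quarter of the time and the reference group three quarters of the time, so the index is one third.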

preprint · 2020 · arXiv

Center-Outward Distribution Functions, Quantiles, Ranks, and Signs in $\mathbb{R}^d$

Univariate concepts such as quantile and distribution functions, involving ranks and signs, do not canonically extend to $\mathbb{R}^d, d\geq 2$. Attempts to palliate this have generated an abundant literature. Chapter 1 shows that, unlike the many definitions that have been proposed so far, the measure transportation-based ones introduced in Chernozhukov et al. (2017) enjoy all the properties that make univariate quantiles and ranks successful tools for semiparametric statistical inference. We therefore propose a new center-outward definition of multivariate distribution and quantile functions, along with their empirical counterparts, for which we obtain a Glivenko-Cantelli result. Our approach is geometric and, contrary to the Monge-Kantorovich one in Chernozhukov et al. (2017), does not require any moment assumptions. The resulting ranks and signs are strictly distribution-free, and maximal invariant under the action of a data-driven class of (order-preserving) transformations generating the family of absolutely continuous distributions; that property is the theoretical foundation of the semiparametric efficiency preservation property of ranks. The corresponding quantiles are equivariant under the same transformations. The proposed empirical distribution functions are defined at observed values only. A continuous extension to the entire $\mathbb{R}^d$, yielding continuous empirical quantile contours while preserving the monotonicity and Glivenko-Cantelli features, is desirable. Such an extension requires solving a nontrivial problem of smooth interpolation under cyclical monotonicity constraints. A complete solution of that problem is given in Chapter 2; we show that the resulting distribution and quantile functions are Lipschitz, and provide a sharp lower bound for the Lipschitz constants. A numerical study of empirical center-outward quantile contours and their consistency is conducted.

preprint · 2020 · arXiv

optimalFlow: Optimal-transport approach to flow cytometry gating and population matching

Data obtained from Flow Cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well-known phenomenon produced by measurements on different individuals, with different characteristics such as illness, age, sex, etc. The use of different settings for measurement, the variation of the conditions during experiments and the different types of flow cytometers are some of the technical causes of variability. This mixture of sources of variability makes the use of supervised machine learning for identification of cell populations difficult. The present work is conceived as a combination of strategies to facilitate the task of supervised gating. We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusters cytometries and produces prototype cytometries for the different groups. We show that supervised learning, restricted to the new groups, performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state-of-the-art techniques on the proposed datasets. Our code is freely available as optimalFlow, a Bioconductor R package, at https://bioconductor.org/packages/optimalFlow. The combination of optimalFlowTemplates and optimalFlowClassification addresses the problem of using supervised learning while accounting for biological and technical variability. Our methodology provides a robust automated gating workflow that handles the intrinsic variability of flow cytometry data well. Our main innovation is the methodology itself and the optimal-transport techniques that we apply to flow cytometry analysis.

preprint · 2020 · arXiv

Review of Mathematical frameworks for Fairness in Machine Learning

A review of the main fairness definitions and fair learning methodologies proposed in the literature in recent years is presented from a mathematical point of view. Following our independence-based approach, we consider how to build fair algorithms and the consequences on the degradation of their performance compared to the possibly unfair case. This corresponds to the price for fairness given by the criteria of statistical parity or equality of odds. Novel results giving the expressions of the optimal fair classifier and the optimal fair predictor (under a linear regression Gaussian model) in the sense of equality of odds are presented.

preprint · 2019 · arXiv

Box-constrained monotone $L_\infty$-approximations to Lipschitz regularizations, with applications to robust testing

Tests of fit to exact models in statistical analysis often lead to rejections even when the model is a useful approximate description of the random generator of the data. Among possible relaxations of a fixed model, the one defined by contamination neighbourhoods, namely $\mathcal{V}_\alpha(P_0)=\{(1-\alpha)P_0+\alpha Q: Q \in \mathcal{P}\}$, where $\mathcal{P}$ is the set of all probabilities in the sample space, has received much attention owing to its central role in Robust Statistics. For probabilities on the real line, consistent tests of fit to $\mathcal{V}_\alpha(P_0)$ can be based on $d_K(P_0,R_\alpha(P))$, the minimal Kolmogorov distance between $P_0$ and the set of trimmings of $P$, $R_\alpha(P)=\big\{\tilde P\in\mathcal{P}:\tilde P\ll P,\,{\textstyle \frac{d\tilde P}{dP}\leq\frac{1}{1-\alpha}}\, P\text{-a.s.}\big\}$. We show that this functional admits equivalent formulations either as the best approximation in uniform norm by $L$-Lipschitz functions satisfying a box constraint, or as the best monotone approximation in uniform norm to the $L$-Lipschitz regularization, which is seen to be expressible in terms of the average of the Pasch-Hausdorff envelopes. This representation of the solution of the variational problem allows us to obtain results showing stability of the functional $d_K(P_0,R_\alpha(P))$, as well as directional differentiability, providing the basis for a Central Limit Theorem for that functional.