Source author record

Etienne Roquain

Etienne Roquain appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.ST, Methodology, Statistics Theory, Applications

Catalog footprint: 12 works, 4 topics, 4 close collaborators


Preprint, 2023, arXiv

Online multiple testing with super-uniformity reward

Valid online inference is an important problem in contemporary multiple testing research, to which various solutions have been proposed recently. It is well known that these existing methods can suffer from a significant loss of power if the null $p$-values are conservative. In this work, we extend the previously introduced methodology to obtain more powerful procedures for the case of super-uniformly distributed $p$-values. These types of $p$-values arise in important settings, e.g. when discrete hypothesis tests are performed or when the $p$-values are weighted. To this end, we introduce the method of super-uniformity reward (SUR) that incorporates information about the individual null cumulative distribution functions. Our approach yields several new 'rewarded' procedures that offer uniform power improvements over known procedures and come with mathematical guarantees for controlling online error criteria based either on the family-wise error rate (FWER) or the marginal false discovery rate (mFDR). We illustrate the benefit of super-uniform rewarding in real-data analyses and simulation studies. While discrete tests serve as our leading example, we also show how our method can be applied to weighted $p$-values.

Preprint, 2022, arXiv

Empirical Bayes cumulative $\ell$-value multiple testing procedure for sparse sequences

In the sparse sequence model, we consider a popular Bayesian multiple testing procedure and investigate for the first time its behaviour from the frequentist point of view. Given a spike-and-slab prior on the high-dimensional sparse unknown parameter, one can easily compute posterior probabilities of coming from the spike, which correspond to the well known local-fdr values, also called $\ell$-values. The spike-and-slab weight parameter is calibrated in an empirical Bayes fashion, using marginal maximum likelihood. The multiple testing procedure under study, called here the cumulative $\ell$-value procedure, ranks coordinates according to their empirical $\ell$-values and thresholds so that the cumulative ranked sum does not exceed a user-specified level $t$. We validate the use of this method from the multiple testing perspective: for alternatives of appropriately large signal strength, the false discovery rate (FDR) of the procedure is shown to converge to the target level $t$, while its false negative rate (FNR) goes to $0$. We complement this study by providing convergence rates for the method. Additionally, we prove that the $q$-value multiple testing procedure shares similar convergence rates in this model.
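The thresholding step described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cumulative_l_value_rejections` is hypothetical, the $\ell$-values are taken as given inputs (the empirical Bayes calibration of the spike-and-slab weight is not reproduced), and the rule is assumed to compare the running mean of the sorted $\ell$-values with $t$, which is the usual posterior-FDR form; the abstract itself only speaks of a "cumulative ranked sum".

```python
def cumulative_l_value_rejections(l_values, t):
    """Return sorted indices rejected by a cumulative l-value rule.

    Assumption (not stated in the abstract): the rule rejects the k smallest
    l-values for the largest k whose running mean is at most t.
    """
    # Rank coordinates from the smallest to the largest l-value.
    order = sorted(range(len(l_values)), key=lambda i: l_values[i])
    running_sum = 0.0
    n_reject = 0
    for k, idx in enumerate(order, start=1):
        running_sum += l_values[idx]
        # The running mean of sorted values is nondecreasing in k,
        # so the last k passing this check is the largest admissible one.
        if running_sum / k <= t:
            n_reject = k
    return sorted(order[:n_reject])


# Toy l-values (made up for illustration) and target level t = 0.2.
example_l_values = [0.9, 0.01, 0.3, 0.05, 0.6]
print(cumulative_l_value_rejections(example_l_values, 0.2))  # [1, 2, 3]
```

In the toy example the running means of the sorted values are 0.01, 0.03, 0.12 and 0.24, so three coordinates are rejected at level 0.2.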

Preprint, 2015, arXiv

New procedures controlling the false discovery proportion via Romano-Wolf's heuristic

The false discovery proportion (FDP) is a convenient way to account for false positives when a large number $m$ of tests are performed simultaneously. Romano and Wolf [Ann. Statist. 35 (2007) 1378-1408] have proposed a general principle that builds FDP controlling procedures from $k$-family-wise error rate controlling procedures while incorporating dependencies in an appropriate manner; see Korn et al. [J. Statist. Plann. Inference 124 (2004) 379-398]; Romano and Wolf (2007). However, the theoretical validity of the latter is still largely unknown. This paper provides a careful study of this heuristic: first, we extend this approach by using a notion of "bounding device" that allows us to cover a wide range of critical values, including those that adapt to $m_0$, the number of true null hypotheses. Second, the theoretical validity of the latter is investigated both nonasymptotically and asymptotically. Third, we introduce suitable modifications of this heuristic that provide new methods, overcoming the existing procedures with a proven FDP control.

Preprint, 2014, arXiv

Testing over a continuum of null hypotheses with False Discovery Rate control

We consider statistical hypothesis testing simultaneously over a fairly general, possibly uncountably infinite, set of null hypotheses, under the assumption that a suitable single test (and corresponding $p$-value) is known for each individual hypothesis. We extend to this setting the notion of false discovery rate (FDR) as a measure of type I error. Our main result studies specific procedures based on the observation of the $p$-value process. Control of the FDR at a nominal level is ensured either under arbitrary dependence of $p$-values, or under the assumption that the finite dimensional distributions of the $p$-value process have positive correlations of a specific type (weak PRDS). Both cases generalize existing results established in the finite setting. Its interest is demonstrated in several non-parametric examples: testing the mean/signal in a Gaussian white noise model, testing the intensity of a Poisson process and testing the c.d.f. of i.i.d. random variables.

Preprint, 2013, arXiv

On empirical distribution function of high-dimensional Gaussian vector components with an application to multiple testing

This paper introduces a new framework to study the asymptotical behavior of the empirical distribution function (e.d.f.) of Gaussian vector components, whose correlation matrix $Γ^{(m)}$ is dimension-dependent. Hence, by contrast with the existing literature, the vector is not assumed to be stationary. Rather, we make a "vanishing second order" assumption ensuring that the covariance matrix $Γ^{(m)}$ is not too far from the identity matrix, while the behavior of the e.d.f. is affected by $Γ^{(m)}$ only through the sequence $γ_m=m^{-2} \sum_{i\neq j} Γ_{i,j}^{(m)}$, as $m$ grows to infinity. This result recovers some of the previous results for stationary long-range dependencies while it also applies to various, high-dimensional, non-stationary frameworks, for which the most correlated variables are not necessarily next to each other. Finally, we present an application of this work to the multiple testing problem, which was the initial statistical motivation for developing such a methodology.
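As a small illustration of the quantity $γ_m=m^{-2} \sum_{i\neq j} Γ_{i,j}^{(m)}$ from this abstract, the following Python sketch computes it for a matrix given as a list of rows. The function names `gamma_m` and `equicorrelated_matrix` are hypothetical helpers introduced here, and the equi-correlated matrix is only a convenient test case chosen for the example.

```python
def gamma_m(corr):
    """Compute m^{-2} times the sum of the off-diagonal entries of corr."""
    m = len(corr)
    off_diagonal_sum = sum(
        corr[i][j] for i in range(m) for j in range(m) if i != j
    )
    return off_diagonal_sum / m**2


def equicorrelated_matrix(m, rho):
    """Build an m x m matrix with ones on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]


# With equi-correlation rho, gamma_m equals (m - 1) * rho / m,
# which is close to rho when m is large.
print(gamma_m(equicorrelated_matrix(4, 0.2)))  # approximately 0.15
```

For the identity matrix the function returns 0, matching the idea that $γ_m$ measures how far the off-diagonal part is from independence on average.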

Preprint, 2013, arXiv

On false discovery rate thresholding for classification under sparsity

We study the properties of false discovery rate (FDR) thresholding, viewed as a classification procedure. The "0"-class (null) is assumed to have a known density while the "1"-class (alternative) is obtained from the "0"-class either by translation or by scaling. Furthermore, the "1"-class is assumed to have a small number of elements w.r.t. the "0"-class (sparsity). We focus on densities of the Subbotin family, including Gaussian and Laplace models. Nonasymptotic oracle inequalities are derived for the excess risk of FDR thresholding. These inequalities lead to explicit rates of convergence of the excess risk to zero, as the number m of items to be classified tends to infinity and in a regime where the power of the Bayes rule is away from 0 and 1. Moreover, these theoretical investigations suggest an explicit choice for the target level $α_m$ of FDR thresholding, as a function of m. Our oracle inequalities show theoretically that the resulting FDR thresholding adapts to the unknown sparsity regime contained in the data. This property is illustrated with numerical experiments.

Preprint, 2011, arXiv

On least favorable configurations for step-up-down tests

This paper investigates an open issue related to false discovery rate (FDR) control of step-up-down (SUD) multiple testing procedures. It has been established in earlier literature that for this type of procedure, under some broad conditions, and in an asymptotical sense, the FDR is maximum when the signal strength under the alternative is maximum. In other words, so-called "Dirac uniform configurations" are asymptotically least favorable in this setting. It is known that this property also holds in a non-asymptotical sense (for any finite number of hypotheses), for the two extreme versions of SUD procedures, namely step-up and step-down (with extra conditions for the step-down case). It is therefore very natural to conjecture that this non-asymptotical least favorable configuration property could more generally be true for all "intermediate" forms of SUD procedures. We prove that this is, somewhat surprisingly, not the case. The argument is based on the exact calculations proposed earlier by Roquain and Villers (2011), that we extend here by generalizing Steck's recursion to the case of two populations. Secondly, we quantify the magnitude of this phenomenon by providing a nonasymptotic upper-bound and explicit vanishing rates as a function of the total number of hypotheses.

Preprint, 2011, arXiv

Type I error rate control for testing many hypotheses: a survey with proofs

This paper presents a survey on some recent advances for the type I error rate control in multiple testing methodology. We consider the problem of controlling the $k$-family-wise error rate (kFWER, probability to make $k$ false discoveries or more) and the false discovery proportion (FDP, proportion of false discoveries among the discoveries). The FDP is controlled either via its expectation, which is the so-called false discovery rate (FDR), or via its upper-tail distribution function. We aim at deriving general and unified results together with concise and simple mathematical proofs. Furthermore, while this paper is mainly meant to be a survey paper, some new contributions for controlling the kFWER and the upper-tail distribution function of the FDP are provided. In particular, we derive a new procedure based on the quantiles of the binomial distribution that controls the FDP under independence.

Preprint, 2010, arXiv

Exact calculations for false discovery proportion with application to least favorable configurations

In a context of multiple hypothesis testing, we provide several new exact calculations related to the false discovery proportion (FDP) of step-up and step-down procedures. For step-up procedures, we show that the number of erroneous rejections conditionally on the rejection number is simply a binomial variable, which leads to explicit computations of the c.d.f., the $s$-th moment and the mean of the FDP, the latter corresponding to the false discovery rate (FDR). For step-down procedures, we derive what is to our knowledge the first explicit formula for the FDR valid for any alternative c.d.f. of the $p$-values. We also derive explicit computations of the power for both step-up and step-down procedures. These formulas are "explicit" in the sense that they only involve the parameters of the model and the c.d.f. of the order statistics of i.i.d. uniform variables. The $p$-values are assumed either independent or coming from an equicorrelated multivariate normal model and an additional mixture model for the true/false hypotheses is used. This new approach is used to investigate new results which are of interest in their own right, related to least/most favorable configurations for the FDR and the variance of the FDP.

Preprint, 2010, arXiv

On the false discovery proportion convergence under Gaussian equi-correlation

We study the convergence of the false discovery proportion (FDP) of the Benjamini-Hochberg procedure in the Gaussian equi-correlated model, when the correlation $ρ_m$ converges to zero as the hypothesis number $m$ grows to infinity. By contrast with the standard convergence rate $m^{1/2}$ holding under independence, this study shows that the FDP converges to the false discovery rate (FDR) at rate $\{\min(m,1/ρ_m)\}^{1/2}$ in this equi-correlated model.
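For readers unfamiliar with the objects in this abstract, here is a minimal Python sketch of the standard Benjamini-Hochberg step-up procedure and of the FDP of a rejection set. The function names and the toy $p$-values are illustrative assumptions; the Gaussian equi-correlated model and the convergence result themselves are not simulated here.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return sorted indices rejected by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    # Rank hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is below alpha * rank / m.
        if pvalues[idx] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])


def false_discovery_proportion(rejected, true_nulls):
    """Proportion of rejected indices that are true nulls (0 if nothing is rejected)."""
    if not rejected:
        return 0.0
    return sum(1 for i in rejected if i in true_nulls) / len(rejected)


# Toy p-values and a hypothetical set of true nulls, both made up for illustration.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected)  # [0, 1]
print(false_discovery_proportion(rejected, true_nulls={1, 5}))  # 0.5
```

The FDR is the expectation of this FDP over the randomness of the $p$-values, so a single run such as the one above gives one realization of the FDP, not the FDR.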

Preprint, 2010, arXiv

Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests

We study generalized bootstrap confidence regions for the mean of a random vector whose coordinates have an unknown dependency structure. The random vector is supposed to be either Gaussian or to have a symmetric and bounded distribution. The dimensionality of the vector can possibly be much larger than the number of observations and we focus on a nonasymptotic control of the confidence level, following ideas inspired by recent results in learning theory. We consider two approaches, the first based on a concentration principle (valid for a large class of resampling weights) and the second on a resampled quantile, specifically using Rademacher weights. Several intermediate results established in the approach based on concentration principles are of interest in their own right. We also discuss the question of accuracy when using Monte Carlo approximations of the resampled quantities.

Preprint, 2010, arXiv

Spatial clustering of array CGH features in combination with hierarchical multiple testing

We propose a new approach for clustering DNA features using array CGH data from multiple tumor samples. We distinguish data-collapsing: joining contiguous DNA clones or probes with extremely similar data into regions, from clustering: joining contiguous, correlated regions based on a maximum likelihood principle. The model-based clustering algorithm accounts for the apparent spatial patterns in the data. We evaluate the randomness of the clustering result by a cluster stability score in combination with cross-validation. Moreover, we argue that the clustering really captures spatial genomic dependency by showing that coincidental clustering of independent regions is very unlikely. Using the region and cluster information, we combine testing of these for association with a clinical variable in a hierarchical multiple testing approach. This allows for interpreting the significance of both regions and clusters while controlling the Family-Wise Error Rate simultaneously. We prove that in the context of permutation tests and permutation-invariant clusters it is allowed to perform clustering and testing on the same data set. Our procedures are illustrated on two cancer data sets.
