Source author record

William Fithian

William Fithian appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Methodology, Machine Learning, Applications, math.ST, Statistics Theory, Computation

Catalog footprint

What is connected

13 works

6 topics

4 close collaborators


Preprint, 2020, arXiv

Conditional calibration for false discovery rate control under dependence

We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework we propose a concrete algorithm, the dependence-adjusted Benjamini-Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini-Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence.
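For orientation, the two baseline procedures that the abstract compares against can be sketched in a few lines. This is a minimal sketch of the standard BH and BY step-up rules only, not of the dBH procedure itself; the function name `bh_reject` and its `by_correction` flag are hypothetical names chosen for illustration.

```python
import numpy as np

def bh_reject(pvals, alpha=0.1, by_correction=False):
    """Boolean mask of hypotheses rejected by BH (or BY) at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    # BY runs BH at the reduced level alpha / (1 + 1/2 + ... + 1/m).
    level = alpha / np.sum(1.0 / np.arange(1, m + 1)) if by_correction else alpha
    order = np.argsort(p)
    # Step-up thresholds: the k-th smallest p-value is compared with level * k / m.
    thresholds = level * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank that passes its threshold
        reject[order[:k + 1]] = True
    return reject
```

Because BY is simply BH run at a smaller level, its rejection set is always contained in the BH rejection set, which is the gap in power that a dependence-aware method aims to close.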

Preprint, 2020, arXiv

Optimality of the max test for detecting sparse signals with Gaussian or heavier tail

A fundamental problem in high-dimensional testing is that of global null testing: testing whether the null holds simultaneously in all of $n$ hypotheses. The max test, which uses the smallest of the $n$ marginal p-values as its test statistic, enjoys widespread popularity for its simplicity and robustness. However, its theoretical performance relative to other tests has been called into question. In the Gaussian sequence version of the global testing problem, Donoho and Jin (2004) discovered a so-called "weak, sparse" asymptotic regime in which the higher criticism and Berk-Jones tests achieve a better detection boundary than the max test when all of the nonzero signal strengths are identical. We study a more general model in which the non-null means are drawn from a generic distribution, and show that the detection boundary for the max test is optimal in the "weak, sparse" regime, provided that the distribution's tail is no lighter than Gaussian. Further, we show theoretically and in simulation that the modified higher criticism of Donoho and Jin (2004) can have very low power when the distribution of non-null means has a polynomial tail.
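The max test described above can be sketched as follows. This is an illustrative sketch assuming independent, uniformly distributed null p-values, in which case the smallest of $n$ p-values has distribution function $1-(1-t)^n$; the function name `max_test` and the returned dictionary keys are hypothetical.

```python
import numpy as np

def max_test(pvals, alpha=0.05):
    """Global null test based on the smallest marginal p-value."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    p_min = p.min()
    # Bonferroni bound: valid under arbitrary dependence.
    bonferroni_p = min(1.0, n * p_min)
    # Exact global p-value if the n null p-values are independent uniforms.
    sidak_p = 1.0 - (1.0 - p_min) ** n
    return {
        "p_min": p_min,
        "bonferroni_p": bonferroni_p,
        "sidak_p": sidak_p,
        "reject": bonferroni_p <= alpha,
    }
```

The two global p-values are nearly identical when the smallest p-value is small, so the choice between them matters little in the sparse-signal setting the abstract considers.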

Preprint, 2020, arXiv

STAR: A general interactive framework for FDR control under structural constraints

We propose a general framework based on selectively traversed accumulation rules (STAR) for interactive multiple testing with generic structural constraints on the rejection set. It combines accumulation tests from ordered multiple testing with data-carving ideas from post-selection inference, allowing for highly flexible adaptation to generic structural information. Our procedure defines an interactive protocol for gradually pruning a candidate rejection set, beginning with the set of all hypotheses and shrinking with each step. By restricting the information at each step via a technique we call masking, our protocol enables interaction while controlling the false discovery rate (FDR) in finite samples for any data-adaptive update rule that the analyst may choose. We suggest update rules for a variety of applications with complex structural constraints, show that STAR performs well for problems ranging from convex region detection to FDR control on directed acyclic graphs, and show how to extend it to regression problems where knockoff statistics are available in lieu of $p$-values.

Preprint, 2020, arXiv

Statistical Methods for Replicability Assessment

Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on a definition of "replicability" or to disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment, and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Our analyses suggest that the RP:P dataset is largely consistent with publication bias due to selection of significant effects. The methods in this paper make no distributional assumptions about the true effect sizes.

Preprint, 2016, arXiv

Data Augmentation via Levy Processes

If a document is about travel, we may expect that short snippets of the document should also be about travel. We introduce a general framework for incorporating these types of invariances into a discriminative classifier. The framework imagines data as being drawn from a slice of a Levy process. If we slice the Levy process at an earlier point in time, we obtain additional pseudo-examples, which can be used to train the classifier. We show that this scheme has two desirable properties: it preserves the Bayes decision boundary, and it is equivalent to fitting a generative model in the limit where we rewind time back to 0. Our construction captures popular schemes such as Gaussian feature noising and dropout training, as well as admitting new generalizations.
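One concrete instance of slicing a process at an earlier time can be sketched for count data. The sketch assumes, purely for illustration, that each feature is a Poisson process observed at time 1; conditional on a count of $n$ at time 1, the count at time $t \in (0, 1]$ is then Binomial$(n, t)$, so pseudo-examples can be drawn by binomial thinning. The function names `thin_counts` and `augment` are hypothetical.

```python
import numpy as np

def thin_counts(counts, t, rng):
    """Sample counts at time t given Poisson-process counts at time 1 (binomial thinning)."""
    return rng.binomial(np.asarray(counts), t)

def augment(X, y, t=0.5, copies=2, seed=0):
    """Stack the original count matrix with `copies` thinned pseudo-examples per row."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    # Each pseudo-example keeps the label of the document it was thinned from.
    X_aug = [X] + [thin_counts(X, t, rng) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)
```

For word-count features this corresponds to keeping each word occurrence independently with probability `t`, which mirrors the "short snippets" intuition in the abstract.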

Preprint, 2016, arXiv

Multiple Correspondence Analysis & the Multilogit Bilinear Model

Multiple Correspondence Analysis (MCA) is a dimension reduction method that plays a large role in the analysis of tables with categorical nominal variables, such as survey data. Though it is usually motivated and derived using geometric considerations, we prove that it amounts to a single proximal Newton step of a natural bilinear exponential family model for categorical data, the multinomial logit bilinear model. We compare and contrast the behavior of MCA with that of the model on simulations and discuss new insights on the properties of both exploratory multivariate methods and their cognate models. One main conclusion is that we can recommend approximating the multilogit model parameters using MCA. Indeed, estimating the parameters of the model is not a trivial task, whereas MCA has the great advantage of being easily solved by singular value decomposition and scalable to large data.

Preprint, 2016, arXiv

Power of Ordered Hypothesis Testing

Ordered testing procedures are multiple testing procedures that exploit a pre-specified ordering of the null hypotheses, from most to least promising. We analyze and compare the power of several recent proposals using the asymptotic framework of Li & Barber (2015). While accumulation tests including ForwardStop can be quite powerful when the ordering is very informative, they are asymptotically powerless when the ordering is weaker. By contrast, Selective SeqStep, proposed by Barber & Candès (2015), is much less sensitive to the quality of the ordering. We compare the power of these procedures in different regimes, concluding that Selective SeqStep dominates accumulation tests if either the ordering is weak or non-null hypotheses are sparse or weak. Motivated by our asymptotic analysis, we derive an improved version of Selective SeqStep which we call Adaptive SeqStep, analogous to Storey's improvement on the Benjamini-Hochberg procedure. We compare these methods using the GEOQuery data set analyzed by Li & Barber (2015) and find Adaptive SeqStep has favorable performance for both good and bad prior orderings.
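The two families of stopping rules contrasted above can be sketched as follows. This is a hedged sketch of the commonly stated forms of ForwardStop and Selective SeqStep, written from the general literature rather than from this abstract; the function names, the default cutoff `c=0.5`, and the return conventions are assumptions for illustration. Adaptive SeqStep is not sketched.

```python
import numpy as np

def forward_stop(pvals, alpha=0.1):
    """ForwardStop: number of leading hypotheses rejected (p-values in the prior order)."""
    p = np.asarray(pvals, dtype=float)
    # Running mean of -log(1 - p_i) serves as the FDP estimate for the first k hypotheses.
    fdp_hat = np.cumsum(-np.log1p(-p)) / np.arange(1, p.size + 1)
    ok = np.nonzero(fdp_hat <= alpha)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1

def selective_seqstep(pvals, alpha=0.1, c=0.5):
    """Selective SeqStep: boolean rejection mask for p-values given in the prior order."""
    p = np.asarray(pvals, dtype=float)
    large = np.cumsum(p > c)   # p-values above the cutoff among the first k
    small = np.cumsum(p <= c)  # candidate rejections among the first k
    fdp_hat = (c / (1.0 - c)) * (1 + large) / np.maximum(small, 1)
    ok = np.nonzero(fdp_hat <= alpha)[0]
    k = 0 if ok.size == 0 else int(ok[-1]) + 1
    reject = np.zeros(p.size, dtype=bool)
    # Only hypotheses before the stopping point with p-value at most c are rejected.
    reject[:k] = p[:k] <= c
    return reject
```

ForwardStop rejects every hypothesis before its stopping point, whereas Selective SeqStep rejects only those with small p-values, which is one way to see why it is less sensitive to a poor ordering.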

Preprint, 2015, arXiv

Selective Sequential Model Selection

Many model selection algorithms produce a path of fits specifying a sequence of increasingly complex models. Given such a sequence and the data used to produce them, we consider the problem of choosing the least complex model that is not falsified by the data. Extending the selected-model tests of Fithian et al. (2014), we construct p-values for each step in the path which account for the adaptive selection of the model path using the data. In the case of linear regression, we propose two specific tests, the max-t test for forward stepwise regression (generalizing a proposal of Buja and Brown (2014)), and the next-entry test for the lasso. These tests improve on the power of the saturated-model test of Tibshirani et al. (2014), sometimes dramatically. In addition, our framework extends beyond linear regression to a much more general class of parametric and nonparametric model selection problems. To select a model, we can feed our single-step p-values as inputs into sequential stopping rules such as those proposed by G'Sell et al. (2013) and Li and Barber (2015), achieving control of the familywise error rate or false discovery rate (FDR) as desired. The FDR-controlling rules require the null p-values to be independent of each other and of the non-null p-values, a condition not satisfied by the saturated-model p-values of Tibshirani et al. (2014). We derive intuitive and general sufficient conditions for independence, and show that our proposed constructions yield independent p-values.

Preprint, 2014, arXiv

Altitude Training: Strong Bounds for Single-Layer Dropout

Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions.

Preprint, 2014, arXiv

Bias Correction in Species Distribution Models: Pooling Survey and Collection Data for Multiple Species

Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We propose a probabilistic model to allow for joint analysis of presence-only and survey data to exploit their complementary strengths. Our method pools presence-only and presence-absence data for many species and maximizes a joint likelihood, simultaneously estimating and adjusting for the sampling bias affecting the presence-only data. By assuming that the sampling bias is the same for all species, we can borrow strength across species to efficiently estimate the bias and improve our inference from presence-only data. We evaluate our model's performance on data for 36 eucalypt species in southeastern Australia. We find that presence-only records exhibit a strong sampling bias toward the coast and toward Sydney, the largest city. Our data-pooling technique substantially improves the out-of-sample predictive performance of our model when the amount of available presence-absence data for a given species is scarce. If we have only presence-only data and no presence-absence data for a given species, but both types of data for several other species that suffer from the same spatial sampling bias, then our method can obtain an unbiased estimate of the first species' geographic range.

Preprint, 2014, arXiv

Finite-sample equivalence in statistical models for presence-only data

Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data allow estimation only of the relative, not the absolute, intensity of species occurrence. All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give the exact same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified, as it practically always is, logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose "infinitely weighted logistic regression," which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.

Preprint, 2014, arXiv

Local case-control sampling: Efficient subsampling in imbalanced data sets

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $θ^*$. By contrast, our estimator is consistent for $θ^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
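The accept-reject scheme and the post-hoc adjustment can be sketched as below. The abstract does not spell out the formulas, so this sketch rests on two labeled assumptions: each point is accepted with probability $|y - \tilde{p}(x)|$, where $\tilde{p}$ is the pilot model's fitted probability, and the adjustment consists of adding the pilot coefficients back to the subsample fit. The helper `fit_logistic` is a plain Newton-Raphson fit written for this example, and all function names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic regression; X should already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)
        # A tiny ridge term keeps the Hessian numerically invertible.
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: subsample, fit on the subsample, adjust by the pilot."""
    p_pilot = 1.0 / (1.0 + np.exp(-X @ theta_pilot))
    # Assumed acceptance rule: keep points the pilot finds surprising.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(X.shape[0]) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed adjustment: the subsample log-odds are shifted by the pilot's log-odds.
    return theta_sub + theta_pilot, keep
```

Under this acceptance rule the log-odds within the subsample equal the full-data log-odds minus the pilot log-odds, which is why adding `theta_pilot` back recovers an estimate on the original scale. The acceptance pass touches each point once and independently, consistent with the single parallelizable scan mentioned above.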

Preprint, 2014, arXiv

Semiparametric Exponential Families for Heavy-Tailed Data

We propose a semiparametric method for fitting the tail of a heavy-tailed population given a relatively small sample from that population and a larger sample from a related background population. We model the tail of the small sample as an exponential tilt of the better-observed large-sample tail, using a robust sufficient statistic motivated by extreme value theory. In particular, our method induces an estimator of the small-population mean, and we give theoretical and empirical evidence that this estimator outperforms methods that do not use the background sample. We demonstrate substantial efficiency gains over competing methods in simulation and on data from a large controlled experiment conducted by Facebook.
