Source author record

Robert Tibshirani

Robert Tibshirani appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Methodology · Machine Learning · Applications · math.ST · Statistics Theory · Quantitative Methods · Artificial Intelligence · Computation

Catalog footprint

39 works, 8 topics, 4 close collaborators

preprint · 2026 · arXiv

Univariate-Guided Interaction Modeling

We propose a procedure for sparse regression with pairwise interactions, by generalizing the Univariate Guided Sparse Regression (UniLasso) methodology. A central contribution is our introduction of a concept of univariate (or marginal) interactions. Using this concept, we propose two algorithms, uniPairs and uniPairs-2stage, and evaluate their performance against established methods, including Glinternet and Sprinter. We show that our framework yields sparser models with more interpretable interactions. We also prove support recovery results for our proposal under suitable conditions.

preprint · 2023 · arXiv

Weakest link pruning of a dendrogram

Hierarchical clustering is a popular method for identifying distinct groups in a dataset. The most commonly used method for pruning a dendrogram is via a single horizontal cut. In this paper, we propose a new technique "weakest link optimal pruning". We prove its superiority over horizontal pruning and provide some examples illustrating how the two methods can behave quite differently.

preprint · 2022 · arXiv

Confidence Intervals for the Generalisation Error of Random Forests

Out-of-bag error is commonly used as an estimate of generalisation error in ensemble-based learning models such as random forests. We present confidence intervals for this quantity using the delta-method-after-bootstrap and the jackknife-after-bootstrap techniques. These methods do not require growing any additional trees. We show that these new confidence intervals have improved coverage properties over the naive confidence interval, in real and simulated examples.

preprint · 2022 · arXiv

FastCPH: Efficient Survival Analysis for Neural Networks

The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates -- it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches.

preprint · 2020 · arXiv

Assessment of Heterogeneous Treatment Effect Estimation Accuracy via Matching

We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach on synthetic data and data generated from a real dataset.

preprint · 2020 · arXiv

Feature-weighted elastic net: using "features of features" for better prediction

In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.

preprint · 2020 · arXiv

Reluctant generalized additive modeling

Sparse generalized additive models (GAMs) are an extension of sparse generalized linear models which allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modeling (Yu et al. 2019), we propose a multi-stage algorithm, called reluctant generalized additive modeling (RGAM), that can fit sparse generalized additive models at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.

preprint · 2016 · arXiv

Customized training with an application to mass spectrometric imaging of cancer tissue

We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal - customized training - clusters the data to find training points close to each test point and then fits an $\ell_1$-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of $k$-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.

preprint · 2016 · arXiv

High-dimensional regression adjustments in randomized experiments

We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information, and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation, and flexible non-parametric regression adjustments with machine learning methods such as random forests or neural networks.

preprint · 2016 · arXiv

Post-selection inference for L1-penalized likelihood models

We present a new method for post-selection inference for L1 (lasso)-penalized likelihood models, including generalized regression models. Our approach generalizes the post-selection framework presented in Lee et al (2014). The method provides p-values and confidence intervals that are asymptotically valid, conditional on the inherent selection done by the lasso. We present applications of this work to (regularized) logistic regression, Cox's proportional hazards model and the graphical lasso.

preprint · 2016 · arXiv

Regularization for supervised learning via the "hubNet" procedure

We propose a new method for supervised learning. The hubNet procedure fits a hub-based graphical model to the predictors, to estimate the amount of "connection" that each predictor has with other predictors. This yields a set of predictor weights that are then used in a regularized regression such as the lasso or elastic net. The resulting procedure is easy to implement, can sometimes yield higher prediction accuracy than the lasso, and can give insights into the underlying structure of the predictors. HubNet can also be generalized seamlessly to other supervised problems such as regularized logistic regression (and other GLMs), Cox's proportional hazards model, and nonlinear procedures such as random forests and boosting. We prove some recovery results under a specialized model and illustrate the method on real and simulated data.

preprint · 2015 · arXiv

A general framework for estimation and inference from clusters of features

Applied statistical problems often come with pre-specified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Classical tests for the presence of such signals rely either on tests for the omission of the entire block of variables (the classical F-test) or on the creation of an unsupervised prototype for the group (either a group centroid or first principal component) and subsequent t-tests on these prototypes. In this paper, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes, with reference to the response, hopefully improving on the unsupervised prototypes, and then test with likelihood ratio statistics incorporating only these prototypes. We propose a (potentially) novel model, called the "prototype model", which naturally models the two-step prototype-then-test procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction. Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than do classical approaches.

preprint · 2015 · arXiv

A Selective Approach to Internal Inference

A common goal in modern biostatistics is to form a biomarker signature from high dimensional gene expression data that is predictive of some outcome of interest. After learning this biomarker signature, an important question to answer is how well it predicts the response compared to classical predictors. This is challenging, because the biomarker signature is an internal predictor -- one that has been learned using the same dataset on which we want to evaluate its significance. We propose a new method for approaching this problem based on the technique of selective inference. Simulations show that our method is able to properly control the level of the test, and that in certain settings we have more power than sample splitting.

preprint · 2015 · arXiv

Convex hierarchical testing of interactions

We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. We show - both in simulation and on a genomic data set from the SAPPHIRe study - a potential gain in power and interpretability over a standard (nonhierarchical) interaction test.

preprint · 2015 · arXiv

Exact Post-Selection Inference for Sequential Regression Procedures

We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package "selectiveInference", freely available on the CRAN repository, implements the new inference tools described in this paper.
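To make the polyhedral conditioning idea concrete, here is a minimal sketch of a truncated-Gaussian p-value for a linear contrast eta'y given a selection event of the form {A y <= b}. It assumes independent Gaussian noise with known standard deviation `sigma`, and it follows the standard truncation-limit construction from the post-selection inference literature as best I can reconstruct it. The function names are hypothetical and are not the API of the selectiveInference package.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncation_limits(A, b, y, eta):
    """Observed eta'y and the interval it is confined to by {A y <= b}.

    Assumes Cov(y) = sigma^2 * I, so sigma cancels out of the limits.
    """
    eta_sq = sum(e * e for e in eta)
    c = [e / eta_sq for e in eta]                  # direction in which eta'y moves y
    t = sum(e * yi for e, yi in zip(eta, y))       # observed statistic eta'y
    z = [yi - ci * t for yi, ci in zip(y, c)]      # part of y independent of eta'y
    v_lo, v_hi = -math.inf, math.inf
    for row, bj in zip(A, b):
        Ac = sum(a * ci for a, ci in zip(row, c))
        Az = sum(a * zi for a, zi in zip(row, z))
        if Ac < 0:
            v_lo = max(v_lo, (bj - Az) / Ac)       # constraint gives a lower limit
        elif Ac > 0:
            v_hi = min(v_hi, (bj - Az) / Ac)       # constraint gives an upper limit
    return t, v_lo, v_hi


def selective_p_value(A, b, y, eta, sigma=1.0):
    """One-sided p-value for H0: eta'mu = 0, conditional on {A y <= b}."""
    t, v_lo, v_hi = truncation_limits(A, b, y, eta)
    sd = sigma * math.sqrt(sum(e * e for e in eta))
    lo, hi, obs = norm_cdf(v_lo / sd), norm_cdf(v_hi / sd), norm_cdf(t / sd)
    return (hi - obs) / (hi - lo)


# Toy selection event: coordinate 1 was chosen because y1 >= |y2|,
# i.e. -y1 + y2 <= 0 and -y1 - y2 <= 0.
A = [[-1.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0]
y = [2.0, 0.5]
eta = [1.0, 0.0]

t_obs, v_lo, v_hi = truncation_limits(A, b, y, eta)
p_sel = selective_p_value(A, b, y, eta)
p_naive = 1.0 - norm_cdf(t_obs)
print(t_obs, v_lo, v_hi, p_sel, p_naive)
```

In this toy case the selective p-value (about 0.074) is noticeably larger than the naive one (about 0.023), which is the price of having chosen the coordinate by looking at the data. The erf-based CDF loses precision far in the tails, so a real implementation would need more careful numerics.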

preprint · 2015 · arXiv

Post-selection point and interval estimation of signal sizes in Gaussian samples

We tackle the problem of the estimation of a vector of means from a single vector-valued observation $y$. Whereas previous work reduces the size of the estimates for the largest (absolute) sample elements via shrinkage (like James-Stein) or biases estimated via empirical Bayes methodology, we take a novel approach. We adapt recent developments by Lee et al (2013) in post selection inference for the Lasso to the orthogonal setting, where sample elements have different underlying signal sizes. This is exactly the setup encountered when estimating many means. It is shown that other selection procedures, like selecting the $K$ largest (absolute) sample elements and the Benjamini-Hochberg procedure, can be cast into their framework, allowing us to leverage their results. Point and interval estimates for signal sizes are proposed. These seem to perform quite well against competitors, both recent and more tenured. Furthermore, we prove an upper bound to the worst case risk of our estimator, when combined with the Benjamini-Hochberg procedure, and show that it is within a constant multiple of the minimax risk over a rich set of parameter spaces meant to evoke sparsity.

preprint · 2015 · arXiv

Selecting the number of principal components: estimation of the true rank of a noisy matrix

Principal component analysis (PCA) is a well-known tool in multivariate statistics. One significant challenge in using PCA is the choice of the number of components. In order to address this challenge, we propose an exact distribution-based method for hypothesis testing and construction of confidence intervals for signals in a noisy matrix. Assuming Gaussian noise, we use the conditional distribution of the singular values of a Wishart matrix and derive exact hypothesis tests and confidence intervals for the true signals. Our paper is based on the approach of Taylor, Loftus and Tibshirani (2013) for testing the global null: we generalize it to test for any number of principal components, and derive an integrated version with greater power. In simulation studies we find that our proposed methods compare well to existing approaches.

preprint · 2015 · arXiv

Selective Sequential Model Selection

Many model selection algorithms produce a path of fits specifying a sequence of increasingly complex models. Given such a sequence and the data used to produce them, we consider the problem of choosing the least complex model that is not falsified by the data. Extending the selected-model tests of Fithian et al. (2014), we construct p-values for each step in the path which account for the adaptive selection of the model path using the data. In the case of linear regression, we propose two specific tests, the max-t test for forward stepwise regression (generalizing a proposal of Buja and Brown (2014)), and the next-entry test for the lasso. These tests improve on the power of the saturated-model test of Tibshirani et al. (2014), sometimes dramatically. In addition, our framework extends beyond linear regression to a much more general class of parametric and nonparametric model selection problems. To select a model, we can feed our single-step p-values as inputs into sequential stopping rules such as those proposed by G'Sell et al. (2013) and Li and Barber (2015), achieving control of the familywise error rate or false discovery rate (FDR) as desired. The FDR-controlling rules require the null p-values to be independent of each other and of the non-null p-values, a condition not satisfied by the saturated-model p-values of Tibshirani et al. (2014). We derive intuitive and general sufficient conditions for independence, and show that our proposed constructions yield independent p-values.

preprint · 2015 · arXiv

Sequential Selection Procedures and False Discovery Rate Control

We consider a multiple hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block, $H_1,\dots,H_k$, of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point $k$. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures, and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection using recent results on p-values in sequential model selection settings.
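The abstract does not spell out the two procedures. As an illustration of what a stopping rule on ordered p-values can look like, the sketch below assumes the rule usually cited as ForwardStop: reject the first k hypotheses, where k is the largest index at which the running mean of -log(1 - p_i) is at most alpha. The function name and the example p-values are hypothetical.

```python
import math


def forward_stop(p_values, alpha=0.1):
    """Largest k with (1/k) * sum_{i<=k} -log(1 - p_i) <= alpha, or 0 if none."""
    k_hat, running = 0, 0.0
    for k, p in enumerate(p_values, start=1):
        p = min(p, 1.0 - 1e-12)           # guard against log(0) when p == 1
        running += -math.log1p(-p)        # -log(1 - p), accurate for small p
        if running / k <= alpha:
            k_hat = k                     # keep the last index that passes
    return k_hat


# Ordered p-values, e.g. one per step of a model selection path.
path_p_values = [0.01, 0.02, 0.03, 0.5, 0.6, 0.9]
k_selected = forward_stop(path_p_values, alpha=0.1)
print(k_selected)
```

With these inputs the running mean stays below 0.1 for the first three steps and exceeds it afterwards, so the rule stops at k = 3. Because the rule takes the largest passing index, the loop keeps scanning after a failure rather than breaking early.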

preprint · 2015 · arXiv

Sparse regression and marginal testing using cluster prototypes

We propose a new approach for sparse regression and marginal testing, for data with correlated features. Our procedure first clusters the features, and then chooses as the cluster prototype the most informative feature in that cluster. Then we apply either sparse regression (lasso) or marginal significance testing to these prototypes. While this kind of strategy is not entirely new, a key feature of our proposal is its use of the post-selection inference theory of Taylor et al. (2014) and Lee et al. (2014) to compute exact p-values and confidence intervals that properly account for the selection of prototypes. We also apply the recent "knockoff" idea of Barber and Candès to provide exact finite sample control of the FDR of our regression procedure. We illustrate our proposals on both real and simulated data.

preprint · 2014 · arXiv

A significance test for the lasso

In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an $\operatorname{Exp}(1)$ asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix $X$. On the other hand, our proof for a general step in the lasso path places further technical assumptions on $X$ and the generative model, but still allows for the important high-dimensional case $p>n$, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter $\lambda$ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties - adaptivity and shrinkage - and its null distribution is tractable and asymptotically $\operatorname{Exp}(1)$.
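The claim that adaptivity inflates the drop in RSS can be checked with a small simulation. The sketch below assumes the simplest possible setting, which is my choice and not the paper's: an orthonormal design, unit noise variance, and the global null. In that setting the drop in RSS from adding predictor j to the empty model is the square of a standard normal score, so a fixed predictor gives a chi-squared(1) drop while the greedily chosen predictor gives the maximum of p such drops.

```python
import random


def mean_rss_drop(p, n_sims=2000, seed=0):
    """Average drop in RSS for a fixed predictor vs. the greedily chosen one.

    Assumes orthonormal predictors, sigma = 1 and no true signal, so each
    predictor's drop in RSS is the square of an independent N(0, 1) draw.
    """
    rng = random.Random(seed)
    fixed_total, greedy_total = 0.0, 0.0
    for _ in range(n_sims):
        drops = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
        fixed_total += drops[0]       # predictor chosen before seeing the data
        greedy_total += max(drops)    # predictor chosen because it looks best
    return fixed_total / n_sims, greedy_total / n_sims


fixed_mean, greedy_mean = mean_rss_drop(p=50)
print(fixed_mean, greedy_mean)
```

The fixed-predictor average should sit near 1, the mean of a chi-squared(1) variable, while the greedy average is several times larger. Comparing the greedy drop to chi-squared(1) quantiles would therefore reject far too often under the null.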

preprint · 2014 · arXiv

A Study of Error Variance Estimation in Lasso Regression

Variance estimation in the linear model when $p > n$ is a difficult problem. Standard least squares estimation techniques do not apply. Several variance estimators have been proposed in the literature, all with accompanying asymptotic results proving consistency and asymptotic normality under a variety of assumptions. It is found, however, that most of these estimators suffer large biases in finite samples when true underlying signals become less sparse with larger per-element signal strength. One estimator seems to be largely neglected in the literature: a residual sum of squares based estimator using Lasso coefficients with regularisation parameter selected adaptively (via cross-validation). In this paper, we review several variance estimators and perform a reasonably extensive simulation study in an attempt to compare their finite sample performance. It would seem from the results that variance estimators with adaptively chosen regularisation parameters perform admirably over a broad range of sparsity and signal strength settings. Finally, some initial theoretical analyses pertaining to these types of estimators are proposed and developed.

preprint · 2014 · arXiv

Collaborative Regression

We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem, and illustrate their use on simulated and real data.

preprint · 2014 · arXiv

Comment on "Detecting Novel Associations In Large Data Sets" by Reshef Et Al, Science Dec 16, 2011

The proposal of Reshef et al. (2011) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations. To investigate this, we ran simulations to compare the power of MIC to that of standard Pearson correlation and distance correlation (dcor). We simulated pairs of variables with different relationships (most of which were considered by Reshef et al.), but with varying levels of noise added. To determine proper cutoffs for testing the independence hypothesis, we simulated independent data with the appropriate marginals. As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome.

preprint · 2014 · arXiv

Regularisation Paths for Conditional Logistic Regression: the clogitL1 package

We apply the cyclic coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso ($\ell_1$) and elastic net penalties. The sequential strong rules of Tibshirani et al (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularisation paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.
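For readers unfamiliar with cyclic coordinate descent, the sketch below shows the core update in the simplest case. It assumes a squared-error lasso with objective (1/2n)||y - X beta||^2 + lam * ||beta||_1, not the conditional logistic likelihood treated in the paper, and it omits strong rules and the elastic net term. The function names are hypothetical and are not the clogitL1 interface. The optional `beta_init` argument is how a warm start from a neighbouring value of the penalty could be passed in.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning 0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200, beta_init=None):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = list(beta_init) if beta_init is not None else [0.0] * p
    # Residual y - X beta for the starting coefficients.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of x_j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta   # keep the residual in sync
                beta[j] = new_beta
    return beta


# Tiny orthogonal design: least squares would give coefficients (3, 0.2).
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [3.0, 0.2, -3.0, -0.2]
beta_hat = lasso_cd(X, y, lam=0.25)
print(beta_hat)
```

On this toy design the large coefficient is shrunk from 3 to 2.5 and the small one is set exactly to zero, which is the soft-thresholding behaviour that gives the lasso its sparsity.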

preprint · 2014 · arXiv

Rejoinder: "A significance test for the lasso"

Rejoinder of "A significance test for the lasso" by Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].

preprint · 2013 · arXiv

A lasso for hierarchical interactions

We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. We distinguish between parameter sparsity - the number of nonzero coefficients - and practical sparsity - the number of raw variables one must measure to make a new prediction. Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.
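The distinction between parameter sparsity and practical sparsity is easy to state in code. The sketch below assumes a fitted model is given as two plain dictionaries, one of main-effect coefficients keyed by variable name and one of interaction coefficients keyed by variable pairs. This representation and the function names are hypothetical and are not the hierNet interface.

```python
def parameter_sparsity(main, inter):
    """Number of nonzero coefficients (main effects plus interactions)."""
    nonzero_main = sum(1 for v in main.values() if v != 0)
    nonzero_inter = sum(1 for v in inter.values() if v != 0)
    return nonzero_main + nonzero_inter


def practical_sparsity(main, inter):
    """Number of raw variables that must be measured to make a prediction."""
    used = {j for j, v in main.items() if v != 0}
    for (j, k), v in inter.items():
        if v != 0:
            used.update((j, k))
    return len(used)


def satisfies_hierarchy(main, inter, require_both=False):
    """Check that each nonzero interaction has one (or both) main effects nonzero."""
    for (j, k), v in inter.items():
        if v == 0:
            continue
        has_j = main.get(j, 0) != 0
        has_k = main.get(k, 0) != 0
        ok = (has_j and has_k) if require_both else (has_j or has_k)
        if not ok:
            return False
    return True


# Two models with the same number of nonzero coefficients.
hier_main = {"x1": 1.2, "x2": -0.7, "x3": 0.0}
hier_inter = {("x1", "x2"): 0.4, ("x1", "x3"): 0.0}

free_main = {"x1": 1.2}
free_inter = {("x3", "x4"): 0.5, ("x5", "x6"): 0.3}

print(parameter_sparsity(hier_main, hier_inter), practical_sparsity(hier_main, hier_inter))
print(parameter_sparsity(free_main, free_inter), practical_sparsity(free_main, free_inter))
```

Both toy models have three nonzero coefficients, but the hierarchical one needs only two raw variables measured while the unconstrained one needs five, which is the cost argument the abstract makes for hierarchy.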

preprint · 2013 · arXiv

Adaptive testing for the graphical lasso

We consider tests of significance in the setting of the graphical lasso for inverse covariance matrix estimation. We propose a simple test statistic based on a subsequence of the knots in the graphical lasso path. We show that this statistic has an exponential asymptotic null distribution, under the null hypothesis that the model contains the true connected components. Though the null distribution is asymptotic, we show through simulation that it provides a close approximation to the true distribution at reasonable sample sizes. Thus the test provides a simple, tractable test for the significance of new edges as they are introduced into the model. Finally, we show connections between our results and other results for regularized regression, as well as extensions of our results to other correlation matrix based methods like single-linkage clustering.

preprint · 2013 · arXiv

An Investigation of Methods for Handling Missing Data with Penalized Regression

We investigate methods for penalized regression in the presence of missing observations. This paper introduces a method for estimating the parameters which compensates for the missing observations. We first derive an unbiased estimator of the objective function with respect to the missing data and then modify the criterion to ensure convexity. Finally, we extend our approach to a family of models that embraces the mean imputation method. These approaches are compared to the mean imputation method, one of the simplest methods for dealing with the missing observations problem, via simulations. We also investigate the problem of making predictions when there are missing values in the test set.

preprint · 2013 · arXiv

False Variable Selection Rates in Regression

There has been recent interest in extending the ideas of False Discovery Rates (FDR) to variable selection in regression settings. Traditionally the FDR in these settings has been defined in terms of the coefficients of the full regression model. Recent papers have struggled with controlling this quantity when the predictors are correlated. This paper shows that this full model definition of FDR suffers from unintuitive and potentially undesirable behavior in the presence of correlated predictors. We propose a new false selection error criterion, the False Variable Rate (FVR), that avoids these problems and behaves in a more intuitive manner. We discuss the behavior of this criterion and how it compares with the traditional FDR, as well as presenting guidelines for determining which is appropriate in a particular setting. Finally, we present a simple estimation procedure for FVR in stepwise variable selection. We analyze the performance of this estimator and draw connections to recent estimators in the literature.

preprint · 2013 · arXiv

Sensitivity Analysis for Inference with Partially Identifiable Covariance Matrices

In some multivariate problems with missing data, pairs of variables exist that are never observed together. For example, some modern biological tools can produce data of this form. As a result of this structure, the covariance matrix is only partially identifiable, and point estimation requires that identifying assumptions be made. These assumptions can introduce an unknown and potentially large bias into the inference. This paper presents a method based on semidefinite programming for automatically quantifying this potential bias by computing the range of possible equal-likelihood inferred values for convex functions of the covariance matrix. We focus on the bias of missing value imputation via conditional expectation and show that our method can give an accurate assessment of the true error in cases where estimates based on sampling uncertainty alone are overly optimistic.

preprint · 2012 · arXiv

A Permutation Approach to Testing Interactions in Many Dimensions

To date, testing interactions in high dimensions has been a challenging task. Existing methods often have issues with sensitivity to modeling assumptions and heavily asymptotic nominal p-values. To help alleviate these issues, we propose a permutation-based method for testing marginal interactions with a binary response. Our method searches for pairwise correlations which differ between classes. In this manuscript, we compare our method on real and simulated data to the standard approach of running many pairwise logistic models. On simulated data our method finds more significant interactions at a lower false discovery rate (especially in the presence of main effects). On real genomic data, although there is no gold standard, our method finds apparent signal and tells a believable story, while logistic regression does not. We also give asymptotic consistency results under not too restrictive assumptions.
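The idea of looking for pairwise correlations that differ between classes, with significance judged by permutation, can be sketched for a single pair of features. The statistic below (absolute difference of within-class Pearson correlations) and the plain label-shuffling scheme are my simplifying assumptions, not necessarily the paper's exact choices, and all names and data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def corr_difference(x1, x2, labels):
    """Absolute difference between the class-1 and class-0 correlations."""
    r = {}
    for cls in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        r[cls] = pearson([x1[i] for i in idx], [x2[i] for i in idx])
    return abs(r[1] - r[0])


def permutation_p_value(x1, x2, labels, n_perm=500, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = corr_difference(x1, x2, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if corr_difference(x1, x2, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # add-one keeps the p-value above zero


# Synthetic pair: positively related in class 0, negatively in class 1.
data_rng = random.Random(1)
labels = [0] * 30 + [1] * 30
x1 = [data_rng.gauss(0.0, 1.0) for _ in labels]
x2 = [(x if lab == 0 else -x) + 0.3 * data_rng.gauss(0.0, 1.0)
      for x, lab in zip(x1, labels)]

p_interaction = permutation_p_value(x1, x2, labels)
print(p_interaction)
```

Shuffling labels tests whether the joint distribution of the pair is the same in both classes, which is broader than a pure interaction null when main effects are present. A full treatment across many pairs would also need a multiplicity correction such as false discovery rate control.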

preprint · 2012 · arXiv

A Simple Method for Detecting Interactions between a Treatment and a Large Number of Covariates

We consider a setting in which we have a treatment and a large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces valid inferences in a variety of settings. It can be useful for personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and gene expression studies of cancer. The modified data can be used for other purposes, for example large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable.
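The modification step can be sketched as follows, under assumptions the abstract does not spell out: the treatment is coded as +1/-1 with roughly balanced assignment, each centered covariate is multiplied by half the treatment indicator, and the "standard model" is taken to be ordinary least squares. The function names are hypothetical, and the efficiency augmentation step is omitted.

```python
import numpy as np

def modified_covariates(X, treatment):
    """Multiply each centered covariate by treatment / 2 (treatment coded +1/-1)."""
    Xc = X - X.mean(axis=0)
    return Xc * (treatment[:, None] / 2.0)

def fit_interactions(X, treatment, y):
    """Least-squares fit of the outcome on the modified covariates only.

    No intercept and no main-effect terms are included, mirroring the
    "no main effects" description; OLS is an assumption of this sketch.
    """
    W = modified_covariates(X, treatment)
    gamma, *_ = np.linalg.lstsq(W, y, rcond=None)
    return gamma
```

Under randomized treatment assignment, the fitted coefficients estimate the treatment-covariate interaction effects even though main effects are not modeled. A penalized fit could replace the least-squares step when the number of covariates is large.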

preprint2012arXiv

Prototype selection for interpretable classification

Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples. We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well.

preprint2010arXiv

Bayesian Gene Set Analysis

Gene expression microarray technologies provide the simultaneous measurements of a large number of genes. Typical analyses of such data focus on the individual genes, but recent work has demonstrated that evaluating changes in expression across predefined sets of genes often increases statistical power and produces more robust results. We introduce a new methodology for identifying gene sets that are differentially expressed under varying experimental conditions. Our approach uses a hierarchical Bayesian framework where a hyperparameter measures the significance of each gene set. Using simulated data, we compare our proposed method to alternative approaches, such as Gene Set Enrichment Analysis (GSEA) and Gene Set Analysis (GSA). Our approach provides the best overall performance. We also discuss the application of our method to experimental data based on p53 mutation status.

preprint2010arXiv

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Often this data is transposable, meaning that both the row variables and column variables are of potential interest. An example of this scenario is detecting significant genes in microarrays when the samples or arrays may be dependent due to underlying relationships. We study the effect of both row and column correlations on commonly used test-statistics, null distributions, and multiple testing procedures, by explicitly modeling the covariances with the matrix-variate normal distribution. Using this model, we give both theoretical and simulation results revealing the problems associated with using standard statistical methodology on transposable data. We solve these problems by estimating the row and column covariances simultaneously, with transposable regularized covariance models, and de-correlating or sphering the data as a pre-processing step. Under reasonable assumptions, our method gives test statistics that follow the scaled theoretical null distribution and are approximately independent. Simulations based on various models with structured and observed covariances from real microarray data reveal that our method offers substantial improvements in two areas: 1) increased statistical power and 2) correct estimation of false discovery rates.

preprint2010arXiv

Strong rules for discarding predictors in lasso-type problems

We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al. (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemented with simple checks of the Karush-Kuhn-Tucker (KKT) conditions to provide safe rules that offer substantial speed and space savings in a variety of statistical convex optimization problems.
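A minimal sketch of the screen-then-check idea, assuming standardized predictors and the lasso objective (1/2)||y - X b||^2 + lam * ||b||_1. The threshold used here is the basic strong rule, which discards predictor j when |x_j^T y| < 2 * lam - lam_max, where lam_max = max_j |x_j^T y|. That threshold is not stated in the abstract above, and the function names are illustrative.

```python
import numpy as np

def basic_strong_rule(X, y, lam):
    """Boolean mask of predictors the basic strong rule would discard at lam."""
    scores = np.abs(X.T @ y)
    lam_max = scores.max()  # smallest lam at which all coefficients are zero
    return scores < 2.0 * lam - lam_max

def kkt_violations(X, y, beta, lam, discarded, tol=1e-8):
    """Discarded predictors whose KKT condition |x_j^T r| <= lam fails.

    beta is a candidate solution fitted on the retained predictors
    (zeros elsewhere); any flagged predictor should be added back and
    the fit repeated.
    """
    resid = y - X @ beta
    grad = np.abs(X.T @ resid)
    return discarded & (grad > lam + tol)
```

Because the rule can occasionally discard a predictor that belongs in the model, the KKT check is what turns the heuristic into a safe procedure: fit on the surviving predictors, test the discarded ones, and refit if any are flagged.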

preprint2010arXiv

Transposable regularized covariance models with an application to missing data imputation

Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and nonsingular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

preprint2007arXiv

"Pre-conditioning" for feature selection and regression in high-dimensional problems

We consider regression problems where the number of predictors greatly exceeds the number of observations. We propose a method for variable selection that first estimates the regression function, yielding a "pre-conditioned" response variable. The primary method used for this initial regression is supervised principal components. Then we apply a standard procedure such as forward stepwise selection or the LASSO to the pre-conditioned response variable. In a number of simulated and real data examples, this two-step procedure outperforms forward stepwise selection or the usual LASSO (applied directly to the raw outcome). We also show that under a certain Gaussian latent variable model, application of the LASSO to the pre-conditioned response variable is consistent as the number of predictors and observations increases. Moreover, when the observational noise is rather large, the suggested procedure can give a more accurate estimate than LASSO. We illustrate our method on some real problems, including survival analysis with microarray data.
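The two-step procedure can be sketched as below. This is a simplified version under stated assumptions: the supervised principal components step is reduced to keeping the `n_keep` features with the largest univariate scores and using only the first principal component, and the second step uses a plain forward stepwise search. The names `preconditioned_response`, `forward_stepwise` and `n_keep` are hypothetical.

```python
import numpy as np

def preconditioned_response(X, y, n_keep=10):
    """Step 1: estimate the regression function with a simplified supervised PCA."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Univariate score: absolute correlation-like inner product per feature.
    scores = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    keep = np.argsort(scores)[-n_keep:]
    # First principal component of the retained features (unit norm, mean zero).
    U, _, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    pc1 = U[:, 0]
    # Fitted values from regressing y on that component.
    return y.mean() + pc1 * (pc1 @ yc)

def forward_stepwise(X, y, n_steps):
    """Step 2: greedy selection, adding the feature that most reduces the RSS."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    selected = []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = Xc[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(cols, yc, rcond=None)
            rss = np.sum((yc - cols @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```

In use, the selection step is run on the pre-conditioned response rather than the raw outcome, for example `forward_stepwise(X, preconditioned_response(X, y), n_steps=3)`. The intent is that the denoised response makes the selection less sensitive to observational noise.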

39 published item(s)

Univariate-Guided Interaction Modeling

Weakest link pruning of a dendrogram

Confidence Intervals for the Generalisation Error of Random Forests

FastCPH: Efficient Survival Analysis for Neural Networks

Assessment of Heterogeneous Treatment Effect Estimation Accuracy via Matching

Feature-weighted elastic net: using "features of features" for better prediction

Reluctant generalized additive modeling

Customized training with an application to mass spectrometric imaging of cancer tissue

High-dimensional regression adjustments in randomized experiments

Post-selection inference for L1-penalized likelihood models

Regularization for supervised learning via the "hubNet" procedure

A general framework for estimation and inference from clusters of features

A Selective Approach to Internal Inference

Convex hierarchical testing of interactions

Exact Post-Selection Inference for Sequential Regression Procedures

Post-selection point and interval estimation of signal sizes in Gaussian samples

Selecting the number of principal components: estimation of the true rank of a noisy matrix

Selective Sequential Model Selection

Sequential Selection Procedures and False Discovery Rate Control

Sparse regression and marginal testing using cluster prototypes

A significance test for the lasso

A Study of Error Variance Estimation in Lasso Regression

Collaborative Regression

Comment on "Detecting Novel Associations In Large Data Sets" by Reshef Et Al, Science Dec 16, 2011

Regularisation Paths for Conditional Logistic Regression: the clogitL1 package

Rejoinder: "A significance test for the lasso"

A lasso for hierarchical interactions

Adaptive testing for the graphical lasso

An Investigation of Methods for Handling Missing Data with Penalized Regression

False Variable Selection Rates in Regression

Sensitivity Analysis for Inference with Partially Identifiable Covariance Matrices

A Permutation Approach to Testing Interactions in Many Dimensions

A Simple Method for Detecting Interactions between a Treatment and a Large Number of Covariates

Prototype selection for interpretable classification

Bayesian Gene Set Analysis

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Strong rules for discarding predictors in lasso-type problems

Transposable regularized covariance models with an application to missing data imputation

"Pre-conditioning" for feature selection and regression in high-dimensional problems