Source author record

Saharon Rosset

Saharon Rosset appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Machine Learning Applications Information Theory math.IT math.ST Populations and Evolution Quantitative Methods Statistics Theory Genomics math.OC Systems and Control

Catalog footprint

What is connected

14works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

We present a general methodology for using unlabeled data to design semi supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze of the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work, can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.

preprint2021arXiv

On the cross-validation bias due to unsupervised pre-processing

Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in an unsupervised manner (that does not incorporate the class labels or response values) are generally safe to do prior to cross-validation. In this paper, we study three commonly-practiced preprocessing procedures prior to a regression analysis: (i) variance-based feature selection; (ii) grouping of rare categorical features; and (iii) feature rescaling. We demonstrate that unsupervised preprocessing can, in fact, introduce a substantial bias into cross-validation estimates and potentially hurt model selection. This bias may be either positive or negative and its exact magnitude depends on all the parameters of the problem in an intricate manner. Further research is needed to understand the real-world impact of this bias across different application domains, particularly when dealing with small sample sizes and high-dimensional data.

preprint2020arXiv

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in {\mathbb R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = Σ^{1/2} z_i$ (with $z_i \in {\mathbb R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = φ(W z_i)$ (with $z_i \in {\mathbb R}^d$, $W \in {\mathbb R}^{p \times d}$ a matrix of i.i.d. entries, and $φ$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.

preprint2016arXiv

Large Alphabet Source Coding using Independent Component Analysis

Large alphabet source coding is a basic and well-studied problem in data compression. It has many applications such as compression of natural language text, speech and images. The classic perception of most commonly used methods is that a source is best described over an alphabet which is at least as large as the observed alphabet. In this work we challenge this approach and introduce a conceptual framework in which a large alphabet source is decomposed into "as statistically independent as possible" components. This decomposition allows us to apply entropy encoding to each component separately, while benefiting from their reduced alphabet size. We show that in many cases, such decomposition results in a sum of marginal entropies which is only slightly greater than the entropy of the source. Our suggested algorithm, based on a generalization of the Binary Independent Component Analysis, is applicable for a variety of large alphabet source coding setups. This includes the classical lossless compression, universal compression and high-dimensional vector quantization. In each of these setups, our suggested approach outperforms most commonly used methods. Moreover, our proposed framework is significantly easier to implement in most of these cases.

preprint2015arXiv

Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Recursive partitioning approaches producing tree-like models are a long standing staple of predictive modeling, in the last decade mostly as ``sub-learners'' within state of the art ensemble methods like Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. Such variables can often be very informative, but current tree methods essentially leave us a choice of either not using them, or exposing our models to severe overfitting. We propose a conceptual framework to splitting using leave-one-out (LOO) cross validation for selecting the splitting variable, then performing a regular split (in our case, following CART's approach) for the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our novel splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not increase the overall computational complexity compared to CART for two-class classification. For regression tasks, our approach carries an increased computational burden, replacing a O(log(n)) factor in CART splitting rule search with an O(n) term.

preprint2015arXiv

Generalized Independent Component Analysis Over Finite Alphabets

Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other.Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem and is considered an open problem. We propose several theorems and show that this NP hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. Our contribution provides the first efficient and constructive set of solutions to Barlow's problem.The minimal redundancy representation (also known as factorial code) has many applications, mainly in the fields of Neural Networks and Deep Learning. The Binary ICA (BICA) is also shown to have applications in several domains including medical diagnosis, multi-cluster assignment, network tomography and internet resource management. In this work we show this formulation further applies to multiple disciplines in source coding such as predictive coding, distributed source coding and coding of large alphabet sources.

preprint2014arXiv

Effective Genetic Risk Prediction Using Mixed Models

To date, efforts to produce high-quality polygenic risk scores from genome-wide studies of common disease have focused on estimating and aggregating the effects of multiple SNPs. Here we propose a novel statistical approach for genetic risk prediction, based on random and mixed effects models. Our approach (termed GeRSI) circumvents the need to estimate the effect sizes of numerous SNPs by treating these effects as random, producing predictions which are consistently superior to current state of the art, as we demonstrate in extensive simulation. When applying GeRSI to seven phenotypes from the WTCCC study, we confirm that the use of random effects is most beneficial for diseases that are known to be highly polygenic: hypertension (HT) and bipolar disorder (BD). For HT, there are no significant associations in the WTCCC data. The best existing model yields an AUC of 54%, while GeRSI improves it to 59%. For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked at the top 10% of BD risk predictions, using GeRSI substantially increases the BD relative risk from 1.4 to 2.5.

preprint2013arXiv

Generalized Alpha Investing: Definitions, Optimality Results, and Application to Public Databases

The increasing prevalence and utility of large, public databases necessitates the development of appropriate methods for controlling false discovery. Motivated by this challenge, we discuss the generic problem of testing a possibly infinite stream of null hypotheses. In this context, Foster and Stine (2008) suggested a novel method named Alpha Investing for controlling a false discovery measure known as mFDR. We develop a more general procedure for controlling mFDR, of which Alpha Investing is a special case. We show that in common, practical situations, the general procedure can be optimized to produce an expected reward optimal (ERO) version, which is more powerful than Alpha Investing. We then present the concept of quality preserving databases (QPD), originally introduced in Aharoni et al. (2011), which formalizes efficient public database management to simultaneously save costs and control false discovery. We show how one variant of generalized alpha investing can be used to control mFDR in a QPD and lead to significant reduction in costs compared to naive approaches for controlling the Family-Wise Error Rate implemented in Aharoni et al. (2011).

preprint2013arXiv

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

One of the major developments in recent years in the search for missing heritability of human phenotypes is the adoption of linear mixed-effects models (LMMs) to estimate heritability due to genetic variants which are not significantly associated with the phenotype. A variant of the LMM approach has been adapted to case-control studies and applied to many major diseases by Lee et al. (2011), successfully accounting for a considerable portion of the missing heritability. For example, for Crohn's disease their estimated heritability was 22% compared to 50-60% from family studies. In this letter we propose to estimate heritability of disease directly by regression of phenotype similarities on genotype correlations, corrected to account for ascertainment. We refer to this method as genetic correlation regression (GCR). Using GCR we estimate the heritability of Crohn's disease at 34% using the same data. We demonstrate through extensive simulation that our method yields unbiased heritability estimates, which are consistently higher than LMM estimates. Moreover, we develop a heuristic correction to LMM estimates, which can be applied to published LMM results. Applying our heuristic correction increases the estimated heritability of multiple sclerosis from 30% to 52.6%.

preprint2013arXiv

When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counter Examples from Lasso and Ridge Regression

Regularization aims to improve prediction performance of a given statistical modeling approach by moving to a second approach which achieves worse training error but is expected to have fewer degrees of freedom, i.e., better agreement between training and prediction error. We show here, however, that this expected behavior does not hold in general. In fact, counter examples are given that show regularization can increase the degrees of freedom in simple situations, including lasso and ridge regression, which are the most common regularization approaches in use. In such situations, the regularization increases both training error and degrees of freedom, and is thus inherently without merit. On the other hand, two important regularization scenarios are described where the expected reduction in degrees of freedom is indeed guaranteed: (a) all symmetric linear smoothers, and (b) linear regression versus convex constrained linear regression (as in the constrained variant of ridge regression and lasso).

preprint2012arXiv

Efficient regularized isotonic regression with application to gene--gene interaction search

Isotonic regression is a nonparametric approach for fitting monotonic models to data that has been widely studied from both theoretical and practical perspectives. However, this approach encounters computational and statistical overfitting issues in higher dimensions. To address both concerns, we present an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic regression based on recursively partitioning the covariate space through solution of progressively smaller "best cut" subproblems. This creates a regularized sequence of isotonic models of increasing model complexity that converges to the global isotonic regression solution. The models along the sequence are often more accurate than the unregularized isotonic regression model because of the complexity control they offer. We quantify this complexity control through estimation of degrees of freedom along the path. Success of the regularized models in prediction and IRPs favorable computational properties are demonstrated through a series of simulated and real data experiments. We discuss application of IRP to the problem of searching for gene--gene interactions and epistasis, and demonstrate it on data from genome-wide association studies of three common diseases.

preprint2012arXiv

Generalized Isotonic Regression

We present a computational and statistical approach for fitting isotonic models under convex differentiable loss functions. We offer a recursive partitioning algorithm which provably and efficiently solves isotonic regression under any such loss function. Models along the partitioning path are also isotonic and can be viewed as regularized solutions to the problem. Our approach generalizes and subsumes two previous results: the well-known work of Barlow and Brunk (1972) on fitting isotonic regressions subject to specially structured loss functions, and a recursive partitioning algorithm (Spouge et al 2003) for the case of standard (l2-loss) isotonic regression. We demonstrate the advantages of our generalized algorithm on both real and simulated data in two settings: fitting count data using negative Poisson log-likelihood loss, and fitting robust isotonic regression using Huber's loss.

preprint2011arXiv

Random lasso

We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates. A measure of importance is yielded from this step for each covariate. In step 2, a similar procedure to the first step is implemented with the exception that for each bootstrap sample, a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates' importance. Adaptive lasso may be used in the second step with weights determined by the importance measures. The final set of covariates and their coefficients are determined by averaging bootstrap results obtained from step 2. The proposed method alleviates some of the limitations of lasso, elastic-net and related methods noted especially in the context of microarray data analysis: it tends to remove highly correlated variables altogether or select them all, and maintains maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive or superior compared to the alternatives. We illustrate the proposed method by extensive simulation studies. The proposed method is also applied to a Glioblastoma microarray data analysis.

preprint2010arXiv

Preliminary Report: Missense mutations in the APOL gene family are associated with end stage kidney disease risk previously attributed to the MYH9 gene

MYH9 has been proposed as a major genetic risk locus for a spectrum of non-diabetic end stage kidney disease (ESKD). We use recently released sequences from the 1000 Genomes Project to identify two western African specific missense mutations (S342G and I384M) in the neighbouring APOL1 gene, and demonstrate that these are more strongly associated with ESKD than previously reported MYH9 variants. We also show that the distribution of these risk variants in African populations is consistent with the pattern of African ancestry ESKD risk previously attributed to the MYH9 gene. Additional associations were also found among other members of the APOL gene family, and we propose that ESKD risk is caused by western African variants in members of the APOL gene family, which evolved to confer protection against pathogens, such as Trypanosoma.

Saharon Rosset

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

On the cross-validation bias due to unsupervised pre-processing

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Large Alphabet Source Coding using Independent Component Analysis

Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Generalized Independent Component Analysis Over Finite Alphabets

Effective Genetic Risk Prediction Using Mixed Models

Generalized Alpha Investing: Definitions, Optimality Results, and Application to Public Databases

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counter Examples from Lasso and Ridge Regression

Efficient regularized isotonic regression with application to gene--gene interaction search

Generalized Isotonic Regression

Random lasso

Preliminary Report: Missense mutations in the APOL gene family are associated with end stage kidney disease risk previously attributed to the MYH9 gene