Source author record

Bin Nan

Bin Nan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, math.ST, Statistics Theory, Applications, Methodology

Catalog footprint

What is connected

11 works

5 topics

4 close collaborators

Preprint, 2026, arXiv

Conformalized Percentile Interval: Finite Sample Validity and Improved Conditional Performance

Conformal prediction provides distribution-free predictive intervals with finite-sample marginal coverage. However, achieving conditional validity and interval efficiency (in terms of short interval length) remains challenging, particularly in complex settings with heteroskedasticity, skewed responses, or estimation errors. We propose a conformal-style calibration method for responses obtained by the probability integral transform (PIT) of the conditional cumulative distribution function (CDF) estimated via neural networks to construct a finite-sample-adjusted percentile interval with the shortest length determined by the estimated conditional CDF. Calibrating in PIT space is effective because PIT values are asymptotically feature-independent when the CDF estimator is accurate, which mitigates feature-dependent miscoverage and improves conditional calibration. On the other hand, our percentile calibration adapts to the empirical PIT distribution, which is robust against a possibly imperfect estimation of the conditional CDF. We prove the finite-sample marginal coverage property of the proposed method and show its asymptotic conditional coverage under mild consistency conditions. Experiments on diverse synthetic and real-world benchmarks demonstrate better conditional calibration and substantially shorter intervals than existing methods.
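The calibration idea in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a centered interval in PIT space with the standard split-conformal rank adjustment, whereas the abstract describes a shortest-length percentile interval. The names `pit_radius`, `pit_interval` and `quantile_fn` are hypothetical, and the calibration PIT values are assumed to come from some already fitted conditional CDF.

```python
import math
from statistics import NormalDist


def pit_radius(cal_pits, alpha):
    """Half-width of a centered PIT-space interval from calibration PIT values.

    Uses the split-conformal rank k = ceil((n + 1) * (1 - alpha)) on the
    scores |u - 0.5|. Returns 0.5 (the whole unit interval) when the
    calibration set is too small for the requested level.
    """
    n = len(cal_pits)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 0.5
    scores = sorted(abs(u - 0.5) for u in cal_pits)
    return min(scores[k - 1], 0.5)


def pit_interval(cal_pits, alpha, quantile_fn):
    """Map the calibrated PIT-space interval back to the response scale.

    quantile_fn is an assumed stand-in for the estimated conditional
    quantile function at a new feature vector.
    """
    q = pit_radius(cal_pits, alpha)
    eps = 1e-9  # keep probabilities strictly inside (0, 1)
    lo = max(0.5 - q, eps)
    hi = min(0.5 + q, 1.0 - eps)
    return quantile_fn(lo), quantile_fn(hi)


if __name__ == "__main__":
    # Toy calibration PIT values; a standard normal plays the fitted CDF.
    pits = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    print(pit_interval(pits, 0.2, NormalDist().inv_cdf))
```

Under exchangeability of calibration and test points, this rank adjustment gives marginal coverage of at least 1 - alpha for the centered interval. When the calibration PIT values are far from uniform, the radius changes accordingly, which is the sense in which calibration adapts to an imperfect CDF estimate.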

Preprint, 2022, arXiv

Conditional Distribution Function Estimation Using Neural Networks for Censored and Uncensored Data

Most work in neural networks focuses on estimating the conditional mean of a continuous response variable given a set of covariates. In this article, we consider estimating the conditional distribution function using neural networks for both censored and uncensored data. The algorithm is built upon the data structure particularly constructed for the Cox regression with time-dependent covariates. Without imposing any model assumption, we consider a loss function that is based on the full likelihood where the conditional hazard function is the only unknown nonparametric parameter, for which unconstrained optimization methods can be applied. Through simulation studies, we show the proposed method possesses desirable performance, whereas the partial likelihood method and the traditional neural networks with $L_2$ loss yield biased estimates when model assumptions are violated. We further illustrate the proposed method with several real-world data sets. The implementation of the proposed methods is made available at https://github.com/bingqing0729/NNCDE.

Preprint, 2020, arXiv

A Revisit to De-biased Lasso for Generalized Linear Models

De-biased lasso has emerged as a popular tool to draw statistical inference for high-dimensional regression models. However, simulations indicate that for generalized linear models (GLMs), de-biased lasso inadequately removes biases and yields unreliable confidence intervals. This motivates us to scrutinize the application of de-biased lasso in high-dimensional GLMs. When $p > n$, we detect that a key sparsity condition on the inverse information matrix generally does not hold in a GLM setting, which likely explains the subpar performance of de-biased lasso. Even in a less challenging "large $n$, diverging $p$" scenario, we find that de-biased lasso and the maximum likelihood method often yield confidence intervals with unsatisfactory coverage probabilities. In this scenario, we examine an alternative approach for further bias correction by directly inverting the Hessian matrix without imposing the matrix sparsity assumption. We establish the asymptotic distributions of any linear combinations of the resulting estimates, which lay the theoretical groundwork for drawing inference. Simulations show that this refined de-biased estimator performs well in removing biases and yields an honest confidence interval coverage. We illustrate the method by analyzing a prospective hospital-based Boston Lung Cancer Study, a large-scale epidemiology cohort investigating the joint effects of genetic variants on lung cancer risk.
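One way to read the bias correction by direct Hessian inversion is as a one-step Newton-type update from the lasso estimate. The notation below is an assumption made for illustration and is not taken from the paper: $\ell_n$ denotes the averaged negative log-likelihood, $\hat\beta$ the lasso estimate, and the Hessian is assumed invertible, which requires $p < n$.

```latex
\ell_n(\beta) = -\frac{1}{n}\sum_{i=1}^{n} \log f(y_i \mid x_i; \beta),
\qquad
\hat\Theta = \bigl[\ddot\ell_n(\hat\beta)\bigr]^{-1},
\qquad
\hat b = \hat\beta - \hat\Theta\,\dot\ell_n(\hat\beta).
```

Here $\dot\ell_n$ and $\ddot\ell_n$ are the gradient and Hessian of $\ell_n$. The usual de-biased lasso instead replaces $\hat\Theta$ with a sparse estimate of the inverse information matrix, which is the step that relies on the sparsity condition questioned in the abstract.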

Preprint, 2016, arXiv

Multiple Testing for Neuroimaging via Hidden Markov Random Field

Traditional voxel-level multiple testing procedures in neuroimaging, mostly $p$-value based, often ignore the spatial correlations among neighboring voxels and thus suffer from substantial loss of power. We extend the local-significance-index based procedure originally developed for the hidden Markov chain models, which aims to minimize the false nondiscovery rate subject to a constraint on the false discovery rate, to three-dimensional neuroimaging data using a hidden Markov random field model. A generalized expectation-maximization algorithm for maximizing the penalized likelihood is proposed for estimating the model parameters. Extensive simulations show that the proposed approach is more powerful than conventional false discovery rate procedures. We apply the method to the comparison between mild cognitive impairment, a disease status with increased risk of developing Alzheimer's or another dementia, and normal controls in the FDG-PET imaging study of the Alzheimer's Disease Neuroimaging Initiative.
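The thresholding step of a local-significance-index (LIS) procedure can be sketched as follows. The sketch assumes the LIS values, that is, the estimated posterior probabilities that each voxel is null, have already been computed from a fitted model; fitting the hidden Markov random field is not shown. The function name `lis_rejections` is hypothetical.

```python
def lis_rejections(lis, alpha):
    """Return the sorted indices rejected by a step-up rule on LIS values.

    Hypotheses are ordered by increasing LIS, and the rule rejects the
    largest initial block whose running average of LIS values stays at
    or below alpha.
    """
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    running_sum = 0.0
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += lis[idx]
        if running_sum / rank <= alpha:
            n_reject = rank
    return sorted(order[:n_reject])


if __name__ == "__main__":
    # Five toy voxels with assumed LIS values.
    print(lis_rejections([0.01, 0.9, 0.02, 0.5, 0.2], alpha=0.1))
```

The running average of the smallest LIS values estimates the false discovery proportion among the rejected set, so keeping it at or below alpha targets false discovery rate control while rejecting as many voxels as possible.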

Preprint, 2014, arXiv

Regularized 3D functional regression for brain image data via Haar wavelets

The primary motivation and application in this article come from brain imaging studies on cognitive impairment in elderly subjects with brain disorders. We propose a regularized Haar wavelet-based approach for the analysis of three-dimensional brain image data in the framework of functional data analysis, which automatically takes into account the spatial information among neighboring voxels. We conduct extensive simulation studies to evaluate the prediction performance of the proposed approach and its ability to identify regions related to the outcome of interest, under the assumption that only a few relatively small subregions are truly predictive of that outcome. We then apply the proposed approach to searching for brain subregions that are associated with cognition using PET images of patients with Alzheimer's disease, patients with mild cognitive impairment and normal controls.

Preprint, 2014, arXiv

Semiparametric Approach for Regression with Covariate Subject to Limit of Detection

We consider generalized linear regression analysis with a covariate that is left-censored due to a lower limit of detection. Complete case analysis, which eliminates observations with values below the limit of detection, yields valid estimates for regression coefficients but loses efficiency; substitution methods are biased; and the maximum likelihood method relies on parametric models for the unobservable tail probability distribution of the covariate, and thus may suffer from model misspecification. To obtain robust and more efficient results, we propose a semiparametric likelihood-based approach for the estimation of regression parameters using an accelerated failure time model for the covariate subject to limit of detection. A two-stage estimation procedure is considered, where the conditional distribution of the covariate with limit of detection given other variables is estimated prior to maximizing the likelihood function. In simulation studies, the proposed method outperforms both the complete case analysis and the substitution methods. Technical conditions for desirable asymptotic properties are provided.

Preprint, 2013, arXiv

Estimating mean survival time: when is it possible?

For right censored survival data, it is well known that the mean survival time can be consistently estimated when the support of the censoring time contains the support of the survival time. In practice, however, this condition can be easily violated because the follow-up of a study is usually within a finite window. In this article we show that the mean survival time is still estimable from a linear model when the support of some covariate(s) with nonzero coefficient(s) is unbounded regardless of the length of follow-up. This implies that the mean survival time can be well estimated when the covariate range is wide in practice. The theoretical finding is further verified for finite samples by simulation studies. Simulations also show that, when both models are correctly specified, the linear model yields reasonable mean square prediction errors and outperforms the Cox model, particularly with heavy censoring and short follow-up time.

Preprint, 2012, arXiv

A general semiparametric Z-estimation approach for case-cohort studies

Case-cohort design, an outcome-dependent sampling design for censored survival data, is increasingly used in biomedical research. The development of asymptotic theory for a case-cohort design in the current literature primarily relies on counting process stochastic integrals. Such an approach, however, is rather limited and lacks theoretical justification for outcome-dependent weighted methods due to non-predictability. Instead of stochastic integrals, we derive asymptotic properties for case-cohort studies based on a general Z-estimation theory for semiparametric models with bundled parameters using modern empirical processes. Both the Cox model and the additive hazards model with time-dependent covariates are considered.

Preprint, 2012, arXiv

A sieve M-theorem for bundled parameters in semiparametric models, with application to the efficient estimation in a linear model for censored data

In many semiparametric models that are parameterized by two types of parameters---a Euclidean parameter of interest and an infinite-dimensional nuisance parameter---the two parameters are bundled together, that is, the nuisance parameter is an unknown function that contains the parameter of interest as part of its argument. For example, in a linear regression model for censored survival data, the unspecified error distribution function involves the regression coefficients. Motivated by developing an efficient estimating method for the regression parameters, we propose a general sieve M-theorem for bundled parameters and apply the theorem to deriving the asymptotic theory for the sieve maximum likelihood estimation in the linear regression model for censored survival data. The numerical implementation of the proposed estimating method can be achieved through the conventional gradient-based search algorithms such as the Newton--Raphson algorithm. We show that the proposed estimator is consistent and asymptotically normal and achieves the semiparametric efficiency bound. Simulation studies demonstrate that the proposed method performs well in practical settings and yields more efficient estimates than existing estimating equation based methods. Illustration with a real data example is also provided.

Preprint, 2012, arXiv

Non-asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso

We consider the finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulty caused by the lack of iid and Lipschitz property.

Preprint, 2011, arXiv

Random lasso

We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates. A measure of importance is yielded from this step for each covariate. In step 2, a similar procedure to the first step is implemented with the exception that for each bootstrap sample, a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates' importance. Adaptive lasso may be used in the second step with weights determined by the importance measures. The final set of covariates and their coefficients are determined by averaging bootstrap results obtained from step 2. The proposed method alleviates some of the limitations of lasso, elastic-net and related methods noted especially in the context of microarray data analysis: it tends to remove highly correlated variables altogether or select them all, and maintains maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive or superior compared to the alternatives. We illustrate the proposed method by extensive simulation studies. The proposed method is also applied to a Glioblastoma microarray data analysis.
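The two steps described in this abstract can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the lasso is solved by a basic coordinate descent without an intercept, the penalty `lam` and the subset sizes `q1` and `q2` are fixed by hand rather than tuned, the importance measure is taken to be the absolute value of the averaged step 1 coefficient, and the adaptive lasso option is omitted. All function names are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1 / 2n) * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def weighted_subset(p, q, weights, rng):
    """Draw min(q, p) distinct column indices, proportional to the weights."""
    remaining = list(range(p))
    chosen = []
    for _ in range(min(q, p)):
        w = [weights[j] + 1e-12 for j in remaining]  # avoid an all-zero total
        pick = rng.choices(remaining, weights=w, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


def bootstrap_pass(X, y, lam, q, n_boot, weights, rng):
    """Average lasso coefficients over bootstrap samples and random subsets."""
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]
        cols = weighted_subset(p, q, weights, rng)
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        for j, b in zip(cols, lasso_cd(Xb, yb, lam)):
            total[j] += b  # covariates left out of a draw contribute zero
    return [t / n_boot for t in total]


def random_lasso(X, y, lam=0.1, q1=None, q2=None, n_boot=50, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    # Step 1: equal selection probabilities give the importance measures.
    step1 = bootstrap_pass(X, y, lam, q1 or p, n_boot, [1.0] * p, rng)
    importance = [abs(b) for b in step1]
    # Step 2: selection probabilities proportional to importance.
    return bootstrap_pass(X, y, lam, q2 or p, n_boot, importance, rng)
```

A call such as `random_lasso(X, y, lam=0.1, q1=3, q2=3, n_boot=20)` takes `X` as a list of rows and returns one averaged coefficient per covariate. Averaging over draws that exclude a covariate shrinks its coefficient toward zero, so covariates selected rarely in step 2 end up with small final coefficients.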
