Source author record

Molei Liu

Molei Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record


Catalog footprint

What is connected

6 works

1 topic

4 close collaborators


preprint · 2026 · arXiv

Model-X Change-Point Detection of Conditional Distribution

The dynamic nature of many real-world systems can lead to temporal outcome model shifts, causing a deterioration in model accuracy and reliability over time. This requires change-point detection on the outcome models to guide model retraining and adjustments. However, inferring the change point of conditional models is more prone to loss of validity or power than classic detection problems for marginal distributions. This is due to both the temporal covariate shift and the complexity of the outcome model. In addition, existing methods for conditional change-point detection have many limitations, including linearity assumptions and low-dimensionality requirements, which are sometimes unsuitable for real-world applications. To address these challenges, we propose a novel Model-X changE-point detectioN of conditional Distribution (MEND) method, computationally enhanced with a distillation function, for simultaneous change-point detection and localization of the conditional outcome model. We extend and combine our model with neural networks to accommodate complex nonlinear and high-dimensional settings, and this extension is shown to be valid on both simulated and real data. Theoretical validity of the proposed method is justified. Extensive simulation studies and two real-world examples demonstrate the statistical effectiveness and computational scalability of our method as well as its significant improvements over existing methods.

preprint · 2022 · arXiv

Augmented Transfer Regression Learning with Semi-non-parametric Nuisance Models

In contemporary statistical learning, covariate shift correction plays an important role in transfer learning when the distribution of the testing data is shifted from that of the training data. Importance weighting, as a natural and principled strategy to adjust for covariate shift, has been commonly used in the field of transfer learning. However, this strategy is not robust to model misspecification or excessive estimation error. In this paper, we propose an augmented transfer regression learning (ATReL) approach that introduces an imputation model for the targeted response, and uses it to augment the importance weighting equation. With novel semi-non-parametric constructions and calibrated moment estimating equations for the two nuisance models, our ATReL method is less prone to (i) the curse of dimensionality than nonparametric approaches, and (ii) model misspecification than parametric approaches. We show that our ATReL estimator is root-n-consistent when at least one nuisance model is correctly specified, estimation for the parametric part of the nuisance models achieves the parametric rate, and the nonparametric components are rate doubly robust. Simulation studies demonstrate that our method is more robust and efficient than existing parametric and fully nonparametric (machine learning) estimators under various configurations. We also examine the utility of our method through a real example on transfer learning of a phenotyping algorithm for rheumatoid arthritis across different time windows. Finally, we propose ways to enhance the intrinsic efficiency of our estimator and to incorporate modern machine learning methods into our proposed framework.
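To make the idea of augmenting an importance weighting equation concrete, here is a generic doubly robust form written as a worked equation. This is a sketch under stated assumptions, not the paper's exact construction: the notation (labeled source sample $(X_i, Y_i)$, $i = 1, \dots, n$; unlabeled target covariates $\tilde X_j$, $j = 1, \dots, N$; link $g$; estimated density ratio $\hat\omega$; estimated imputation model $\hat m$) is introduced here for illustration only.

```latex
% Importance weighting alone: reweight source residuals by the
% estimated density ratio \hat\omega(x) (target density over source density).
\frac{1}{n}\sum_{i=1}^{n} \hat\omega(X_i)\, X_i \bigl\{ Y_i - g(X_i^{\top}\beta) \bigr\} = 0

% Augmented version: weighted source residuals around the imputation
% model \hat m, plus an imputation term on the target covariates.
\frac{1}{n}\sum_{i=1}^{n} \hat\omega(X_i)\, X_i \bigl\{ Y_i - \hat m(X_i) \bigr\}
  + \frac{1}{N}\sum_{j=1}^{N} \tilde X_j \bigl\{ \hat m(\tilde X_j) - g(\tilde X_j^{\top}\beta) \bigr\} = 0
```

In this generic form, if $\hat m$ is correct the first sum has mean close to zero whatever the weights are, and if $\hat\omega$ is correct the first sum offsets the bias of a wrong $\hat m$ in the second. That is the sense in which only one of the two nuisance models needs to be correctly specified.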

preprint · 2020 · arXiv

A Note on Debiased/Double Machine Learning Logistic Partially Linear Model

It is of particular interest in many application fields to draw doubly robust inference for a logistic partially linear model, with the predictor specified as a combination of a targeted low dimensional linear parametric function and a nuisance nonparametric function. Recently, Tan (2019) proposed a simple and flexible doubly robust estimator for this purpose. They introduced the two nuisance models, i.e. the nonparametric component in the logistic model and the conditional mean of the exposure covariates given the other covariates and fixed response, and specified them as fixed dimensional parametric models. Their framework could potentially be extended to the machine learning or high dimensional nuisance modelling exploited recently, e.g. in Chernozhukov et al. (2018a,b), Smucler et al. (2019), and Tan (2020). Motivated by this, we derive the debiased/double machine learning logistic partially linear model in this note. For construction of the nuisance models, we separately consider the use of high dimensional sparse parametric models and general machine learning methods. By deriving certain moment equations to calibrate the first order bias of the nuisance models, we preserve a model double robustness property on high dimensional ultra-sparse nuisance models. We also discuss and compare the underlying assumptions of our method with those of the debiased LASSO (Van de Geer et al., 2014). To implement the machine learning proposal, we design a full model refitting procedure that allows the use of any blackbox conditional mean estimation method in our framework. Under the machine learning setting, our method is rate doubly robust in a similar sense to Chernozhukov et al. (2018a).

preprint · 2020 · arXiv

Individual Data Protected Integrative Regression Analysis of High-dimensional Heterogeneous Data

Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high dimensional setting. The challenge is even more pronounced when the individual level data cannot be shared across studies, known as the DataSHIELD constraint (Wolfson et al., 2010). Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integrating procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR by deriving phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.
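The notion of integrating studies through summary statistics rather than individual-level data can be illustrated with a deliberately simple toy case. The sketch below is not SHIR: it assumes a single covariate, least squares through the origin, and a common slope across sites (no heterogeneity, no sparsity penalty). The function names and the data are hypothetical.

```python
def site_summary(xs, ys):
    """Summary statistics a site shares: (sum of x^2, sum of x*y)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxx, sxy


def slope_from_summaries(summaries):
    """Least-squares slope through the origin, using only site summaries."""
    total_sxx = sum(s[0] for s in summaries)
    total_sxy = sum(s[1] for s in summaries)
    return total_sxy / total_sxx


# Hypothetical individual-level data held privately at two sites.
site_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
site_b = ([0.5, 1.5, 2.5, 3.5], [1.0, 3.2, 4.8, 7.1])

# Only the two-number summaries leave each site.
summaries = [site_summary(*site_a), site_summary(*site_b)]
slope_summary = slope_from_summaries(summaries)

# Reference: the same estimator computed on pooled individual-level data.
all_x = site_a[0] + site_b[0]
all_y = site_a[1] + site_b[1]
slope_pooled = sum(x * y for x, y in zip(all_x, all_y)) / sum(x * x for x in all_x)
```

In this toy case the summary-based and pooled estimates agree up to floating-point rounding, which mirrors, in a far simpler setting, the abstract's claim of equivalence to the estimator obtained by sharing all data.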

preprint · 2020 · arXiv

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints

Identifying informative predictors in a high dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improving power is meta-analyzing multiple studies on the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data but not individual level data can be shared across different sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection that allows between-study heterogeneity and does not require sharing of individual level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the DSILT approach incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the DSILT procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the DSILT procedure performs well in both false discovery control and attaining power. The proposed method is applied to a real example on detecting the interaction effect of the genetic variants for statins and obesity on the risk of Type 2 Diabetes.
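For readers unfamiliar with FDR control, the sketch below shows the classical Benjamini-Hochberg step-up rule applied to a list of p-values. It is a standard textbook procedure swapped in for illustration, not the DSILT multiple testing procedure itself, and the p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Order hypothesis indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


# Hypothetical p-values for ten covariates.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

With these inputs the rule rejects only the first two hypotheses. Because it is a step-up rule, a p-value that fails its own threshold is still rejected whenever a larger p-value passes at a higher rank.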

preprint · 2016 · arXiv

Joint Models for Time-to-Event Data and Longitudinal Biomarkers of High Dimension

Joint models for longitudinal biomarkers and time-to-event data are widely used in longitudinal studies. Many joint modeling approaches have been proposed to deal with different types of longitudinal biomarkers and survival outcomes. However, most existing joint modeling methods cannot deal with a large number of longitudinal biomarkers simultaneously, such as the longitudinally collected gene expression profiles. In this article, we propose a new joint modeling method under the Bayesian framework, which is able to deal with longitudinal biomarkers of high dimension. Specifically, we assume that only a few unobserved latent variables are related to the survival outcome and the latent variables are inferred using a factor analysis model, which greatly reduces the dimensionality of the biomarkers and also accounts for the high correlations among the biomarkers. Through extensive simulation studies, we show that our proposed method has improved prediction accuracy over other joint modeling methods. We illustrate the usefulness of our method on a dataset of idiopathic pulmonary fibrosis patients in which we are interested in predicting the patients' time-to-death using their gene expression profiles.
