Source author record

Kwun Chuen Gary Chan

Kwun Chuen Gary Chan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Methodology · Applications · Machine Learning

Catalog footprint

What is connected

7 works

3 topics

4 close collaborators


preprint · 2026 · arXiv

Dimension-reduced outcome-weighted learning for estimating individualized treatment regimes in observational studies

Individualized treatment regimes (ITRs) aim to improve clinical outcomes by assigning treatment based on patient-specific characteristics. However, existing methods often struggle with high-dimensional covariates, limiting accuracy, interpretability, and real-world applicability. We propose a novel sufficient dimension reduction approach that directly targets the contrast between potential outcomes and identifies a low-dimensional subspace of the covariates capturing treatment effect heterogeneity. This reduced representation enables more accurate estimation of optimal ITRs through outcome-weighted learning. To accommodate observational data, our method incorporates kernel-based covariate balancing, allowing treatment assignment to depend on the full covariate set and avoiding the restrictive assumption that the subspace sufficient for modeling heterogeneous treatment effects is also sufficient for confounding adjustment. We show that the proposed method achieves universal consistency, i.e., its risk converges to the Bayes risk, under mild regularity conditions. We demonstrate its finite sample performance through simulations and an analysis of intensive care unit sepsis patient data to determine who should receive transthoracic echocardiography.
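The quantity that outcome-weighted learning seeks to maximize can be illustrated with a minimal sketch. It assumes the standard inverse-probability-weighted value estimate, in which each outcome counts only when the observed treatment agrees with the rule and is weighted by the inverse of the treatment probability. The function name `regime_value`, the toy data, and the use of known propensities are assumptions made for illustration; the abstract describes kernel-based covariate balancing weights and a dimension-reduction step, neither of which is reproduced here.

```python
from typing import Callable, Sequence


def regime_value(
    covariates: Sequence[float],
    treatments: Sequence[int],
    outcomes: Sequence[float],
    propensities: Sequence[float],
    rule: Callable[[float], int],
) -> float:
    """Inverse-probability-weighted estimate of the value of a treatment rule.

    propensities[i] is the (assumed known) probability that subject i
    received the treatment actually observed, given that subject's covariates.
    """
    n = len(outcomes)
    if not (len(covariates) == len(treatments) == len(propensities) == n):
        raise ValueError("all inputs must have the same length")
    if n == 0:
        raise ValueError("at least one observation is required")

    total = 0.0
    for x, a, r, p in zip(covariates, treatments, outcomes, propensities):
        if not 0.0 < p <= 1.0:
            raise ValueError("propensities must lie in (0, 1]")
        # Only subjects whose observed treatment matches the rule contribute.
        if rule(x) == a:
            total += r / p
    return total / n


# Hypothetical toy data: one covariate, treatments coded as +1 / -1.
x = [0.2, -0.4, 1.0, -1.5]
a = [1, -1, -1, 1]
r = [3.0, 2.0, 1.0, 4.0]
p = [0.5, 0.5, 0.5, 0.5]

value_sign_rule = regime_value(x, a, r, p, lambda xi: 1 if xi > 0 else -1)
value_treat_all = regime_value(x, a, r, p, lambda xi: 1)
```

Comparing `value_sign_rule` with `value_treat_all` shows how candidate rules are ranked by estimated value; outcome-weighted learning recasts the search for the best rule as a weighted classification problem.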

preprint · 2021 · arXiv

Defining and Estimating Subgroup Mediation Effects with Semi-Competing Risks Data

In many medical studies, an ultimate failure event such as death is likely to be affected by the occurrence and timing of other intermediate clinical events. Both event times are subject to censoring by loss to follow-up, and the nonterminal event may additionally be censored by the occurrence of the primary outcome, but not vice versa. To study the effect of an intervention on both events, the intermediate event may be viewed as a mediator, but the conventional definitions of direct and indirect effects are not applicable because of the semi-competing risks data structure. We define three principal strata based on whether the potential intermediate event occurs before the potential failure event, which allows proper definition of direct and indirect effects in one stratum, whereas total effects are defined for all strata. We discuss the identification conditions for stratum-specific effects and propose a semiparametric estimator based on a multivariate logistic stratum membership model and within-stratum proportional hazards models for the event times. By treating the unobserved stratum membership as a latent variable, we propose an EM algorithm for computation. We study the asymptotic properties of the estimators using modern empirical process theory and examine the performance of the estimators in numerical studies.

preprint · 2021 · arXiv

Estimation of Partially Conditional Average Treatment Effect by Hybrid Kernel-covariate Balancing

We study nonparametric estimation for the partially conditional average treatment effect, defined as the treatment effect function over a subset of confounders of interest. We propose a hybrid kernel weighting estimator where the weights aim to control the balancing error of any function of the confounders from a reproducing kernel Hilbert space after kernel smoothing over the subset of variables of interest. In addition, we present an augmented version of our estimator that can incorporate estimates of the outcome mean functions. Based on the representer theorem, gradient-based algorithms can be applied to solve the corresponding infinite-dimensional optimization problem. Asymptotic properties are studied without any smoothness assumptions on the propensity score function or the need for data splitting, relaxing certain stringent assumptions in the existing literature. The numerical performance of the proposed estimator is demonstrated by a simulation study and an application to the effect of a mother's smoking on a baby's birth weight conditioned on the mother's age.

preprint · 2020 · arXiv

Controlling the False Discovery Rate for Binary Feature Selection via Knockoff

Variable selection has been widely used in data analysis for decades, and it has become increasingly important in the Big Data era, as a dataset often contains hundreds of variables. To enhance the interpretability of a model, identifying potentially relevant features is often a step taken before fitting all the features into a regression model. A good variable selection method should effectively control the fraction of false discoveries and ensure sufficient power for its selection set. In many contemporary data applications, a large portion of the features are coded as binary variables. Binary features are widespread in many fields, from online controlled experiments to genome science to physical statistics. Although a handful of recent papers establish provable false discovery rate (FDR) control in variable selection, most of the theoretical analyses rely on strong dependence assumptions or Gaussian assumptions on the features. In this paper we propose a variable selection method in a regression framework for selecting binary features. Under mild conditions, we show that FDR is controlled exactly under a target level in a finite sample if the underlying distribution of the binary features is known. We show in simulations that FDR control is still attained when the feature distribution is estimated from data. We also provide theoretical results on the power of our variable selection method in a linear regression model or a logistic regression model. In the restricted settings where competitors exist, we show in simulations and a real data application on an HIV antiretroviral therapy dataset that our method has higher power than the competitor.
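To make the selection step concrete, here is a minimal sketch of the generic knockoff+ thresholding rule from the knockoff filter literature. It is not the binary-feature knockoff construction proposed in this paper, which the abstract does not spell out. The sketch assumes the feature statistics `w` have already been computed, with large positive values indicating evidence for a feature and signs that are symmetric under the null. The function names and example values are hypothetical.

```python
import math
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns math.inf when no candidate threshold qualifies."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        # Negative statistics estimate the number of false positives above t.
        estimated_false = 1 + sum(1 for v in w if v <= -t)
        selected = max(1, sum(1 for v in w if v >= t))
        if estimated_false / selected <= q:
            return t
    return math.inf


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of the features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


# Hypothetical statistics for eight features.
w_example = [5.0, 4.0, 3.0, -1.0, 2.5, 0.5, -0.2, 6.0]
selected_features = knockoff_select(w_example, q=0.25)
```

The added 1 in the numerator is what distinguishes knockoff+ from the plain knockoff rule and is the standard device for obtaining finite-sample FDR control at level `q`.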

preprint · 2014 · arXiv

Marginalizable conditional model for clustered ordinal data

We introduce a flexible parametric mixed effects model for correlated binary data, with parameters that can be directly interpreted as marginal odds ratios. This leads to a robust estimating equation whose optimal weighting matrix is the inverse of a genuine model-based covariance matrix. Flexible correlation structures can be imposed by correlated random effects, and correlation parameters can be estimated by solving a composite likelihood score equation. Marginal parameters are consistently estimated even when the conditional parametric model is misspecified, and the robust estimation procedure loses little efficiency compared to maximum likelihood estimation under a correct model specification. Simulations and analyses of the Madras longitudinal schizophrenia study and the British Social Attitudes Panel Survey were carried out to demonstrate our method.

preprint · 2014 · arXiv

Oracle, Multiple Robust and Multipurpose Calibration in a Missing Response Problem

In the presence of a missing response, reweighting the complete case subsample by the inverse of the nonmissing probability is both intuitive and easy to implement. When the population totals of some auxiliary variables are known and when the inclusion probabilities are known by design, survey statisticians have developed calibration methods for improving the efficiency of inverse probability weighting estimators, and these methods can be applied to missing data analysis. Model-based calibration has been proposed in the survey sampling literature, where multidimensional auxiliary variables are first summarized into a predictor function from a working regression model. Usually, one working model is proposed for each parameter of interest, which results in different sets of calibration weights for estimating different parameters. This paper considers calibration using multiple working regression models for estimating a single parameter or multiple parameters. Contrary to a common belief that overfitting hurts efficiency, we present three rather unexpected results. First, when the missing probability is correctly specified and multiple working regression models for the conditional mean are posited, calibration enjoys an oracle property: the same semiparametric efficiency bound is attained as if the true outcome model were known in advance. Second, when the missing data mechanism is misspecified, calibration can still yield a consistent estimator when any one of the outcome regression models is correctly specified. Third, a common set of calibration weights can be used to improve efficiency in estimating multiple parameters of interest and can simultaneously attain semiparametric efficiency bounds for all parameters of interest. We provide connections of a wide class of calibration estimators, constructed based on generalized empirical likelihood, to many existing estimators in biostatistics, econometrics and survey sampling, and perform simulation studies to show that the finite sample properties of calibration estimators conform well with the theoretical results being studied.
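The reweighting idea in the first sentence of this abstract can be sketched in a few lines. This is a minimal illustration of plain inverse probability weighting for a mean with a missing response, assuming the nonmissing probabilities are known. It does not implement the calibration estimators studied in the paper. The function name `ipw_mean`, the `normalize` flag, and the toy data are assumptions made for illustration.

```python
from typing import Optional, Sequence


def ipw_mean(
    y: Sequence[Optional[float]],
    observed_prob: Sequence[float],
    normalize: bool = False,
) -> float:
    """Inverse-probability-weighted estimate of the mean response.

    y[i] is None when the response is missing; observed_prob[i] is the
    (assumed known) probability that response i is observed.
    With normalize=True the weights are rescaled to sum to one.
    """
    if len(y) != len(observed_prob):
        raise ValueError("y and observed_prob must have the same length")
    if len(y) == 0:
        raise ValueError("at least one unit is required")

    weighted_sum = 0.0
    weight_total = 0.0
    for yi, pi in zip(y, observed_prob):
        if yi is None:
            continue  # missing responses contribute nothing
        if not 0.0 < pi <= 1.0:
            raise ValueError("observation probabilities must lie in (0, 1]")
        weighted_sum += yi / pi
        weight_total += 1.0 / pi

    if normalize:
        if weight_total == 0.0:
            raise ValueError("no observed responses to normalize over")
        return weighted_sum / weight_total
    return weighted_sum / len(y)


# Hypothetical toy data: two of the four responses are missing.
y_example = [2.0, None, 4.0, None]
prob_example = [0.5, 0.5, 1.0, 0.5]

estimate_unnormalized = ipw_mean(y_example, prob_example)
estimate_normalized = ipw_mean(y_example, prob_example, normalize=True)
```

Calibration, as described in the abstract, goes further by adjusting the weights so that weighted totals of auxiliary quantities match known or estimated targets; the sketch only shows the baseline that calibration is meant to improve.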

preprint · 2010 · arXiv

Backward estimation of stochastic processes with failure events as time origins

Stochastic processes often exhibit sudden systematic changes in pattern a short time before certain failure events. Examples include an increase in medical costs before death and a decrease in CD4 counts before AIDS diagnosis. To study such terminal behavior of stochastic processes, a natural and direct way is to align the processes using failure events as time origins. This paper studies backward stochastic processes counting time backward from failure events, and proposes one-sample nonparametric estimation of the mean of backward processes when follow-up is subject to left truncation and right censoring. We discuss the benefits of including prevalent cohort data to enlarge the identifiable region, as well as large sample properties of the proposed estimator and related extensions. A SEER-Medicare linked data set is used to illustrate the proposed methodologies.
