Source author record

Michael R. Kosorok

Michael R. Kosorok appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Methodology, Machine Learning, math.ST, Statistics Theory, Applications, Computation

Catalog footprint

What is connected

18 works

6 topics

4 close collaborators

Preprint, 2026, arXiv

Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

We study estimation of the conditional law $P(Y \mid X=x)$ and continuous functionals $\Psi(P(Y \mid X=x))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs and establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design illustrate finite-sample performance and demonstrate the statistical error price of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: we estimate the tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.
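As a rough illustration of the MMD split criterion mentioned in this abstract, the sketch below computes a squared MMD between two candidate child nodes whose empirical distributions are design-weighted with Hájek-type (sum-to-one) normalization. The function names, the Gaussian kernel, and the bandwidth are assumptions made for this example; the criterion used in the paper may differ, for instance in how it scales by node size.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def hajek_weights(design_weights):
    """Normalize survey design weights so that they sum to one."""
    total = sum(design_weights)
    return [w / total for w in design_weights]


def weighted_mmd2(left, w_left, right, w_right, bandwidth=1.0):
    """Squared MMD between two design-weighted empirical distributions.

    Hypothetical helper: uses the plug-in (V-statistic) form
    sum a_i a_k k(x_i, x_k) + sum b_j b_l k(y_j, y_l) - 2 sum a_i b_j k(x_i, y_j).
    """
    a = hajek_weights(w_left)
    b = hajek_weights(w_right)

    def cross(xs, ws, ys, vs):
        return sum(
            wi * vj * gaussian_kernel(xi, yj, bandwidth)
            for xi, wi in zip(xs, ws)
            for yj, vj in zip(ys, vs)
        )

    return cross(left, a, left, a) + cross(right, b, right, b) - 2.0 * cross(left, a, right, b)


# Responses (two biomarkers, say) falling in two candidate child nodes,
# with made-up survey weights.
left_node = [(0.0, 0.0), (0.1, 0.0)]
right_node = [(5.0, 5.0), (5.1, 5.0)]
score = weighted_mmd2(left_node, [2.0, 1.0], right_node, [1.0, 3.0])
```

A larger score indicates that the two child nodes have more dissimilar weighted response distributions, which is what a distributional split aims for.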

Preprint, 2022, arXiv

Discussion of Multiscale Fisher's Independence Test for Multivariate Dependence

The multiscale Fisher's independence test (MULTIFIT hereafter) proposed by Gorsky & Ma (2022) is a novel method to test independence between two random vectors. By its design, this test is particularly useful in detecting local dependence. Moreover, by adopting a resampling-free approach, it can easily accommodate massive sample sizes. Another benefit of the proposed method is its ability to interpret the nature of dependency. We congratulate the authors, Shai Gorsky and Li Ma, for their very interesting and elegant work. In this comment, we would like to discuss a general framework unifying the MULTIFIT and other tests and compare it with the binary expansion randomized ensemble test (BERET hereafter) proposed by Lee et al. (In press). We also would like to contribute our thoughts on potential extensions of the method.

Preprint, 2022, arXiv

Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring

We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the failure time to be conditionally independent of censoring and dependent on the treatment decision times, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a certain time point. The estimator is constructed using generalized random survival forests and can have polynomial rates of convergence. Simulations and data analysis results suggest that the new estimator brings higher expected outcomes than existing methods in various settings. An R package dtrSurv is available on CRAN.

Preprint, 2022, arXiv

Risk-Adjusted Incidence Modeling on Hierarchical Survival Data with Recurrent Events

Many healthcare programs have a constant need to address problems with infection prevention and control (IP&C) in a timely manner. For example, pathogens can be transmitted among patients with cystic fibrosis (CF) in both the inpatient and outpatient settings within the healthcare system even with the existing recommended IP&C practices, and these pathogens are often associated with negative clinical outcomes. Because of limited and delayed data sharing, CF programs need a reliable method to track infection rates. There are three complex structures in CF registry data: recurrent infections, missing data, and multilevel correlation due to repeated measures within a patient and patient-to-patient transmissions. A step-by-step analysis pipeline was proposed to develop and validate a risk-adjusted model to help healthcare programs monitor the number of recurrent events while taking into account missing data and the hierarchies of repeated measures in right-censored data. We extended the mixed-effect Andersen-Gill model (the frailty model), adjusted for important risk factors, and provided confidence intervals for the predicted number of events where the variability of the prediction was estimated from three identified sources. The coverage of the estimated confidence intervals was used to evaluate model performance. Simulation results indicated that the coverage of our method was close to the desired confidence level. To demonstrate its clinical practicality, our pipeline was applied to monitor the infection incidence rate of two key CF pathogens using a U.S. registry. Results showed that years closer to the time of interest were better at predicting future incidence rates in the CF example.

Preprint, 2021, arXiv

The Binary Expansion Randomized Ensemble Test (BERET)

Recently, the binary expansion testing framework was introduced to test the independence of two continuous random variables by utilizing symmetry statistics that are complete sufficient statistics for dependence. We develop a new test based on an ensemble approach that uses the sum of squared symmetry statistics and distance correlation. Simulation studies suggest that this method improves the power while preserving the clear interpretation of the binary expansion testing. We extend this method to tests of independence of random vectors in arbitrary dimension. Through random projections, the proposed binary expansion randomized ensemble test transforms the multivariate independence testing problem into a univariate problem. Simulation studies and data example analyses show that the proposed method provides relatively robust performance compared with existing methods.

Preprint, 2020, arXiv

Estimation and Optimization of Composite Outcomes

There is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. An individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. A treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. However, clinical and intervention scientists often must balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. One approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. We propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. Estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. The estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. We derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.

Preprint, 2020, arXiv

Kernel Assisted Learning for Personalized Dose Finding

An individualized dose rule recommends a dose level within a continuous safe dose range based on patient-level information such as physical conditions, genetic factors and medication histories. Traditionally, the personalized dose-finding process requires repeated clinical visits by the patient and frequent adjustments of the dosage. Thus the patient is constantly exposed to the risk of underdosing and overdosing during the process. Statistical methods for finding an optimal individualized dose rule can lower the costs and risks for patients. In this article, we propose a kernel assisted learning method for estimating the optimal individualized dose rule. The proposed methodology can also be applied to all other continuous decision-making problems. Advantages of the proposed method include robustness to model misspecification and the capability of providing statistical inference for the estimated parameters. In the simulation studies, we show that this method is capable of identifying the optimal individualized dose rule and produces favorable expected outcomes in the population. Finally, we illustrate our approach using data from a warfarin dosing study for thrombosis patients.

Preprint, 2020, arXiv

Missing Data Imputation for Classification Problems

Imputation of missing data is a common application in various classification problems where the feature training matrix has missingness. A widely used solution to this imputation problem is based on the lazy learning technique, $k$-nearest neighbor (kNN) approach. However, most of the previous work on missing data does not take into account the presence of the class label in the classification problem. Also, existing kNN imputation methods use variants of Minkowski distance as a measure of distance, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on class weighted grey distance between the missing datum and all the training data. Grey distance works well in heterogeneous data with missing instances. The distance is weighted by Mutual Information (MI) which is a measure of feature relevance between the features and the class label. This ensures that the imputation of the training data is directed towards improving classification performance. This class weighted grey kNN imputation algorithm demonstrates improved performance when compared to other kNN imputation algorithms, as well as standard imputation algorithms such as MICE and missForest, in imputation and classification problems. These problems are based on simulated scenarios and UCI datasets with various rates of missingness.

Preprint, 2020, arXiv

Technical Background for "A Precision Medicine Approach to Develop and Internally Validate Optimal Exercise and Weight Loss Treatments for Overweight and Obese Adults with Knee Osteoarthritis"

We provide additional statistical background for the methodology developed in the clinical analysis of knee osteoarthritis in "A Precision Medicine Approach to Develop and Internally Validate Optimal Exercise and Weight Loss Treatments for Overweight and Obese Adults with Knee Osteoarthritis" (Jiang et al. 2020). Jiang et al. 2020 proposed a pipeline to learn optimal treatment rules with precision medicine models and compared them with zero-order models with a Z-test. Model performance was assessed with the value function, a scalar that predicts the future reward of each decision rule. The jackknife (i.e., leave-one-out cross validation) method was applied to estimate the value function and its variance for several outcomes in IDEA. IDEA is a randomized clinical trial studying three interventions (exercise (E), dietary weight loss (D), and D+E) on overweight and obese participants with knee osteoarthritis. In this report, we expand the discussion and justification with additional statistical background. We elaborate on the background of precision medicine, the derivation of the jackknife estimator of the value function and its estimated variance, the consistency of the jackknife estimator, and additional simulation results that further illustrate the performance of jackknife estimators. We recommend reading Jiang et al. 2020 for clinical application and interpretation of the optimal ITR of knee osteoarthritis as well as the overall understanding of the pipeline and recommend using this article to understand the underlying statistical derivation and methodology.
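The leave-one-out idea described in this abstract can be sketched as follows. This is an illustration only, not the estimator derived in the report: it assumes randomized assignment with known probabilities, uses an inverse-probability-weighted contribution for each held-out participant, learns a rule that assigns everyone the same arm (our reading of a "zero-order" rule), and takes the variance as the sample variance of the contributions divided by n. All function names are hypothetical.

```python
def best_arm(arms, outcomes):
    """Return the arm with the highest mean outcome in the training data."""
    totals, counts = {}, {}
    for arm, reward in zip(arms, outcomes):
        totals[arm] = totals.get(arm, 0.0) + reward
        counts[arm] = counts.get(arm, 0) + 1
    means = {arm: totals[arm] / counts[arm] for arm in totals}
    return max(means, key=means.get)


def jackknife_value(arms, outcomes, assignment_prob):
    """Leave-one-out estimate of the value of the learned rule.

    For each participant i, the rule is learned without i and then scored
    on i with the inverse-probability-weighted term
    outcome_i * 1(arm_i == recommended arm) / P(arm_i).
    """
    n = len(arms)
    contributions = []
    for i in range(n):
        train_arms = arms[:i] + arms[i + 1:]
        train_outcomes = outcomes[:i] + outcomes[i + 1:]
        recommended = best_arm(train_arms, train_outcomes)
        followed = 1.0 if arms[i] == recommended else 0.0
        contributions.append(outcomes[i] * followed / assignment_prob[arms[i]])
    value = sum(contributions) / n
    # Assumed variance formula: sample variance of contributions over n.
    variance = sum((c - value) ** 2 for c in contributions) / (n - 1) / n
    return value, variance


# Toy data with two arms labelled as in IDEA (E = exercise, D = diet).
toy_arms = ["E", "D", "E", "D"]
toy_outcomes = [1.0, 3.0, 1.0, 3.0]
toy_value, toy_variance = jackknife_value(toy_arms, toy_outcomes, {"E": 0.5, "D": 0.5})
```

Two value estimates with their variances could then be compared with a Z-test, as the abstract describes for precision medicine models versus zero-order models.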

Preprint, 2016, arXiv

Robust Hybrid Learning for Estimating Personalized Dynamic Treatment Regimens

Dynamic treatment regimens (DTRs) are sequential decision rules tailored at each stage by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity and chronicity of many diseases and disorders call for learning optimal DTRs which best dynamically tailor treatment to each individual's response over time. Proliferation of personalized data (e.g., genetic and imaging data) provides opportunities for deep tailoring as well as new challenges for statistical methodology. In this work, we propose a robust hybrid approach referred to as Augmented Multistage Outcome-Weighted Learning (AMOL) to integrate outcome-weighted learning and Q-learning to identify optimal DTRs from the Sequential Multiple Assignment Randomization Trials (SMARTs). We generalize outcome weighted learning (O-learning; Zhao et al. 2012) to allow for negative outcomes; we propose methods to reduce variability of weights in O-learning to achieve numeric stability and higher efficiency; finally, for multiple-stage SMART studies, we introduce doubly robust augmentation to machine learning based O-learning to improve efficiency by drawing information from regression model-based Q-learning at each stage. The proposed AMOL remains valid even if the Q-learning model is misspecified. We establish the theoretical properties of AMOL, including the consistency of the estimated rules and the rates of convergence to the optimal value function. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and applications to two SMART data sets: a two-stage trial for attention deficit and hyperactive disorder (ADHD) and the STAR*D trial for major depressive disorder (MDD).

Preprint, 2015, arXiv

Asymptotics for change-point models under varying degrees of mis-specification

Change-point models are widely used by statisticians to model drastic changes in the pattern of observed data. Least squares/maximum likelihood based estimation of change-points leads to curious asymptotic phenomena. When the change-point model is correctly specified, such estimates generally converge at a fast rate ($n$) and are asymptotically described by minimizers of a jump process. Under complete mis-specification by a smooth curve, i.e. when a change-point model is fitted to data described by a smooth curve, the rate of convergence slows down to $n^{1/3}$ and the limit distribution changes to that of the minimizer of a continuous Gaussian process. In this paper we provide a bridge between these two extreme scenarios by studying the limit behavior of change-point estimates under varying degrees of model mis-specification by smooth curves, which can be viewed as local alternatives. We find that the limiting regime depends on how quickly the alternatives approach a change-point model. We unravel a family of 'intermediate' limits that can transition, at least qualitatively, to the limits in the two extreme scenarios.

Preprint, 2015, arXiv

Residual Weighted Learning for Estimating Individualized Treatment Rules

Personalized medicine has received increasing attention among statisticians, computer scientists, and clinical practitioners. A major component of personalized medicine is the estimation of individualized treatment rules (ITRs). Recently, Zhao et al. (2012) proposed outcome weighted learning (OWL) to construct ITRs that directly optimize the clinical outcome. Although OWL opens the door to introducing machine learning techniques to optimal treatment regimes, it still has some problems in performance. In this article, we propose a general framework, called Residual Weighted Learning (RWL), to improve finite sample performance. Unlike OWL which weights misclassification errors by clinical outcomes, RWL weights these errors by residuals of the outcome from a regression fit on clinical covariates excluding treatment assignment. We utilize the smoothed ramp loss function in RWL, and provide a difference of convex (d.c.) algorithm to solve the corresponding non-convex optimization problem. By estimating residuals with linear models or generalized linear models, RWL can effectively deal with different types of outcomes, such as continuous, binary and count outcomes. We also propose variable selection methods for linear and nonlinear rules, respectively, to further improve the performance. We show that the resulting estimator of the treatment rule is consistent. We further obtain a rate of convergence for the difference between the expected outcome using the estimated ITR and that of the optimal treatment rule. The performance of the proposed RWL methods is illustrated in simulation studies and in an analysis of cystic fibrosis clinical trial data.
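To make the contrast between OWL and RWL weights concrete, the sketch below computes residuals of the outcome from a least-squares fit on a single covariate that excludes treatment assignment. The function names and the one-covariate linear model are simplifying assumptions for this example; the paper's optimization step (the smoothed ramp loss and the d.c. algorithm) is not shown.

```python
def fit_line(covariate, outcome):
    """Ordinary least squares for outcome = intercept + slope * covariate."""
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_y = sum(outcome) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_weights(covariate, outcome):
    """Residuals from a fit that ignores treatment (hypothetical helper).

    OWL would weight misclassification errors by the outcomes themselves;
    RWL, per the abstract, weights them by residuals such as these.
    """
    intercept, slope = fit_line(covariate, outcome)
    return [y - (intercept + slope * x) for x, y in zip(covariate, outcome)]


ages = [1.0, 2.0, 3.0, 4.0]        # made-up clinical covariate
responses = [2.0, 4.1, 5.9, 8.2]   # made-up clinical outcomes
weights = residual_weights(ages, responses)
shifted = residual_weights(ages, [y + 100.0 for y in responses])
```

In this toy setup, adding a constant to every outcome changes the raw outcomes that OWL would use as weights but leaves the residuals unchanged.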

Preprint, 2014, arXiv

Biclustering Via Sparse Clustering

In many situations it is desirable to identify clusters that differ with respect to only a subset of features. Such clusters may represent homogeneous subgroups of patients with a disease, such as cancer or chronic pain. We define a bicluster to be a submatrix U of a larger data matrix X such that the features and observations in U differ from those not contained in U. For example, the observations in U could have different means or variances with respect to the features in U. We propose a general framework for biclustering based on the sparse clustering method of Witten and Tibshirani (2010). We develop a method for identifying features that belong to biclusters. This framework can be used to identify biclusters that differ with respect to the means of the features, the variance of the features, or more general differences. We apply these methods to several simulated and real-world data sets and compare the results of our method with several previously published methods. The results of our method compare favorably with existing methods with respect to both predictive accuracy and computing time.

Preprint, 2013, arXiv

Support Vector Regression for Right Censored Data

We develop a unified approach for classification and regression support vector machines for data subject to right censoring. We provide finite sample bounds on the generalization error of the algorithm, prove risk consistency for a wide class of probability measures, and study the associated learning rates. We apply the general methodology to estimation of the (truncated) mean, median, quantiles, and for classification problems. We present a simulation study that demonstrates the performance of the proposed approach.

Preprint, 2012, arXiv

Likelihood based inference for current status data on a grid: A boundary phenomenon and an adaptive inference procedure

In this paper, we study the nonparametric maximum likelihood estimator for an event time distribution function at a point in the current status model with observation times supported on a grid of potentially unknown sparsity and with multiple subjects sharing the same observation time. This is of interest since observation time ties occur frequently with current status data. The grid resolution is specified as $cn^{-\gamma}$ with $c>0$ being a scaling constant and $\gamma>0$ regulating the sparsity of the grid relative to $n$, the number of subjects. The asymptotic behavior falls into three cases depending on $\gamma$: regular Gaussian-type asymptotics obtain for $\gamma<1/3$, nonstandard cube-root asymptotics prevail when $\gamma>1/3$ and $\gamma=1/3$ serves as a boundary at which the transition happens. The limit distribution at the boundary is different from either of the previous cases and converges weakly to those obtained with $\gamma\in(0,1/3)$ and $\gamma\in(1/3,\infty)$ as $c$ goes to $\infty$ and 0, respectively. This weak convergence allows us to develop an adaptive procedure to construct confidence intervals for the value of the event time distribution at a point of interest without needing to know or estimate $\gamma$, which is of enormous advantage from the perspective of inference. A simulation study of the adaptive procedure is presented.

Preprint, 2012, arXiv

Q-learning with censored data

We develop methodology for a multistage decision problem with a flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with a flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications in the design of personalized medicine trials in cancer and in other life-threatening diseases.

Preprint, 2011, arXiv

Penalized Q-Learning for Dynamic Treatment Regimes

A dynamic treatment regime effectively incorporates both accrued information and long-term effects of treatment from specially designed clinical trials. As these become more and more popular in conjunction with longitudinal data from clinical studies, the development of statistical inference for optimal dynamic treatment regimes is a high priority. This is very challenging due to the difficulties arising from non-regularities in the treatment effect parameters. In this paper, we propose a new reinforcement learning framework called penalized Q-learning (PQ-learning), under which the non-regularities can be resolved and valid statistical inference established. We also propose a new statistical procedure, individual selection, and corresponding methods for incorporating individual selection within PQ-learning. Extensive numerical studies are presented which compare the proposed methods with existing methods, under a variety of non-regular scenarios, and demonstrate that the proposed approach is both inferentially and computationally superior. The proposed method is demonstrated with the data from a depression clinical trial study.

Preprint, 2011, arXiv

Simultaneous critical values for $t$-tests in very high dimensions

This article considers the problem of multiple hypothesis testing using $t$-tests. The observed data are assumed to be independently generated conditional on an underlying and unknown two-state hidden model. We propose an asymptotically valid data-driven procedure to find critical values for rejection regions controlling the $k$-familywise error rate ($k$-FWER), false discovery rate (FDR) and the tail probability of false discovery proportion (FDTP) by using one-sample and two-sample $t$-statistics. We only require a finite fourth moment plus some very general conditions on the mean and variance of the population by virtue of the moderate deviations properties of $t$-statistics. A new consistent estimator for the proportion of alternative hypotheses is developed. Simulation studies support our theoretical results and demonstrate that the power of a multiple testing procedure can be substantially improved by using critical values directly, as opposed to the conventional $p$-value approach. Our method is applied in an analysis of the microarray data from a leukemia cancer study that involves testing a large number of hypotheses simultaneously.
