Researcher profile

Michael R. Kosorok

Michael R. Kosorok contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2026arXiv

Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

We study estimation of the conditional law $P(Y|X=x)$ and continuous functionals $Ψ(P(Y|X=x))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs; establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design provide finite sample performance and demonstrate the statistical error price of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: We estimate the tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.

preprint2022arXiv

Discussion of Multiscale Fisher's Independence Test for Multivariate Dependence

The multiscale Fisher's independence test (MULTIFIT hereafter) proposed by Gorsky & Ma (2022) is a novel method to test independence between two random vectors. By its design, this test is particularly useful in detecting local dependence. Moreover, by adopting a resampling-free approach, it can easily accommodate massive sample sizes. Another benefit of the proposed method is its ability to interpret the nature of dependency. We congratulate the authors, Shai Gorksy and Li Ma, for their very interesting and elegant work. In this comment, we would like to discuss a general framework unifying the MULTIFIT and other tests and compare it with the binary expansion randomized ensemble test (BERET hereafter) proposed by Lee et al. (In press). We also would like to contribute our thoughts on potential extensions of the method.

preprint2022arXiv

Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring

We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the failure time to be conditionally independent of censoring and dependent on the treatment decision times, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a certain time point. The estimator is constructed using generalized random survival forests and can have polynomial rates of convergence. Simulations and data analysis results suggest that the new estimator brings higher expected outcomes than existing methods in various settings. An R package dtrSurv is available on CRAN.

preprint2022arXiv

Risk-Adjusted Incidence Modeling on Hierarchical Survival Data with Recurrent Events

There is a constant need for many healthcare programs to timely address problems with infection prevention and control (IP&C). For example, pathogens can be transmitted among patients with cystic fibrosis (CF) in both the inpatient and outpatient settings within the healthcare system even with the existing recommended IP&C practices, and these pathogens are often associated with negative clinical outcomes. Because of limited and delayed data sharing, CF programs need a reliable method to track infection rates. There are three complex structures in CF registry data: recurrent infections, missing data, and multilevel correlation due to repeated measures within a patient and patient-to-patient transmissions. A step-by-step analysis pipeline was proposed to develop and validate a risk-adjusted model to help healthcare programs monitor the number of recurrent events while taking into account missing data and the hierarchies of repeated measures in right-censored data. We extended the mixed-effect Andersen-Gill model (the frailty model), adjusted for important risk factors, and provided confidence intervals for the predicted number of events where the variability of the prediction was estimated from three identified sources. The coverage of the estimated confidence intervals was used to evaluate model performance. Simulation results indicated that the coverage of our method was close to the desired confidence level. To demonstrate its clinical practicality, our pipeline was applied to monitor the infection incidence rate of two key CF pathogens using a U.S. registry. Results showed that years closer to the time of interest were better at predicting future incidence rates in the CF example.

preprint2021arXiv

The Binary Expansion Randomized Ensemble Test (BERET)

Recently, the binary expansion testing framework was introduced to test the independence of two continuous random variables by utilizing symmetry statistics that are complete sufficient statistics for dependence. We develop a new test based on an ensemble approach that uses the sum of squared symmetry statistics and distance correlation. Simulation studies suggest that this method improves the power while preserving the clear interpretation of the binary expansion testing. We extend this method to tests of independence of random vectors in arbitrary dimension. Through random projections, the proposed binary expansion randomized ensemble test transforms the multivariate independence testing problem into a univariate problem. Simulation studies and data example analyses show that the proposed method provides relatively robust performance compared with existing methods.

preprint2020arXiv

Estimation and Optimization of Composite Outcomes

There is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. An individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. A treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. However, clinical and intervention scientists often must balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. One approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. We propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. Estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. The estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. We derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.

preprint2020arXiv

Kernel Assisted Learning for Personalized Dose Finding

An individualized dose rule recommends a dose level within a continuous safe dose range based on patient level information such as physical conditions, genetic factors and medication histories. Traditionally, personalized dose finding process requires repeating clinical visits of the patient and frequent adjustments of the dosage. Thus the patient is constantly exposed to the risk of underdosing and overdosing during the process. Statistical methods for finding an optimal individualized dose rule can lower the costs and risks for patients. In this article, we propose a kernel assisted learning method for estimating the optimal individualized dose rule. The proposed methodology can also be applied to all other continuous decision-making problems. Advantages of the proposed method include robustness to model misspecification and capability of providing statistical inference for the estimated parameters. In the simulation studies, we show that this method is capable of identifying the optimal individualized dose rule and produces favorable expected outcomes in the population. Finally, we illustrate our approach using data from a warfarin dosing study for thrombosis patients.

preprint2020arXiv

Missing Data Imputation for Classification Problems

Imputation of missing data is a common application in various classification problems where the feature training matrix has missingness. A widely used solution to this imputation problem is based on the lazy learning technique, $k$-nearest neighbor (kNN) approach. However, most of the previous work on missing data does not take into account the presence of the class label in the classification problem. Also, existing kNN imputation methods use variants of Minkowski distance as a measure of distance, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on class weighted grey distance between the missing datum and all the training data. Grey distance works well in heterogeneous data with missing instances. The distance is weighted by Mutual Information (MI) which is a measure of feature relevance between the features and the class label. This ensures that the imputation of the training data is directed towards improving classification performance. This class weighted grey kNN imputation algorithm demonstrates improved performance when compared to other kNN imputation algorithms, as well as standard imputation algorithms such as MICE and missForest, in imputation and classification problems. These problems are based on simulated scenarios and UCI datasets with various rates of missingness.

preprint2020arXiv

Technical Background for "A Precision Medicine Approach to Develop and Internally Validate Optimal Exercise and Weight Loss Treatments for Overweight and Obese Adults with Knee Osteoarthritis"

We provide additional statistical background for the methodology developed in the clinical analysis of knee osteoarthritis in "A Precision Medicine Approach to Develop and Internally Validate Optimal Exercise and Weight Loss Treatments for Overweight and Obese Adults with Knee Osteoarthritis" (Jiang et al. 2020). Jiang et al. 2020 proposed a pipeline to learn optimal treatment rules with precision medicine models and compared them with zero-order models with a Z-test. The model performance was based on value functions, a scalar that predicts the future reward of each decision rule. The jackknife (i.e., leave-one-out cross validation) method was applied to estimate the value function and its variance of several outcomes in IDEA. IDEA is a randomized clinical trial studying three interventions (exercise (E), dietary weight loss (D), and D+E) on overweight and obese participants with knee osteoarthritis. In this report, we expand the discussion and justification with additional statistical background. We elaborate more on the background of precision medicine, the derivation of the jackknife estimator of value function and its estimated variance, the consistency property of jackknife estimator, as well as additional simulation results that reflect more of the performance of jackknife estimators. We recommend reading Jiang et al. 2020 for clinical application and interpretation of the optimal ITR of knee osteoarthritis as well as the overall understanding of the pipeline and recommend using this article to understand the underlying statistical derivation and methodology.