Source author record

Tianxi Cai

Tianxi Cai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Machine Learning math.ST Statistics Theory Artificial Intelligence Computation and Language Applications Social and Information Networks

Catalog footprint

What is connected

15works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Toward Global Large Language Models in Medicine

Despite continuous advances in medical technology, the global distribution of health care resources remains uneven. The development of large language models (LLMs) has transformed the landscape of medicine and holds promise for improving health care quality and expanding access to medical information globally. However, existing LLMs are primarily trained on high-resource languages, limiting their applicability in global medical scenarios. To address this gap, we constructed GlobMed, a large multilingual medical dataset, containing over 500,000 entries spanning 12 languages, including four low-resource languages. Building on this, we established GlobMed-Bench, which systematically assesses 56 state-of-the-art proprietary and open-weight LLMs across multiple multilingual medical tasks, revealing significant performance disparities across languages, particularly for low-resource languages. Additionally, we introduced GlobMed-LLMs, a suite of multilingual medical LLMs trained on GlobMed, with parameters ranging from 1.7B to 8B. GlobMed-LLMs achieved an average performance improvement of over 40% relative to baseline models, with a more than threefold increase in performance on low-resource languages. Together, these resources provide an important foundation for advancing the equitable development and application of LLMs globally, enabling broader language communities to benefit from technological advances.

preprint2023arXiv

Robust Inference for Federated Meta-Learning

Synthesizing information from multiple data sources is critical to ensure knowledge generalizability. Integrative analysis of multi-source data is challenging due to the heterogeneity across sources and data-sharing constraints due to privacy concerns. In this paper, we consider a general robust inference framework for federated meta-learning of data from multiple sites, enabling statistical inference for the prevailing model, defined as the one matching the majority of the sites. Statistical inference for the prevailing model is challenging since it requires a data-adaptive mechanism to select eligible sites and subsequently account for the selection uncertainty. We propose a novel sampling method to address the additional variation arising from the selection. Our devised CI construction does not require sites to share individual-level data and is shown to be valid without requiring the selection of eligible sites to be error-free. The proposed robust inference for federated meta-learning (RIFL) methodology is broadly applicable and illustrated with three inference problems: aggregation of parametric models, high-dimensional prediction models, and inference for average treatment effects. We use RIFL to perform federated learning of mortality risk for patients hospitalized with COVID-19 using real-world EHR data from 16 healthcare centers representing 275 hospitals across four countries.

preprint2022arXiv

Augmented Transfer Regression Learning with Semi-non-parametric Nuisance Models

In contemporary statistical learning, covariate shift correction plays an important role in transfer learning when distribution of the testing data is shifted from the training data. Importance weighting, as a natural and principle strategy to adjust for covariate shift, has been commonly used in the field of transfer learning. However, this strategy is not robust to model misspecification or excessive estimation error. In this paper, we propose an augmented transfer regression learning (ATReL) approach that introduces an imputation model for the targeted response, and uses it to augment the importance weighting equation. With novel semi-non-parametric constructions and calibrated moment estimating equations for the two nuisance models, our ATReL method is less prone to (i) the curse of dimensionality compared to nonparametric approaches, and (ii) model mis-specification than parametric approaches. We show that our ATReL estimator is root-n-consistent when at least one nuisance model is correctly specified, estimation for the parametric part of the nuisance models achieves parametric rate, and the nonparametric components are rate doubly robust. Simulation studies demonstrate that our method is more robust and efficient than existing parametric and fully nonparametric (machine learning) estimators under various configurations. We also examine the utility of our method through a real example about transfer learning of phenotyping algorithm for rheumatoid arthritis across different time windows. Finally, we propose ways to enhance the intrinsic efficiency of our estimator and to incorporate modern machine learning methods with our proposed framework.

preprint2022arXiv

Multimodal Learning on Graphs for Disease Relation Extraction

Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and normalize both concepts and relationship types. Methods: We introduce REMAP, a multimodal approach for disease relation extraction and classification. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, followed by aligning the multimodal embeddings for optimal disease relation extraction. Results: We apply REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves text-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with text information. Further, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). Conclusion: REMAP is a multimodal approach for extracting and classifying disease relationships by fusing structured knowledge and text information. REMAP provides a flexible neural architecture to easily find, access, and validate AI-driven relationships between disease concepts.

preprint2021arXiv

Risk Prediction with Imperfect Survival Outcome Information from Electronic Health Records

Readily available proxies for time of disease onset such as time of the first diagnostic code can lead to substantial risk prediction error if performing analyses based on poor proxies. Due to the lack of detailed documentation and labor intensiveness of manual annotation, it is often only feasible to ascertain for a small subset the current status of the disease by a follow up time rather than the exact time. In this paper, we aim to develop risk prediction models for the onset time efficiently leveraging both a small number of labels on current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error models for proxy onset time, we propose the semisupervised risk prediction method by combining information from proxies and limited labels efficiently. From an initial estimator solely based on the labelled subset, we perform a one-step correction with the full data augmenting against a mean zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semi-supervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite sample. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from Partners Biobank Electronic Health Records (EHR).

preprint2021arXiv

Semi-Supervised Off Policy Reinforcement Learning

Reinforcement learning (RL) has shown great success in estimating sequential treatment strategies which take into account patient heterogeneity. However, health-outcome information, which is used as the reward for reinforcement learning methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small sized labeled data with true outcome observed, and a large unlabeled data with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off policy value estimation. Generalizing SSL to sequential treatment regimes brings interesting challenges: 1) Feature distribution for Q-learning is unknown as it includes previous outcomes. 2) The surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative to the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to understand to what degree efficiency can be gained from SSL. Our method is at least as efficient as the supervised approach, and moreover safe as it robust to mis-specification of the imputation models.

preprint2020arXiv

Individual Data Protected Integrative Regression Analysis of High-dimensional Heterogeneous Data

Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high dimensional setting. The challenge is even more pronounced when the individual level data cannot be shared across studies, known as DataSHIELD constraint (Wolfson et al., 2010). Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through summary-statistics-based integrating procedure, accommodates between study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

preprint2020arXiv

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints

Identifying informative predictors in a high dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improve power is through meta-analyzing multiple studies on the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data but not individual level data can be shared across different sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection by allowing between study heterogeneity and not requiring sharing of individual level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the DSILT approach incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling for false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the DSILT procedure with the ideal individual--level meta--analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the DSILT procedure performs well in both false discovery control and attaining power. The proposed method is applied to a real example on detecting interaction effect of the genetic variants for statins and obesity on the risk for Type 2 Diabetes.

preprint2020arXiv

Optimal Statistical Inference for Individualized Treatment Effects in High-dimensional Models

The ability to predict individualized treatment effects (ITEs) based on a given patient's profile is essential for personalized medicine. We propose a hypothesis testing approach to choosing between two potential treatments for a given individual in the framework of high-dimensional linear models. The methodological novelty lies in the construction of a debiased estimator of the ITE and establishment of its asymptotic normality uniformly for an arbitrary future high-dimensional observation, while the existing methods can only handle certain specific forms of observations. We introduce a testing procedure with the type-I error controlled and establish its asymptotic power. The proposed method can be extended to making inference for general linear contrasts, including both the average treatment effect and outcome prediction. We introduce the optimality framework for hypothesis testing from both the minimaxity and adaptivity perspectives and establish the optimality of the proposed procedure. An extension to high-dimensional approximate linear models is also considered. The finite sample performance of the procedure is demonstrated in simulation studies and further illustrated through an analysis of electronic health records data from patients with rheumatoid arthritis.

preprint2016arXiv

Evaluating Surrogate Marker Information using Censored Data

Given the long follow-up periods that are often required for treatment or intervention studies, the potential to use surrogate markers to decrease the required follow-up time is a very attractive goal. However, previous studies have shown that using inadequate markers or making inappropriate assumptions about the relationship between the primary outcome and surrogate marker can lead to inaccurate conclusions regarding the treatment effect. Currently available methods for identifying and validating surrogate markers tend to rely on restrictive model assumptions and/or focus on uncensored outcomes. The ability to use such methods in practice when the primary outcome of interest is a time-to-event outcome is difficult due to censoring and missing surrogate information among those who experience the primary outcome before surrogate marker measurement. In this paper, we propose a novel definition of the proportion of treatment effect explained by surrogate information collected up to a specified time in the setting of a time-to-event primary outcome. Our proposed approach accommodates a setting where individuals may experience the primary outcome before the surrogate marker is measured. We propose a robust nonparametric procedure to estimate the defined quantity using censored data and use a perturbation-resampling procedure for variance estimation. Simulation studies demonstrate that the proposed procedures perform well in finite samples. We illustrate the proposed procedures by investigating two potential surrogate markers for diabetes using data from the Diabetes Prevention Program.

preprint2016arXiv

L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs

It is known that for a certain class of single index models (SIMs) $Y = f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbolβ_0, \varepsilon)$, support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0, \mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design $\boldsymbol{X}$ comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with $L_1$ penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of $\boldsymbolβ_0$ if $\boldsymbol{X} \sim \mathcal{N}(0, \boldsymbolΣ)$, and the covariance $\boldsymbolΣ$ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs.

preprint2015arXiv

Estimation and testing for multiple regulation of multivariate mixed outcomes

Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify {\em multiple regulators} or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully-observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.

preprint2015arXiv

Structured Matrix Completion with Applications to Genomic Data Integration

Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite sample under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extent of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival.

preprint2012arXiv

Nonconcave penalized composite conditional likelihood estimation of sparse Ising models

The Ising model is a useful tool for studying complex interactions within a system. The estimation of such a model, however, is rather challenging, especially in the presence of high-dimensional parameters. In this work, we propose efficient procedures for learning a sparse Ising model based on a penalized composite conditional likelihood with nonconcave penalties. Nonconcave penalized likelihood estimation has received a lot of attention in recent years. However, such an approach is computationally prohibitive under high-dimensional Ising models. To overcome such difficulties, we extend the methodology and theory of nonconcave penalized likelihood to penalized composite conditional likelihood estimation. The proposed method can be efficiently implemented by taking advantage of coordinate-ascent and minorization--maximization principles. Asymptotic oracle properties of the proposed method are established with NP-dimensionality. Optimality of the computed local solution is discussed. We demonstrate its finite sample performance via simulation studies and further illustrate our proposal by studying the Human Immunodeficiency Virus type 1 protease structure based on data from the Stanford HIV drug resistance database. Our statistical learning results match the known biological findings very well, although no prior biological information is used in the data analysis procedure.

preprint2010arXiv

Nonparametric inference procedure for percentiles of the random effects distribution in meta-analysis

To investigate whether treating cancer patients with erythropoiesis-stimulating agents (ESAs) would increase the mortality risk, Bennett et al. [Journal of the American Medical Association 299 (2008) 914--924] conducted a meta-analysis with the data from 52 phase III trials comparing ESAs with placebo or standard of care. With a standard parametric random effects modeling approach, the study concluded that ESA administration was significantly associated with increased average mortality risk. In this article we present a simple nonparametric inference procedure for the distribution of the random effects. We re-analyzed the ESA mortality data with the new method. Our results about the center of the random effects distribution were markedly different from those reported by Bennett et al. Moreover, our procedure, which estimates the distribution of the random effects, as opposed to just a simple population average, suggests that the ESA may be beneficial to mortality for approximately a quarter of the study populations. This new meta-analysis technique can be implemented with study-level summary statistics. In contrast to existing methods for parametric random effects models, the validity of our proposal does not require the number of studies involved to be large. From the results of an extensive numerical study, we find that the new procedure performs well even with moderate individual study sample sizes.

Tianxi Cai

What is connected

Connect this record

See the researcher in context

Building this map preview

15 published item(s)

Toward Global Large Language Models in Medicine

Robust Inference for Federated Meta-Learning

Augmented Transfer Regression Learning with Semi-non-parametric Nuisance Models

Multimodal Learning on Graphs for Disease Relation Extraction

Risk Prediction with Imperfect Survival Outcome Information from Electronic Health Records

Semi-Supervised Off Policy Reinforcement Learning

Individual Data Protected Integrative Regression Analysis of High-dimensional Heterogeneous Data

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints

Optimal Statistical Inference for Individualized Treatment Effects in High-dimensional Models

Evaluating Surrogate Marker Information using Censored Data

L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs

Estimation and testing for multiple regulation of multivariate mixed outcomes

Structured Matrix Completion with Applications to Genomic Data Integration

Nonconcave penalized composite conditional likelihood estimation of sparse Ising models

Nonparametric inference procedure for percentiles of the random effects distribution in meta-analysis