Source author record

Yuan Ji

Yuan Ji appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Applications Computation Machine Learning Genomics Populations and Evolution

Catalog footprint

What is connected

16works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Precision Dose-Finding Design for Phase I Oncology Trials by Integrating Pharmacology Data

Phase I oncology trials aim to identify a safe dose - often the maximum tolerated dose (MTD) - for subsequent studies. Conventional designs focus on population-level toxicity modeling, with recent attention on leveraging pharmacokinetic (PK) data to improve dose selection. We propose the Precision Dose-Finding (PDF) design, a novel Bayesian phase I framework that integrates individual patient PK profiles into the dose-finding process. By incorporating patient-specific PK parameters (such as volume of distribution and elimination rate), PDF models toxicity risk at the individual level, in contrast to traditional methods that ignore inter-patient variability. The trial is structured in two stages: an initial training stage to update model parameters using cohort-based dose escalation, and a subsequent test stage in which doses for new patients are chosen based on each patient's own PK-predicted toxicity probability. This two-stage approach enables truly personalized dose assignment while maintaining rigorous safety oversight. Extensive simulation studies demonstrate the feasibility of PDF and suggest that it provides improved safety and dosing precision relative to the continual reassessment method (CRM). The PDF design thus offers a refined dose-finding strategy that tailors the MTD to individual patients, aligning phase I trials with the ideals of precision medicine.

preprint2026arXiv

Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates

In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes it difficult to jointly model them with low-dimensional covariates when the goal is to obtain interpretable effect estimates for covariate adjustment. Standard Bayesian additive regression trees (BART) provide strong predictive performance but treat all predictors uniformly within the tree ensemble, obscuring the contributions of significant covariates and complicating variable selection in high-dimensional settings. We propose a semi-parametric BART model (spBART) that addresses this limitation by modeling low-dimensional covariates through a parametric component with interpretable coefficients, while capturing complex nonlinear associations among high-dimensional predictors through the tree ensemble. To perform stable variable selection, we develop a cross-validation-based procedure that aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control. We apply the proposed method to a pooled case--control analysis of high-dimensional genome-wide 5-hydroxymethylcytosine profiles derived from circulating cell-free DNA in two multiple myeloma studies ($N = 869$). The approach identifies a parsimonious set of candidate loci and achieves strong out-of-sample discrimination (AUC $= 0.96$) in a held-out validation set. Overall, spBART provides a unified framework for combining interpretable covariate inference with flexible modeling and variable selection in high-dimensional biomedical studies.

preprint2023arXiv

A Class of Dependent Random Distributions Based on Atom Skipping

We propose the Plaid Atoms Model (PAM), a novel Bayesian nonparametric model for grouped data. Founded on an idea of `atom skipping', PAM is part of a well-established category of models that generate dependent random distributions and clusters across multiple groups. Atom skipping referrs to stochastically assigning 0 weights to atoms in an infinite mixture. Deploying atom skipping across groups, PAM produces a dependent clustering pattern with overlapping and non-overlapping clusters across groups. As a result, interpretable posterior inference is possible such as reporting the posterior probability of a cluster being exclusive to a single group or shared among a subset of groups. We discuss the theoretical properties of the proposed and related models. Minor extensions of the proposed model for multivariate or count data are presented. Simulation studies and applications using real-world datasets illustrate the performance of the new models with comparison to existing models.

preprint2021arXiv

Lessons Learned from the Bayesian Design and Analysis for the BNT162b2 COVID-19 Vaccine Phase 3 Trial

The phase III BNT162b2 mRNA COVID-19 vaccine trial is based on a Bayesian design and analysis, and the main evidence of vaccine efficacy is presented in Bayesian statistics. Confusion and mistakes are produced in the presentation of the Bayesian results. Some key statistics, such as Bayesian credible intervals, are mislabeled and stated as confidence intervals. Posterior probabilities of the vaccine efficacy are not reported as the main results. We illustrate the main differences in the reporting of Bayesian analysis results for a clinical trial and provide four recommendations. We argue that statistical evidence from a Bayesian trial, when presented properly, is easier to interpret and directly addresses the main clinical questions, thereby better supporting regulatory decision making. We also recommend using abbreviation "BI" to represent Bayesian credible intervals as a differentiation to "CI" which stands for confidence interval.

preprint2020arXiv

Consensus Monte Carlo for Random Subsets using Shared Anchors

We present a consensus Monte Carlo algorithm that scales existing Bayesian nonparametric models for clustering and feature allocation to big data. The algorithm is valid for any prior on random subsets such as partitions and latent feature allocation, under essentially any sampling model. Motivated by three case studies, we focus on clustering induced by a Dirichlet process mixture sampling model, inference under an Indian buffet process prior with a binomial sampling model, and with a categorical sampling model. We assess the proposed algorithm with simulation studies and show results for inference with three datasets: an MNIST image dataset, a dataset of pancreatic cancer mutations, and a large set of electronic health records (EHR). Supplementary materials for this article are available online.

preprint2020arXiv

Semiparametric Bayesian Inference for the Transmission Dynamics of COVID-19 with a State-Space Model

The outbreak of Coronavirus Disease 2019 (COVID-19) is an ongoing pandemic affecting over 200 countries and regions. Inference about the transmission dynamics of COVID-19 can provide important insights into the speed of disease spread and the effects of mitigation policies. We develop a novel Bayesian approach to such inference based on a probabilistic compartmental model using data of daily confirmed COVID-19 cases. In particular, we consider a probabilistic extension of the classical susceptible-infectious-recovered model, which takes into account undocumented infections and allows the epidemiological parameters to vary over time. We estimate the disease transmission rate via a Gaussian process prior, which captures nonlinear changes over time without the need of specific parametric assumptions. We utilize a parallel-tempering Markov chain Monte Carlo algorithm to efficiently sample from the highly correlated posterior space. Predictions for future observations are done by sampling from their posterior predictive distributions. Performance of the proposed approach is assessed using simulated datasets. Finally, our approach is applied to COVID-19 data from four states of the United States: Washington, New York, California, and Illinois. An R package BaySIR is made available at https://github.com/tianjianzhou/BaySIR for the public to conduct independent analysis or reproduce the results in this paper.

preprint2019arXiv

PoD-TPI: Probability-of-Decision Toxicity Probability Interval Design to Accelerate Phase I Trials

Cohort-based enrollment can slow down dose-finding trials since the outcomes of the previous cohort must be fully evaluated before the next cohort can be enrolled. This results in frequent suspension of patient enrollment. The issue is exacerbated in recent immune-oncology trials where toxicity outcomes can take a long time to observe. We propose a novel phase I design, the probability-of-decision toxicity probability interval (PoD-TPI) design, to accelerate phase I trials. PoD-TPI enables dose assignment in real-time in the presence of pending toxicity outcomes. With uncertain outcomes, the dose assignment decisions are treated as a random variable, and we calculate the posterior distribution of the decisions. The posterior distribution reflects the variability in the pending outcomes and allows a direct and intuitive evaluation of the confidence of all possible decisions. Optimal decisions are calculated based on 0-1 loss, and extra safety rules are constructed to enforce sufficient protection from exposing patients to risky doses. A new and useful feature of PoD-TPI is that it allows investigators and regulators to balance the trade-off between enrollment speed and making risky decisions by tuning a pair of intuitive design parameters. Through numerical studies, we evaluate the operating characteristics of PoD-TPI and demonstrate that PoD-TPI shortens trial duration and maintains trial safety and efficiency compared to existing time-to-event designs.

preprint2016arXiv

A Bayesian Interval Dose-Finding Design Addressing Ockham's Razor: mTPI-2

There has been an increasing interest in using interval-based Bayesian designs for dose finding, one of which is the modified toxicity probability interval (mTPI) method. We show that the decision rules in mTPI correspond to an optimal rule under a formal Bayesian decision theoretic framework. However, the probability models in mTPI are overly sharpened by the Ockham's razor, which, while in general helps with parsimonious statistical inference, leads to suboptimal decisions in small-sample inference such as dose finding. We propose a new framework that blunts the Ockham's razor, and demonstrate the superior performance of the new method, called mTPI-2. An online web tool is provided for users who can generate the design, conduct clinical trials, and examine operating characteristics of the designs through big data and crowd sourcing.

preprint2016arXiv

An Ensemble EM Algorithm for Bayesian Variable Selection

We study the Bayesian approach to variable selection in the context of linear regression. Motivated by a recent work by Rockova and George (2014), we propose an EM algorithm that returns the MAP estimate of the set of relevant variables. Due to its particular updating scheme, our algorithm can be implemented efficiently without inverting a large matrix in each iteration and therefore can scale up with big data. We also show that the MAP estimate returned by our EM algorithm achieves variable selection consistency even when $p$ diverges with $n$. In practice, our algorithm could get stuck with local modes, a common problem with EM algorithms. To address this issue, we propose an ensemble EM algorithm, in which we repeatedly apply the EM algorithm on a subset of the samples with a subset of the covariates, and then aggregate the variable selection results across those bootstrap replicates. Empirical studies have demonstrated the superior performance of the ensemble EM algorithm.

preprint2016arXiv

Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis

Constructing gene regulatory networks is a fundamental task in systems biology. We introduce a Gaussian reciprocal graphical model for inference about gene regulatory relationships by integrating mRNA gene expression and DNA level information including copy number and methylation. Data integration allows for inference on the directionality of certain regulatory relationships, which would be otherwise indistinguishable due to Markov equivalence. Efficient inference is developed based on simultaneous equation models. Bayesian model selection techniques are adopted to estimate the graph structure. We illustrate our approach by simulations and two applications in ZODIAC pairwise gene interaction analysis and colon adenocarcinoma pathway analysis.

preprint2015arXiv

A Bayesian feature allocation model for tumor heterogeneity

We develop a feature allocation model for inference on genetic tumor variation using next-generation sequencing data. Specifically, we record single nucleotide variants (SNVs) based on short reads mapped to human reference genome and characterize tumor heterogeneity by latent haplotypes defined as a scaffold of SNVs on the same homologous genome. For multiple samples from a single tumor, assuming that each sample is composed of some sample-specific proportions of these haplotypes, we then fit the observed variant allele fractions of SNVs for each sample and estimate the proportions of haplotypes. Varying proportions of haplotypes across samples is evidence of tumor heterogeneity since it implies varying composition of cell subpopulations. Taking a Bayesian perspective, we proceed with a prior probability model for all relevant unknown quantities, including, in particular, a prior probability model on the binary indicators that characterize the latent haplotypes. Such prior models are known as feature allocation models. Specifically, we define a simplified version of the Indian buffet process, one of the most traditional feature allocation models. The proposed model allows overlapping clustering of SNVs in defining latent haplotypes, which reflects the evolutionary process of subclonal expansion in tumor samples.

preprint2014arXiv

A Latent Gaussian Process Model with Application to Monitoring Clinical Trials

In many clinical trials treatments need to be repeatedly applied as diseases relapse frequently after remission over a long period of time (e.g., 35 weeks). Most research in statistics focuses on the overall trial design, such as sample size and power calculation, or on the data analysis after trials are completed. Little is done to improve the efficiency of trial monitoring, such as early termination of trials due to futility. The challenge faced in such trial monitoring is mostly caused by the need to properly model repeated outcomes from patients. We propose a Bayesian trial monitoring scheme for clinical trials with repeated and potentially cyclic binary outcomes. We construct a latent Gaussian process (LGP) to model discrete longitudinal data in those trials. LGP describes the underlying latent process that gives rise to the observed longitudinal binary outcomes. The posterior consistency property of the proposed model is studied. Posterior inference is conducted with a hybrid Monte Carlo algorithm. Simulation studies are conducted under various clinical scenarios, and a case study is reported based on a real-life trial.

preprint2014arXiv

Bayesian Inference for Tumor Subclones Accounting for Sequencing and Structural Variants

Tumor samples are heterogeneous. They consist of different subclones that are characterized by differences in DNA nucleotide sequences and copy numbers on multiple loci. Heterogeneity can be measured through the identification of the subclonal copy number and sequence at a selected set of loci. Understanding that the accurate identification of variant allele fractions greatly depends on a precise determination of copy numbers, we develop a Bayesian feature allocation model for jointly calling subclonal copy numbers and the corresponding allele sequences for the same loci. The proposed method utilizes three random matrices, L, Z and w to represent subclonal copy numbers (L), numbers of subclonal variant alleles (Z) and cellular fractions of subclones in samples (w), respectively. The unknown number of subclones implies a random number of columns for these matrices. We use next-generation sequencing data to estimate the subclonal structures through inference on these three matrices. Using simulation studies and a real data analysis, we demonstrate how posterior inference on the subclonal structure is enhanced with the joint modeling of both structure and sequencing variants on subclonal genomes. Software is available at http://compgenome.org/BayClone2.

preprint2014arXiv

Bayesian sparse graphical models for classification with application to protein expression data

Reverse-phase protein array (RPPA) analysis is a powerful, relatively new platform that allows for high-throughput, quantitative analysis of protein networks. One of the challenges that currently limit the potential of this technology is the lack of methods that allow for accurate data modeling and identification of related networks and samples. Such models may improve the accuracy of biological sample classification based on patterns of protein network activation and provide insight into the distinct biological relationships underlying different types of cancer. Motivated by RPPA data, we propose a Bayesian sparse graphical modeling approach that uses selection priors on the conditional relationships in the presence of class information. The novelty of our Bayesian model lies in the ability to draw information from the network data as well as from the associated categorical outcome in a unified hierarchical model for classification. In addition, our method allows for intuitive integration of a priori network information directly in the model and allows for posterior inference on the network topologies both within and between classes. Applying our methodology to an RPPA data set generated from panels of human breast cancer and ovarian cancer cell lines, we demonstrate that the model is able to distinguish the different cancer cell types more accurately than several existing models and to identify differential regulation of components of a critical signaling network (the PI3K-AKT pathway) between these two types of cancer. This approach represents a powerful new tool that can be used to improve our understanding of protein networks in cancer.

preprint2014arXiv

MAD Bayes for Tumor Heterogeneity Feature Allocation with Non-Normal Sampling

We propose small-variance asymptotic approximations for the inference of tumor heterogeneity (TH) using next-generation sequencing data. Understanding TH is an important and open research problem in biology. The lack of appropriate statistical inference is a critical gap in existing methods that the proposed approach aims to fill. We build on a hierarchical model with an exponential family likelihood and a feature allocation prior. The proposed approach generalizes similar small-variance approximations proposed by Kulis and Jordan (2012) and Broderick et.al (2012) for inference with Dirichlet process mixture and Indian buffet prior models under normal sampling. We show that the new algorithm can successfully recover latent structures of different subclones and is also magnitude faster than available Markov chain Monte Carlo samplers, the latter often practically infeasible for high-dimensional genomics data. The proposed approach is scalable, simple to implement and benefits from the flexibility of Bayesian nonparametric models. More importantly, it provides a useful tool for the biological community for estimating cell subtypes in tumor samples.

preprint2014arXiv

Subgroup-Based Adaptive (SUBA) Designs for Multi-Arm Biomarker Trials

Targeted therapies based on biomarker profiling are becoming a mainstream direction of cancer research and treatment. Depending on the expression of specific prognostic biomarkers, targeted therapies assign different cancer drugs to subgroups of patients even if they are diagnosed with the same type of cancer by traditional means, such as tumor location. For example, Herceptin is only indicated for the subgroup of patients with HER2+ breast cancer, but not other types of breast cancer. However, subgroups like HER2+ breast cancer with effective targeted therapies are rare and most cancer drugs are still being applied to large patient populations that include many patients who might not respond or benefit. Also, the response to targeted agents in human is usually unpredictable. To address these issues, we propose SUBA, subgroup-based adaptive designs that simultaneously search for prognostic subgroups and allocate patients adaptively to the best subgroup-specific treatments throughout the course of the trial. The main features of SUBA include the continuous reclassification of patient subgroups based on a random partition model and the adaptive allocation of patients to the best treatment arm based on posterior predictive probabilities. We compare the SUBA design with three alternative designs including equal randomization, outcome-adaptive randomization and a design based on a probit regression. In simulation studies we find that SUBA compares favorably against the alternatives.

Yuan Ji

What is connected

Connect this record

See the researcher in context

Building this map preview

16 published item(s)

Precision Dose-Finding Design for Phase I Oncology Trials by Integrating Pharmacology Data

Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates

A Class of Dependent Random Distributions Based on Atom Skipping

Lessons Learned from the Bayesian Design and Analysis for the BNT162b2 COVID-19 Vaccine Phase 3 Trial

Consensus Monte Carlo for Random Subsets using Shared Anchors

Semiparametric Bayesian Inference for the Transmission Dynamics of COVID-19 with a State-Space Model

PoD-TPI: Probability-of-Decision Toxicity Probability Interval Design to Accelerate Phase I Trials

A Bayesian Interval Dose-Finding Design Addressing Ockham's Razor: mTPI-2

An Ensemble EM Algorithm for Bayesian Variable Selection

Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis

A Bayesian feature allocation model for tumor heterogeneity

A Latent Gaussian Process Model with Application to Monitoring Clinical Trials

Bayesian Inference for Tumor Subclones Accounting for Sequencing and Structural Variants

Bayesian sparse graphical models for classification with application to protein expression data

MAD Bayes for Tumor Heterogeneity Feature Allocation with Non-Normal Sampling

Subgroup-Based Adaptive (SUBA) Designs for Multi-Arm Biomarker Trials