Researcher profile

Yajuan Si

Yajuan Si contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
8 works
0 followers
3 topics
4 close collaborators


Published work

8 published items

preprint · 2026 · arXiv

Multilevel Regression and Poststratification Interface: An Application to Track Community-level COVID-19 Viral Transmission

We present a novel Bayesian workflow for multilevel regression and poststratification (MRP), introducing extensions to time-varying data and granular geography and publicly available open-source computation tools, facilitating broad research adoption and reproducibility. In the absence of comprehensive or random testing throughout the COVID-19 pandemic, we have developed a proxy method for synthetic random sampling to estimate community-level viral incidence, based on viral RNA testing of asymptomatic patients who present for elective procedures within a hospital system. The approach collects routine testing data on SARS-CoV-2 exposure among outpatients and performs statistical adjustments of sample representation using MRP, a procedure that adjusts for nonrepresentativeness of the sample and yields stable small group estimates. We illustrate the MRP interface with an application to track community-level COVID-19 viral transmission in the state of Michigan.
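The poststratification step named in this abstract can be illustrated with a minimal sketch. It assumes the cell-level estimates have already been produced by some multilevel model, which is not shown here. The function name, the cell labels, and the numbers in the test are hypothetical and are not taken from the paper or its software.

```python
def poststratify(cell_estimates, cell_counts):
    """Combine cell-level estimates into one population-level estimate.

    cell_estimates: dict mapping a cell label to its model-based estimate
                    (for example, estimated incidence in that cell).
    cell_counts:    dict mapping the same labels to population counts.
    Both dicts are hypothetical inputs for illustration.
    """
    total = sum(cell_counts[cell] for cell in cell_estimates)
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    # Weight each cell's estimate by its share of the target population.
    weighted = sum(cell_estimates[cell] * cell_counts[cell]
                   for cell in cell_estimates)
    return weighted / total
```

The weights come from the population, not the sample, which is why the adjustment can correct for a sample whose demographic or geographic mix differs from the community.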

preprint · 2022 · arXiv

A Case Study of Nonresponse Bias Analysis In Educational Assessment Surveys

Nonresponse bias is a widely prevalent problem for data on education. We develop a ten-step exemplar to guide nonresponse bias analysis (NRBA) in cross-sectional studies and apply these steps to the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11. A key step is the construction of indices of nonresponse bias based on proxy pattern-mixture models for survey variables of interest. A novel feature is to characterize the strength of evidence about nonresponse bias contained in these indices, based on the strength of the relationship between the characteristics in the nonresponse adjustment and the key survey variables. Our NRBA improves existing methods by incorporating both missing at random and missing not at random mechanisms, and all analyses can be done straightforwardly with standard statistical software.

preprint · 2022 · arXiv

Beyond Vaccination Rates: A Synthetic Random Proxy Metric of Total SARS-CoV-2 Immunity Seroprevalence in the Community

Explicit knowledge of total community-level immune seroprevalence is critical to developing policies to mitigate the social and clinical impact of SARS-CoV-2. Publicly available vaccination data are frequently cited as a proxy for population immunity, but this metric ignores the effects of naturally-acquired immunity, which varies broadly throughout the country and world. Without broad or random sampling of the population, accurate measurement of persistent immunity post natural infection is generally unavailable. To enable tracking of both naturally-acquired and vaccine-induced immunity, we set up a synthetic random proxy based on routine hospital testing for estimating total Immunoglobulin G (IgG) prevalence in the sampled community. Our approach analyzes viral IgG testing data of asymptomatic patients who present for elective procedures within a hospital system. We apply multilevel regression and poststratification to adjust for demographic and geographic discrepancies between the sample and the community population. We then apply state-based vaccination data to categorize immune status as driven by natural infection or by vaccine. We have validated the model using verified clinical metrics of viral and symptomatic disease incidence to show the expected biological correlation of these entities with the timing, rate, and magnitude of seroprevalence. In mid-July 2021, the estimated immunity level was 74% with the administered vaccination rate of 45% in the two counties. The metric improves real-time understanding of immunity to COVID-19 as it evolves and the coordination of policy responses to the disease, toward an inexpensive and easily operational surveillance system that transcends the limits of vaccination datasets alone.

preprint · 2020 · arXiv

Bayesian hierarchical weighting adjustment and survey inference

We combine Bayesian prediction and weighted inference as a unified approach to survey inference. The general principles of Bayesian analysis imply that models for survey outcomes should be conditional on all variables that affect the probability of inclusion. We incorporate the weighting variables under the framework of multilevel regression and poststratification, as a byproduct generating model-based weights after smoothing. We investigate deep interactions and introduce structured prior distributions for smoothing and stability of estimates. The computation is done via Stan and implemented in the open source R package "rstanarm" ready for public use. Simulation studies illustrate that model-based prediction and weighting inference outperform classical weighting. We apply the proposal to the New York Longitudinal Study of Wellbeing. The new approach generates robust weights and increases efficiency for finite population inference, especially for subsets of the population.

preprint · 2020 · arXiv

Bayesian Profiling Multiple Imputation for Missing Electronic Health Records

Electronic health records (EHRs) are increasingly used for clinical and comparative effectiveness research, but suffer from missing data. Motivated by health services research on diabetes care, we seek to increase the quality of EHRs by focusing on missing values of longitudinal glycosylated hemoglobin (A1c), a key risk factor for diabetes complications and adverse events. Under the framework of multiple imputation (MI), we propose an individualized Bayesian latent profiling approach to capture A1c measurement trajectories subject to missingness. The proposed method is applied to EHRs of adult patients with diabetes who were seen in a large academic Midwestern health system between 2003 and 2013 and had Medicare A and B coverage. We combine MI inferences to evaluate the association of A1c levels with the incidence of acute adverse health events and examine patient heterogeneity across identified patient profiles. We investigate different missingness mechanisms and perform imputation diagnostics. Our approach is computationally efficient and fits flexible models that provide useful clinical insights.
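The abstract does not say how the MI inferences are combined. A common choice is Rubin's combining rules, and the sketch below assumes that choice for illustration only. The function name and inputs are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool point estimates from m imputed data sets with Rubin's rules.

    estimates: one point estimate per imputed data set.
    variances: the matching within-imputation variance estimates.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(variances)           # average within-imputation variance
    b = variance(estimates)           # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b   # total variance
    return q_bar, total
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the final standard error.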

preprint · 2019 · arXiv

Bayes-raking: Bayesian Finite Population Inference with Known Margins

Raking is widely used in categorical data modeling and survey practice but faces methodological and computational challenges. We develop a Bayesian paradigm for raking by incorporating the marginal constraints as a prior distribution via two main strategies: 1) constructing the solution subspaces via basis functions or a projection matrix and 2) modeling soft constraints. The proposed Bayes-raking estimation integrates the models for the margins, the sample selection and response mechanism, and the outcome, with the capability to propagate all sources of uncertainty. Computation is done via Stan, and the code is ready for public use. Simulation studies show that Bayes-raking can perform as well as raking with large samples and outperform it in terms of validity and efficiency gains, especially with a sparse contingency table or dependent raking factors. We apply the new method to the Longitudinal Study of Wellbeing and demonstrate that model-based approaches significantly improve inferential reliability and substantive findings as a unified survey inference framework.
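For readers unfamiliar with the baseline, the sketch below shows classical raking (iterative proportional fitting) on a two-way table. This is the conventional procedure the abstract compares against, not the Bayes-raking method itself. The function name, the iteration limit, and the tolerance are illustrative assumptions.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Classical raking of a two-way table to known row and column margins.

    table:       list of rows of non-negative sample counts or weights.
    row_targets: known population totals for each row.
    col_targets: known population totals for each column.
    Returns a new table; the input is left unchanged.
    """
    fitted = [list(row) for row in table]
    for _ in range(max_iter):
        # Scale each row so its sum matches the row target.
        for i, row in enumerate(fitted):
            row_sum = sum(row)
            if row_sum > 0:
                factor = row_targets[i] / row_sum
                fitted[i] = [value * factor for value in row]
        # Scale each column so its sum matches the column target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            if col_sum > 0:
                factor = target / col_sum
                for row in fitted:
                    row[j] *= factor
        # Columns now match exactly; stop once rows match as well.
        worst = max(abs(sum(row) - row_targets[i])
                    for i, row in enumerate(fitted))
        if worst < tol:
            break
    return fitted
```

Cells that start at zero stay at zero under this scheme, which is one reason sparse contingency tables are hard for classical raking.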

preprint · 2017 · arXiv

Bayesian Inference under Cluster Sampling with Probability Proportional to Size

Cluster sampling is common in survey practice, and the corresponding inference has been predominantly design-based. We develop a Bayesian framework for cluster sampling and account for the design effect in the outcome modeling. We consider a two-stage cluster sampling design where the clusters are first selected with probability proportional to cluster size, and then units are randomly sampled inside selected clusters. Challenges arise when the sizes of nonsampled clusters are unknown. We propose nonparametric and parametric Bayesian approaches for predicting the unknown cluster sizes, with this inference performed simultaneously with the model for survey outcome. Simulation studies show that the integrated Bayesian approach outperforms classical methods with efficiency gains. We use Stan for computing and apply the proposal to the Fragile Families and Child Wellbeing study as an illustration of complex survey inference in health surveys.
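The two-stage design described here can be made concrete with a small sketch of the classical design weights. It assumes a fixed number of clusters drawn with probability proportional to size and an equal-size simple random sample within each selected cluster. The equal within-cluster sample size and all names are assumptions for illustration, not details from the paper.

```python
def pps_inclusion_probs(cluster_sizes, n_clusters):
    """First-stage inclusion probabilities under PPS selection."""
    total = sum(cluster_sizes)
    probs = [n_clusters * size / total for size in cluster_sizes]
    if any(p > 1 for p in probs):
        raise ValueError("a cluster is too large for this sample size")
    return probs


def unit_weight(cluster_size, total_size, n_clusters, m_per_cluster):
    """Design weight of a unit sampled in a cluster of the given size."""
    if m_per_cluster > cluster_size:
        raise ValueError("cannot sample more units than the cluster holds")
    first_stage = n_clusters * cluster_size / total_size   # cluster selected
    second_stage = m_per_cluster / cluster_size            # unit selected in cluster
    return 1.0 / (first_stage * second_stage)
```

Both functions need the total size across all clusters, including the ones that were never sampled, which shows why unknown sizes of nonsampled clusters are a problem for inference.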

preprint · 2015 · arXiv

Bayesian Nonparametric Weighted Sampling Inference

It has historically been a challenge to perform Bayesian inference in a design-based survey context. The present paper develops a Bayesian model for sampling inference in the presence of inverse-probability weights. We use a hierarchical approach in which we model the distribution of the weights of the nonsampled units in the population and simultaneously include them as predictors in a nonparametric Gaussian process regression. We use simulation studies to evaluate the performance of our procedure and compare it to the classical design-based estimator. We apply our method to the Fragile Families and Child Wellbeing Study. Our studies find the Bayesian nonparametric finite population estimator to be more robust than the classical design-based estimator without loss in efficiency. This works because the model induces regularization for small cells, which automatically smooths the highly variable weights.
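As a point of reference for the comparison in this abstract, the sketch below computes a weighted mean with inverse-probability weights (the Hájek form). The abstract does not name the exact classical estimator used, so this form is an assumption, and the function name is hypothetical.

```python
def weighted_mean(values, weights):
    """Inverse-probability-weighted mean of sampled values (Hajek form).

    values:  observed outcomes for the sampled units.
    weights: inverse inclusion probabilities for the same units.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    weight_total = sum(weights)
    if weight_total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Each unit stands in for 'weight' units of the population.
    return sum(v * w for v, w in zip(values, weights)) / weight_total
```

A few units with very large weights can dominate this estimate, which is the variability that the smoothing described in the abstract is meant to reduce.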