Researcher profile

Hongyu Zhao

Hongyu Zhao contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
8 topics
4 close collaborators

Published work

8 published items

Preprint, 2026, arXiv

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

Preprint, 2026, arXiv

Improve Power of Knockoffs with Annotation Information of Covariates

Genome-wide association studies (GWAS) often find association signals between many genetic variants and traits of interest in a genomic region. Functional annotations of these variants provide valuable prior information that helps prioritize biologically relevant variants and enhances the power to detect causal variants. However, due to substantial correlations among these variants, a critical question is how to rigorously control the false discovery rate while effectively leveraging prior knowledge. We introduce annotation-informed knockoffs (AnnoKn), a knockoff-based method that performs annotation-informed variable selection with strict control of the false discovery rate. AnnoKn integrates the knockoff procedure with adaptive Lasso regression to evaluate the importance of multiple covariates while incorporating functional annotation information within a unified Bayesian framework. To facilitate real-world applications where individual-level data are not accessible, we further extend AnnoKn to operate on summary statistics. Through simulations and real-world applications to GTEx and GWAS datasets, we show that AnnoKn achieves superior power in detecting causal genetic variants compared with existing annotation-informed variable selection methods, while maintaining valid control over false discoveries.
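To make the false discovery rate control mentioned above concrete, here is a minimal sketch of the selection step of the generic knockoff filter. It is not AnnoKn itself: it assumes the feature statistics W_j have already been computed by some knockoff procedure (large positive values favor the original variable over its knockoff), and the function names `knockoff_threshold` and `knockoff_select` are hypothetical.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Return the data-dependent threshold of the generic knockoff filter.

    w      : list of feature statistics W_j (assumed precomputed)
    q      : target false discovery rate
    offset : 1 gives the conservative "knockoff+" rule, 0 the original rule
    """
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false positives
        positives = sum(1 for x in w if x >= t)   # variables that would be selected
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target, so select nothing


def knockoff_select(w, q=0.1, offset=1):
    """Indices of the variables whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    stats = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, 1.5, 1.8, 2.2, 2.8]
    print(knockoff_threshold(stats, q=0.1))
    print(knockoff_select(stats, q=0.1))
```

The count of large negative statistics stands in for the number of false selections, which is why the ratio inside the loop serves as an estimate of the false discovery proportion at each candidate threshold.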

Preprint, 2022, arXiv

Statistical Inference of Cell-type Proportions Estimated from Bulk Expression Data

There is a growing interest in cell-type-specific analysis from bulk samples with a mixture of different cell types. A critical first step in such analyses is the accurate estimation of cell-type proportions in a bulk sample. Although many methods have been proposed recently, quantifying the uncertainties associated with the estimated cell-type proportions has not been well studied. Lack of consideration of these uncertainties can lead to missed or false findings in downstream analyses. In this article, we introduce a flexible statistical deconvolution framework that allows a general and subject-specific covariance of bulk gene expressions. Under this framework, we propose a decorrelated constrained least squares method called DECALS that estimates cell-type proportions as well as the sampling distribution of the estimates. Simulation studies demonstrate that DECALS can accurately quantify the uncertainties in the estimated proportions whereas other methods fail. Applying DECALS to analyze bulk gene expression data of post mortem brain samples from the ROSMAP and GTEx projects, we show that taking into account the uncertainties in the estimated cell-type proportions can lead to more accurate identifications of cell-type-specific differentially expressed genes and transcripts between different subject groups, such as between Alzheimer's disease patients and controls and between males and females.

Preprint, 2021, arXiv

A general kernel boosting framework integrating pathways for predictive modeling based on genomic data

Predictive modeling based on genomic data has gained popularity in biomedical research and clinical practice by allowing researchers and clinicians to identify biomarkers and tailor treatment decisions more efficiently. Analysis incorporating pathway information can boost discovery power and better connect new findings with biological mechanisms. In this article, we propose a general framework, Pathway-based Kernel Boosting (PKB), which incorporates clinical information and prior knowledge about pathways for prediction of binary, continuous and survival outcomes. We introduce appropriate loss functions and optimization procedures for different outcome types. Our prediction algorithm incorporates pathway knowledge by constructing kernel function spaces from the pathways and using them as base learners in the boosting procedure. Through extensive simulations and case studies in drug response and cancer survival datasets, we demonstrate that PKB can substantially outperform other competing methods, better identify biological pathways related to drug response and patient survival, and provide novel insights into cancer pathogenesis and treatment response.

Preprint, 2021, arXiv

Variance Estimation and Confidence Intervals from High-dimensional Genome-wide Association Studies Through Misspecified Mixed Model Analysis

We study variance estimation and associated confidence intervals for parameters characterizing genetic effects in misspecified mixed model analysis of genome-wide association studies (GWAS). Previous studies have shown that, in spite of the model misspecification, certain quantities of genetic interest are estimable, and consistent estimators of these quantities can be obtained using the restricted maximum likelihood (REML) method under a misspecified linear mixed model. However, the asymptotic variance of such a REML estimator is complicated and not ready to be implemented for practical use. In this paper, we develop practical and computationally convenient methods for estimating such asymptotic variances and constructing the associated confidence intervals. Performance of the proposed methods is evaluated empirically based on Monte-Carlo simulations and real-data application.

Preprint, 2020, arXiv

A Hidden Markov Model Based Unsupervised Algorithm for Sleep/Wake Identification Using Actigraphy

Actigraphy is widely used in sleep studies but lacks a universal unsupervised algorithm for sleep/wake identification. In this study, we propose a Hidden Markov Model (HMM) based unsupervised algorithm that can automatically and effectively infer sleep/wake states. It is an individualized data-driven approach that analyzes actigraphy from each individual separately to learn activity characteristics and further separate sleep and wake states. We used Actiwatch and polysomnography (PSG) data from 43 individuals in the Multi-Ethnic Study of Atherosclerosis to evaluate the performance of our method. Epoch-by-epoch comparisons were made between our HMM algorithm and that embedded in the Actiwatch software (AS). The percent agreement between HMM and PSG was 85.7%, and that between AS and PSG was 84.7%. Positive predictive values for sleep epochs were 85.6% and 84.6% for HMM and AS, respectively, and 95.5% and 85.6% for wake epochs. Both methods have similar performance and tend to overestimate sleep and underestimate wake compared to PSG. Our HMM approach is able to quantify the variability in activity counts, which allows us to differentiate relatively active and sedentary individuals: individuals with higher estimated variabilities tend to show more frequent sedentary behaviors. In conclusion, our unsupervised data-driven HMM algorithm achieves slightly better performance compared to the commonly used algorithm in the Actiwatch software. HMM can help expand the application of actigraphy in large-scale studies and in cases where intrusive PSG is hard to acquire or unavailable. In addition, the estimated HMM parameters can characterize individual activity patterns that can be utilized for further analysis.
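The evaluation metrics named in this abstract (percent agreement and positive predictive value per state) can be sketched as follows. This is an illustration of how such epoch-by-epoch comparisons are commonly computed, not the authors' evaluation code; the function name `agreement_metrics`, the labels "S" and "W", and the toy sequences are assumptions made for the example.

```python
def agreement_metrics(predicted, reference):
    """Compare two epoch-by-epoch label sequences ('S' = sleep, 'W' = wake).

    predicted : labels from the algorithm under test (e.g. an HMM)
    reference : labels from the gold standard (e.g. PSG scoring)
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")

    # Fraction of epochs where both sequences give the same label.
    matches = sum(p == r for p, r in zip(predicted, reference))

    def ppv(label):
        # Among epochs the algorithm called `label`, the share the reference confirms.
        called = [r for p, r in zip(predicted, reference) if p == label]
        if not called:
            return float("nan")
        return sum(r == label for r in called) / len(called)

    return {
        "agreement": matches / len(predicted),
        "ppv_sleep": ppv("S"),
        "ppv_wake": ppv("W"),
    }


if __name__ == "__main__":
    hmm_labels = list("SSSSWWSW")  # toy data, not from the study
    psg_labels = list("SSSWWWSS")
    print(agreement_metrics(hmm_labels, psg_labels))
```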

Preprint, 2020, arXiv

A set of efficient methods to generate high-dimensional binary data with specified correlation structures

High-dimensional correlated binary data arise in many areas, such as observed genetic variations in biomedical research. Data simulation can help researchers evaluate efficiency and explore properties of different computational and statistical methods. Also, some statistical methods, such as Monte-Carlo methods, rely on data simulation. Lunn and Davies (1998) proposed linear time complexity methods to generate correlated binary variables with three common correlation structures. However, it is infeasible to specify unequal probabilities in their methods. In this manuscript, we introduce several computationally efficient algorithms that generate high-dimensional binary data with specified correlation structures and unequal probabilities. Our algorithms have linear time complexity with respect to the dimension for three commonly studied correlation structures, namely exchangeable, decaying-product and K-dependent correlation structures. In addition, we extend our algorithms to generate binary data with general non-negative correlation matrices with quadratic time complexity. We provide an R package, CorBin, to implement our simulation methods. Compared to the existing packages for binary data generation, the time cost to generate a 100-dimensional binary vector with the common correlation structures and general correlation matrices can be reduced by up to $10^5$-fold and $10^3$-fold, respectively, and the efficiency can be further improved with the increase of dimensions. The R package CorBin is available on CRAN at https://cran.r-project.org/.
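As a point of reference for the equal-probability setting this abstract contrasts itself with, the following is a minimal sketch of a linear-time mixture construction for exchangeable correlation, in the spirit of the approach attributed to Lunn and Davies (1998). It is not the CorBin algorithm and does not handle unequal probabilities; the function name `exchangeable_binary` and its parameters are assumptions made for the example. Each coordinate copies a shared Bernoulli(p) draw with probability sqrt(rho) and is otherwise an independent Bernoulli(p) draw, so every marginal is p and every pairwise correlation is rho.

```python
import random


def exchangeable_binary(n_dim, p, rho, rng):
    """Draw one binary vector with common marginal p and pairwise correlation rho.

    n_dim : dimension of the vector
    p     : common success probability (equal for all coordinates)
    rho   : common non-negative pairwise correlation
    rng   : a random.Random instance, passed in for reproducibility
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("p and rho must lie in [0, 1]")
    shared = 1 if rng.random() < p else 0  # component common to all coordinates
    mix = rho ** 0.5                       # probability of copying the shared draw
    vector = []
    for _ in range(n_dim):
        if rng.random() < mix:
            vector.append(shared)
        else:
            vector.append(1 if rng.random() < p else 0)
    return vector


if __name__ == "__main__":
    rng = random.Random(0)
    print(exchangeable_binary(10, p=0.3, rho=0.4, rng=rng))
```

Two coordinates are dependent only when both copy the shared draw, which happens with probability rho, so their covariance is rho * p * (1 - p) and their correlation is rho. The cost is one pass over the coordinates, which is the linear time complexity the abstract refers to.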

Preprint, 2020, arXiv

Inference of Dynamic Graph Changes for Functional Connectome

Dynamic functional connectivity is an effective measure of the brain's responses to continuous stimuli. We propose an inferential method to detect the dynamic changes of brain networks based on time-varying graphical models. Whereas most existing methods focus on testing the existence of change points, the dynamics in the brain network offer more signals in many neuroscience studies. We propose a novel method to conduct hypothesis testing on changes in dynamic brain networks. We introduce a bootstrap statistic to approximate the supremum of the high-dimensional empirical processes over dynamically changing edges. Our simulations show that this framework can capture the change points with changed connectivity. Finally, we apply our method to a brain imaging dataset under a natural audio-video stimulus and illustrate that we are able to detect temporal changes in brain networks. The functions of the identified regions are consistent with specific emotional annotations, which are closely associated with changes inferred by our method.