Source author record

Hongyu Zhao

Hongyu Zhao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology Applications Genomics math.ST Statistics Theory Machine Learning Quantitative Methods Artificial Intelligence Computation Computer Vision Populations and Evolution

Catalog footprint

What is connected

31works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

preprint2026arXiv

Improve Power of Knockoffs with Annotation Information of Covariates

Genome-wide association studies (GWAS) often find association signals between many genetic variants and traits of interest in a genomic region. Functional annotations of these variants provide valuable prior information that helps prioritize biologically relevant variants and enhances the power to detect causal variants. However, due to substantial correlations among these variants, a critical question is how to rigorously control the false discovery rate while effectively leveraging prior knowledge. We introduce annotation-informed knockoffs (AnnoKn), a knockoff-based method that performs annotation-informed variable selection with strict control of the false discovery rate. AnnoKn integrates the knockoff procedure with adaptive Lasso regression to evaluate the importance of multiple covariates while incorporating functional annotation information within a unified Bayesian framework. To facilitate real-world applications where individual-level data are not accessible, we further extend AnnoKn to operate on summary statistics. Through simulations and real-world applications to GTEx and GWAS datasets, we show that AnnoKn achieves superior power in detecting causal genetic variants compared with existing annotation-informed variable selection methods, while maintaining valid control over false discoveries.

preprint2022arXiv

Statistical Inference of Cell-type Proportions Estimated from Bulk Expression Data

There is a growing interest in cell-type-specific analysis from bulk samples with a mixture of different cell types. A critical first step in such analyses is the accurate estimation of cell-type proportions in a bulk sample. Although many methods have been proposed recently, quantifying the uncertainties associated with the estimated cell-type proportions has not been well studied. Lack of consideration of these uncertainties can lead to missed or false findings in downstream analyses. In this article, we introduce a flexible statistical deconvolution framework that allows a general and subject-specific covariance of bulk gene expressions. Under this framework, we propose a decorrelated constrained least squares method called DECALS that estimates cell-type proportions as well as the sampling distribution of the estimates. Simulation studies demonstrate that DECALS can accurately quantify the uncertainties in the estimated proportions whereas other methods fail. Applying DECALS to analyze bulk gene expression data of post mortem brain samples from the ROSMAP and GTEx projects, we show that taking into account the uncertainties in the estimated cell-type proportions can lead to more accurate identifications of cell-type-specific differentially expressed genes and transcripts between different subject groups, such as between Alzheimer's disease patients and controls and between males and females.

preprint2021arXiv

A general kernel boosting framework integrating pathways for predictive modeling based on genomic data

Predictive modeling based on genomic data has gained popularity in biomedical research and clinical practice by allowing researchers and clinicians to identify biomarkers and tailor treatment decisions more efficiently. Analysis incorporating pathway information can boost discovery power and better connect new findings with biological mechanisms. In this article, we propose a general framework, Pathway-based Kernel Boosting (PKB), which incorporates clinical information and prior knowledge about pathways for prediction of binary, continuous and survival outcomes. We introduce appropriate loss functions and optimization procedures for different outcome types. Our prediction algorithm incorporates pathway knowledge by constructing kernel function spaces from the pathways and use them as base learners in the boosting procedure. Through extensive simulations and case studies in drug response and cancer survival datasets, we demonstrate that PKB can substantially outperform other competing methods, better identify biological pathways related to drug response and patient survival, and provide novel insights into cancer pathogenesis and treatment response.

preprint2021arXiv

Variance Estimation and Confidence Intervals from High-dimensional Genome-wide Association Studies Through Misspecified Mixed Model Analysis

We study variance estimation and associated confidence intervals for parameters characterizing genetic effects from genome-wide association studies (GWAS) misspecified mixed model analysis. Previous studies have shown that, in spite of the model misspecification, certain quantities of genetic interests are estimable, and consistent estimators of these quantities can be obtained using the restricted maximum likelihood (REML) method under a misspecified linear mixed model. However, the asymptotic variance of such a REML estimator is complicated and not ready to be implemented for practical use. In this paper, we develop practical and computationally convenient methods for estimating such asymptotic variances and constructing the associated confidence intervals. Performance of the proposed methods is evaluated empirically based on Monte-Carlo simulations and real-data application.

preprint2020arXiv

A Hidden Markov Model Based Unsupervised Algorithm for Sleep/Wake Identification Using Actigraphy

Actigraphy is widely used in sleep studies but lacks a universal unsupervised algorithm for sleep/wake identification. In this study, we proposed a Hidden Markov Model (HMM) based unsupervised algorithm that can automatically and effectively infer sleep/wake states. It is an individualized data-driven approach that analyzes actigraphy from each individual respectively to learn activity characteristics and further separate sleep and wake states. We used Actiwatch and polysomnography (PSG) data from 43 individuals in the Multi-Ethnic Study of Atherosclerosis to evaluate the performance of our method. Epoch-by-epoch comparisons were made between our HMM algorithm and that embedded in the Actiwatch software (AS). The percent agreement between HMM and PSG was 85.7%, and that between AS and PSG was 84.7%. Positive predictive values for sleep epochs were 85.6% and 84.6% for HMM and AS, respectively, and 95.5% and 85.6% for wake epochs. Both methods have similar performance and tend to overestimate sleep and underestimate wake compared to PSG. Our HMM approach is able to quantify the variability in activity counts that allow us to differentiate relatively active and sedentary individuals: individuals with higher estimated variabilities tend to show more frequent sedentary behaviors. In conclusion, our unsupervised data-driven HMM algorithm achieves slightly better performance compared to the commonly used algorithm in the Actiwatch software. HMM can help expand the application of actigraphy in large-scale studies and in cases where intrusive PSG is hard to acquire or unavailable. In addition, the estimated HMM parameters can characterize individual activity patterns that can be utilized for further analysis.

preprint2020arXiv

A set of efficient methods to generate high-dimensional binary data with specified correlation structures

High dimensional correlated binary data arise in many areas, such as observed genetic variations in biomedical research. Data simulation can help researchers evaluate efficiency and explore properties of different computational and statistical methods. Also, some statistical methods, such as Monte-Carlo methods, rely on data simulation. Lunn and Davies (1998) proposed linear time complexity methods to generate correlated binary variables with three common correlation structures. However, it is infeasible to specify unequal probabilities in their methods. In this manuscript, we introduce several computationally efficient algorithms that generate high-dimensional binary data with specified correlation structures and unequal probabilities. Our algorithms have linear time complexity with respect to the dimension for three commonly studied correlation structures, namely exchangeable, decaying-product and K-dependent correlation structures. In addition, we extend our algorithms to generate binary data of general non-negative correlation matrices with quadratic time complexity. We provide an R package, CorBin, to implement our simulation methods. Compared to the existing packages for binary data generation, the time cost to generate a 100-dimensional binary vector with the common correlation structures and general correlation matrices can be reduced up to $10^5$ folds and $10^3$ folds, respectively, and the efficiency can be further improved with the increase of dimensions. The R package CorBin is available on CRAN at https://cran.r-project.org/.

preprint2020arXiv

Inference of Dynamic Graph Changes for Functional Connectome

Dynamic functional connectivity is an effective measure for the brain's responses to continuous stimuli. We propose an inferential method to detect the dynamic changes of brain networks based on time-varying graphical models. Whereas most existing methods focus on testing the existence of change points, the dynamics in the brain network offer more signals in many neuroscience studies. We propose a novel method to conduct hypothesis testing on changes in dynamic brain networks. We introduce a bootstrap statistic to approximate the supreme of the high-dimensional empirical processes over dynamically changing edges. Our simulations show that this framework can capture the change points with changed connectivity. Finally, we apply our method to a brain imaging dataset under a natural audio-video stimulus and illustrate that we are able to detect temporal changes in brain networks. The functions of the identified regions are consistent with specific emotional annotations, which are closely associated with changes inferred by our method.

preprint2016arXiv

A Bayesian Semiparametric Factor Analysis Model for Subtype Identification

Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering due to the high dimensionality. In this article, we introduce a novel subtype identification method in the Bayesian setting based on gene expression profiles. This method, called BCSub, adopts an innovative semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow the Dirichlet process mixture model in order to induce clustering. Through extensive simulation studies, we show that BCSub has improved performance over commonly used clustering methods. When applied to two gene expression datasets, our model is able to identify subtypes that are clinically more relevant than those identified from the existing methods.

preprint2016arXiv

A Dirichlet Process Mixture Model for Clustering Longitudinal Gene Expression Data

Subgroup identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to define subgroups. Longitudinal gene expression profiles might provide additional information on disease progression than what is captured by baseline profiles alone. Moreover, the longitudinal gene expression data allows for intra-individual variability to be accounted for when grouping patients. Therefore, subgroup identification could be more accurate and effective with the aid of longitudinal gene expression data. However, existing statistical methods are unable to fully utilize these data for patient clustering. In this article, we introduce a novel subgroup identification method in the Bayesian setting based on longitudinal gene expression profiles. This method, called BClustLonG, adopts a linear mixed-effects framework to model the trajectory of genes over time while clustering is jointly conducted based on the regression coefficients obtained from all genes. In order to account for the correlations among genes and alleviate the high dimensionality challenges, we adopt a factor analysis model for the regression coefficients. The Dirichlet process prior distribution is utilized for the means of the regression coefficients to induce clustering. Through extensive simulation studies, we show that BClustLonG has improved performance over other clustering methods. When applied to a dataset of severely injured (burn or trauma) patients, our model is able to distinguish burn patients from trauma patients and identify interesting subgroups in trauma patients.

preprint2016arXiv

Distance-Correlation based Gene Set Analysis in Longitudinal Studies

Longitudinal gene expression profiles of patients are collected in some clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and hence may lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric approach, statistically robust, and can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across patients due to the patient heterogeneity, alleviate the confounding effects of some unobserved covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful of detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to multiple real datasets of different characteristics, we are able to identify more disease related gene sets than other methods.

preprint2016arXiv

Joint Models for Time-to-Event Data and Longitudinal Biomarkers of High Dimension

Joint models for longitudinal biomarkers and time-to-event data are widely used in longitudinal studies. Many joint modeling approaches have been proposed to deal with different types of longitudinal biomarkers and survival outcomes. However, most existing joint modeling methods cannot deal with a large number of longitudinal biomarkers simultaneously, such as the longitudinally collected gene expression profiles. In this article, we propose a new joint modeling method under the Bayesian framework, which is able to deal with longitudinal biomarkers of high dimension. Specifically, we assume that only a few unobserved latent variables are related to the survival outcome and the latent variables are inferred using a factor analysis model, which greatly reduces the dimensionality of the biomarkers and also accounts for the high correlations among the biomarkers. Through extensive simulation studies, we show that our proposed method has improved prediction accuracy over other joint modeling methods. We illustrate the usefulness of our method on a dataset of idiopathic pulmonary fibrosis patients in which we are interested in predicting the patients' time-to-death using their gene expression profiles.

preprint2016arXiv

Ultrahigh Dimensional Feature Selection via Kernel Canonical Correlation Analysis

High-dimensional variable selection is an important issue in many scientific fields, such as genomics. In this paper, we develop a sure independence feature screening pro- cedure based on kernel canonical correlation analysis (KCCA-SIS, for short). KCCA- SIS is easy to be implemented and applied. Compared to the sure independence screen- ing procedure based on the Pearson correlation (SIS, for short) developed by Fan and Lv [2008], KCCA-SIS can handle nonlinear dependencies among variables. Compared to the sure independence screening procedure based on the distance correlation (DC- SIS, for short) proposed by Li et al. [2012], KCCA-SIS is scale free, distribution free and has better approximation results based on the universal characteristic of Gaussian Kernel (Micchelli et al. [2006]). KCCA-SIS is more general than SIS and DC-SIS in the sense that SIS and DC-SIS correspond to certain choice of kernels. Compared to supremum of Hilbert Schmidt independence criterion-Sure independence screening (sup-HSIC-SIS, for short) developed by Balasubramanian et al. [2013], KCCA-SIS is scale free removing the marginal variation of features and response variables. No model assumption is needed between response and predictors to apply KCCA-SIS and it can be used in ultrahigh dimensional data analysis. Similar to DC-SIS and sup- HSIC-SIS, KCCA-SIS can also be used directly to screen grouped predictors and for multivariate response variables. We show that KCCA-SIS has the sure screening prop- erty, and has better performance through simulation studies. We applied KCCA-SIS to study Autism genes in a spatiotemporal gene expression dataset for human brain development, and obtained better results based on gene ontology enrichment analysis comparing to the other existing methods.

preprint2015arXiv

A Markov random field-based approach to characterizing human brain development using spatial-temporal transcriptome data

Human neurodevelopment is a highly regulated biological process. In this article, we study the dynamic changes of neurodevelopment through the analysis of human brain microarray data, sampled from 16 brain regions in 15 time periods of neurodevelopment. We develop a two-step inferential procedure to identify expressed and unexpressed genes and to detect differentially expressed genes between adjacent time periods. Markov Random Field (MRF) models are used to efficiently utilize the information embedded in brain region similarity and temporal dependency in our approach. We develop and implement a Monte Carlo expectation-maximization (MCEM) algorithm to estimate the model parameters. Simulation studies suggest that our approach achieves lower misclassification error and potential gain in power compared with models not incorporating spatial similarity and temporal dependency.

preprint2015arXiv

NBLDA: Negative Binomial Linear Discriminant Analysis for RNA-Seq Data

RNA-sequencing (RNA-Seq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods that have been developed for microarray data can be applied to RNA-Seq data, they are not ideal due to the discrete nature of RNA-Seq data. The Poisson distribution and negative binomial distribution are commonly used to model count data. Recently, Witten (2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may not be as appropriate as negative binomial distribution when biological replicates are available and in the presence of overdispersion (i.e., when the variance is larger than the mean). However, it is more complicated to model negative binomial variables because they involve a dispersion parameter that needs to be estimated. In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. By Bayes' rule, we construct the classifier by fitting a negative binomial model, and propose some plug-in rules to estimate the unknown parameters in the classifier. The relationship between the negative binomial classifier and the Poisson classifier is explored, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show the superiority of our proposed method. We also analyze four real RNA-Seq data sets to demonstrate the advantage of our method in real-world applications.

preprint2015arXiv

On Joint Estimation of Gaussian Graphical Models for Spatial and Temporal Data

In this paper, we first propose a Bayesian neighborhood selection method to estimate Gaussian Graphical Models (GGMs). We show the graph selection consistency of this method in the sense that the posterior probability of the true model converges to one. When there are multiple groups of data available, instead of estimating the networks independently for each group, joint estimation of the networks may utilize the shared information among groups and lead to improved estimation for each individual network. Our method is extended to jointly estimate GGMs in multiple groups of data with complex structures, including spatial data, temporal data and data with both spatial and temporal structures. Markov random field (MRF) models are used to efficiently incorporate the complex data structures. We develop and implement an efficient algorithm for statistical inference that enables parallel computing. Simulation studies suggest that our approach achieves better accuracy in network estimation compared with methods not incorporating spatial and temporal dependencies when there are shared structures among the networks, and that it performs comparably well otherwise. Finally, we illustrate our method using the human brain gene expression microarray dataset, where the expression levels of genes are measured in different brain regions across multiple time periods.

preprint2015arXiv

Posterior Contraction Rates of the Phylogenetic Indian Buffet Processes

By expressing prior distributions as general stochastic processes, nonparametric Bayesian methods provide a flexible way to incorporate prior knowledge and constrain the latent structure in statistical inference. The Indian buffet process (IBP) is such an example that can be used to define a prior distribution on infinite binary features, where the exchangeability among subjects is assumed. The phylogenetic Indian buffet process (pIBP), a derivative of IBP, enables the modeling of non-exchangeability among subjects through a stochastic process on a rooted tree, which is similar to that used in phylogenetics, to describe relationships among the subjects. In this paper, we study the theoretical properties of IBP and pIBP under a binary factor model. We establish the posterior contraction rates for both IBP and pIBP and substantiate the theoretical results through simulation studies. This is the first work addressing the frequentist property of the posterior behaviors of IBP and pIBP. We also demonstrated its practical usefulness by applying pIBP prior to a real data example arising in the field of cancer genomics where the exchangeability among subjects is violated.

preprint2014arXiv

Change Point Analysis of Histone Modifications Reveals Epigenetic Blocks Linking to Physical Domains

Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume one combination of histone modification invariantly translates to one transcriptional output regardless of local chromatin environment. In this study we hypothesize that, the genome is organized into local domains that manifest similar enrichment pattern of histone modification, which leads to orchestrated regulation of expression of genes with relevant bio- logical functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks across the chromosome with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment pattern of histone modifications. We further characterized BLOCKs by their transcription levels, distribution of genes, degree of co-regulation and GO enrichment. Our results demonstrate that these BLOCKs, although inferred merely from histone modifications, reveal strong relevance with physical domains, which suggest their important roles in chromatin organization and coordinated gene regulation.

preprint2014arXiv

Detection boundary and Higher Criticism approach for rare and weak genetic effects

Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits' genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors likely are weakly associated at the population level and sparsely distributed across the genome. In this paper, we adapt the recent innovations on Tukey's Higher Criticism (Tukey [The Higher Criticism (1976) Princeton Univ.]; Donoho and Jin [Ann. Statist. 32 (2004) 962-994]) to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. In the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space that quantifies the rarity and strength of genetic effects. Above the detection boundary, the overall effects of genetic factors are strong enough for reliable detection. Below the detection boundary, the genetic factors are simply too rare and too weak for reliable detection. We show that the HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn's disease.

preprint2014arXiv

Estimation for ultra-high dimensional factor model: a pivotal variable detection based approach

For factor model, the involved covariance matrix often has no row sparse structure because the common factors may lead some variables to strongly associate with many others. Under the ultra-high dimensional paradigm, this feature causes existing methods for sparse covariance matrix in the literature not directly applicable. In this paper, for general covariance matrix, a novel approach to detect these variables that is called the pivotal variables is suggested. Then, two-stage estimation procedures are proposed to handle ultra-high dimensionality in factor model. In these procedures, pivotal variable detection is performed as a screening step and then existing approaches are applied to refine the working model. The estimation efficiency can be promoted under weaker assumptions on the model structure. Simulations are conducted to examine the performance of the new method and a real dataset is analysed for illustration.

preprint2014arXiv

GPA: A statistical approach to prioritizing GWAS results by integrating pleiotropy information and annotation data

Genome-wide association studies (GWAS) suggests that a complex disease is typically affected by many genetic variants with small or moderate effects. Identification of these risk variants remains to be a very challenging problem. Traditional approaches focusing on a single GWAS dataset alone ignore relevant information that could potentially improve our ability to detect these variants. We proposed a novel statistical approach, named GPA, to performing integrative analysis of multiple GWAS datasets and functional annotations. Hypothesis testing procedures were developed to facilitate statistical inference of pleiotropy and enrichment of functional annotation. We applied our approach to perform systematic analysis of five psychiatric disorders. Not only did GPA identify many weak signals missed by the original single phenotype analysis, but also revealed interesting genetic architectures of these disorders. We also applied GPA to the bladder cancer GWAS data with the ENCODE DNase-seq data from 125 cell lines and showed that GPA can detect cell lines that are more biologically relevant to the phenotype of interest.

preprint2014arXiv

High-dimensional genome-wide association study and misspecified mixed model analysis

We study behavior of the restricted maximum likelihood (REML) estimator under a misspecified linear mixed model (LMM) that has received much attention in recent gnome-wide association studies. The asymptotic analysis establishes consistency of the REML estimator of the variance of the errors in the LMM, and convergence in probability of the REML estimator of the variance of the random effects in the LMM to a certain limit, which is equal to the true variance of the random effects multiplied by the limiting proportion of the nonzero random effects present in the LMM. The aymptotic results also establish convergence rate (in probability) of the REML estimators as well as a result regarding convergence of the asymptotic conditional variance of the REML estimator. The asymptotic results are fully supported by the results of empirical studies, which include extensive simulation studies that compare the performance of the REML estimator (under the misspecified LMM) with other existing methods.

preprint2014arXiv

Low-Rank Modeling and Its Applications in Image Analysis

Low-rank modeling generally refers to a class of methods that solve problems by representing variables of interest as low-rank matrices. It has achieved great success in various fields including computer vision, data mining, signal processing and bioinformatics. Recently, much progress has been made in theories, algorithms and applications of low-rank modeling, such as exact low-rank matrix recovery via convex programming and matrix completion applied to collaborative filtering. These advances have brought more and more attentions to this topic. In this paper, we review the recent advance of low-rank modeling, the state-of-the-art algorithms, and related applications in image analysis. We first give an overview to the concept of low-rank modeling and challenging problems in this area. Then, we summarize the models and algorithms for low-rank matrix recovery and illustrate their advantages and limitations with numerical experiments. Next, we introduce a few applications of low-rank modeling in the context of image analysis. Finally, we conclude this paper with some discussions.

preprint2013arXiv

A Graph Theoretic Approach to Utilizing Protein Structure to Identify Non-Random Somatic Mutations

Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to some recent pharmacological successes in treating these driver mutations and their resulting tumors, a variety of methods have been developed to identify potential driver mutations using methods such as machine learning and mutational clustering. We propose a novel methodology that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph theoretical approach. Results: We have designed and implemented GraphPAC (Graph Protein Amino Acid Clustering) to identify mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory to account for the tertiary structure, GraphPAC identifies clusters in DPP4, NRP1 and other proteins not identified by existing methods. The R package is available at: http://bioconductor.org/packages/release/bioc/html/GraphPAC.html Conclusion: GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying potential activating driver mutations by utilizing a graph theoretic approach when considering protein tertiary structure.

preprint2013arXiv

A Penalized Multi-trait Mixed Model for Association Mapping in Pedigree-based GWAS

In genome-wide association studies (GWAS), penalization is an important approach for identifying genetic markers associated with trait while mixed model is successful in accounting for a complicated dependence structure among samples. Therefore, penalized linear mixed model is a tool that combines the advantages of penalization approach and linear mixed model. In this study, a GWAS with multiple highly correlated traits is analyzed. For GWAS with multiple quantitative traits that are highly correlated, the analysis using traits marginally inevitably lose some essential information among multiple traits. We propose a penalized-MTMM, a penalized multivariate linear mixed model that allows both the within-trait and between-trait variance components simultaneously for multiple traits. The proposed penalized-MTMM estimates variance components using an AI-REML method and conducts variable selection and point estimation simultaneously using group MCP and sparse group MCP. Best linear unbiased predictor (BLUP) is used to find predictive values and the Pearson's correlations between predictive values and their corresponding observations are used to evaluate prediction performance. Both prediction and selection performance of the proposed approach and its comparison with the uni-trait penalized-LMM are evaluated through simulation studies. We apply the proposed approach to a GWAS data from Genetic Analysis Workshop (GAW) 18.

preprint2013arXiv

A Spatial Simulation Approach to Account for Protein Structure When Identifying Non-Random Somatic Mutations

Background: Current research suggests that a small set of "driver" mutations are responsible for tumorigenesis while a larger body of "passenger" mutations occurs in the tumor but does not progress the disease. Due to recent pharmacological successes in treating cancers caused by driver mutations, a variety of of methodologies that attempt to identify such mutations have been developed. Based on the hypothesis that driver mutations tend to cluster in key regions of the protein, the development of cluster identification algorithms has become critical. Results: We have developed a novel methodology, SpacePAC (Spatial Protein Amino acid Clustering), that identifies mutational clustering by considering the protein tertiary structure directly in 3D space. By combining the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC) and the spatial information in the Protein Data Bank (PDB), SpacePAC is able to identify novel mutation clusters in many proteins such as FGFR3 and CHRM2. In addition, SpacePAC is better able to localize the most significant mutational hotspots as demonstrated in the cases of BRAF and ALK. The R package is available on Bioconductor at: http://www.bioconductor.org/packages/release/bioc/html/SpacePAC.html Conclusion: SpacePAC adds a valuable tool to the identification of mutational clusters while considering protein tertiary structure

preprint2013arXiv

Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model

A tuning-free procedure is proposed to estimate the covariate-adjusted Gaussian graphical model. For each finite subgraph, this estimator is asymptotically normal and efficient. As a consequence, a confidence interval can be obtained for each edge. The procedure enjoys easy implementation and efficient computation through parallel estimation on subgraphs or edges. We further apply the asymptotic normality result to perform support recovery through edge-wise adaptive thresholding. This support recovery procedure is called ANTAC, standing for Asymptotically Normal estimation with Thresholding after Adjusting Covariates. ANTAC outperforms other methodologies in the literature in a range of simulation studies. We apply ANTAC to identify gene-gene interactions using an eQTL dataset. Our result achieves better interpretability and accuracy in comparison with CAMPE.

preprint2013arXiv

Improving genetic risk prediction by leveraging pleiotropy

An important task of human genetics studies is to accurately predict disease risks in individuals based on genetic markers, which allows for identifying individuals at high disease risks, and facilitating their disease treatment and prevention. Although hundreds of genome-wide association studies (GWAS) have been conducted on many complex human traits in recent years, there has been only limited success in translating these GWAS data into clinically useful risk prediction models. The predictive capability of GWAS data is largely bottlenecked by the available training sample size due to the presence of numerous variants carrying only small to modest effects. Recent studies have shown that different human traits may share common genetic bases. Therefore, an attractive strategy to increase the training sample size and hence improve the prediction accuracy is to integrate data of genetically correlated phenotypes. Yet the utility of genetic correlation in risk prediction has not been explored in the literature. In this paper, we analyzed GWAS data for bipolar and related disorders (BARD) and schizophrenia (SZ) with a bivariate ridge regression method, and found that jointly predicting the two phenotypes could substantially increase prediction accuracy as measured by the AUC (area under the receiver operating characteristic curve). We also found similar prediction accuracy improvements when we jointly analyzed GWAS data for Crohn's disease (CD) and ulcerative colitis (UC). The empirical observations were substantiated through our comprehensive simulation studies, suggesting that a gain in prediction accuracy can be obtained by combining phenotypes with relatively high genetic correlations. Through both real data and simulation studies, we demonstrated pleiotropy as a valuable asset that opens up a new opportunity to improve genetic risk prediction in the future.

preprint2013arXiv

Utilizing Protein Structure to Identify Non-Random Somatic Mutations

Motivation: Human cancer is caused by the accumulation of somatic mutations in tumor suppressors and oncogenes within the genome. In the case of oncogenes, recent theory suggests that there are only a few key "driver" mutations responsible for tumorigenesis. As there have been significant pharmacological successes in developing drugs that treat cancers that carry these driver mutations, several methods that rely on mutational clustering have been developed to identify them. However, these methods consider proteins as a single strand without taking their spatial structures into account. We propose a new methodology that incorporates protein tertiary structure in order to increase our power when identifying mutation clustering. Results: We have developed a novel algorithm, iPAC: identification of Protein Amino acid Clustering, for the identification of non-random somatic mutations in proteins that takes into account the three dimensional protein structure. By using the tertiary information, we are able to detect both novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of clustering based on existing methods. For example, by combining the data in the Protein Data Bank (PDB) and the Catalogue of Somatic Mutations in Cancer, our algorithm identifies new mutational clusters in well known cancer proteins such as KRAS and PI3KCa. Further, by utilizing the tertiary structure, our algorithm also identifies clusters in EGFR, EIF2AK2, and other proteins that are not identified by current methodology.

preprint2011arXiv

Bayesian hierarchical modeling for signaling pathway inference from single cell interventional data

Recent technological advances have made it possible to simultaneously measure multiple protein activities at the single cell level. With such data collected under different stimulatory or inhibitory conditions, it is possible to infer the causal relationships among proteins from single cell interventional data. In this article we propose a Bayesian hierarchical modeling framework to infer the signaling pathway based on the posterior distributions of parameters in the model. Under this framework, we consider network sparsity and model the existence of an association between two proteins both at the overall level across all experiments and at each individual experimental level. This allows us to infer the pairs of proteins that are associated with each other and their causal relationships. We also explicitly consider both intrinsic noise and measurement error. Markov chain Monte Carlo is implemented for statistical inference. We demonstrate that this hierarchical modeling can effectively pool information from different interventional experiments through simulation studies and real data analysis.

preprint2010arXiv

Asymptotic efficiency and finite-sample properties of the generalized profiling estimation of parameters in ordinary differential equations

Ordinary differential equations (ODEs) are commonly used to model dynamic behavior of a system. Because many parameters are unknown and have to be estimated from the observed data, there is growing interest in statistics to develop efficient estimation procedures for these parameters. Among the proposed methods in the literature, the generalized profiling estimation method developed by Ramsay and colleagues is particularly promising for its computational efficiency and good performance. In this approach, the ODE solution is approximated with a linear combination of basis functions. The coefficients of the basis functions are estimated by a penalized smoothing procedure with an ODE-defined penalty. However, the statistical properties of this procedure are not known. In this paper, we first give an upper bound on the uniform norm of the difference between the true solutions and their approximations. Then we use this bound to prove the consistency and asymptotic normality of this estimation procedure. We show that the asymptotic covariance matrix is the same as that of the maximum likelihood estimation. Therefore, this procedure is asymptotically efficient. For a fixed sample and fixed basis functions, we study the limiting behavior of the approximation when the smoothing parameter tends to infinity. We propose an algorithm to choose the smoothing parameters and a method to compute the deviation of the spline approximation from solution without solving the ODEs.

Hongyu Zhao

What is connected

Connect this record

See the researcher in context

Building this map preview

31 published item(s)

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Improve Power of Knockoffs with Annotation Information of Covariates

Statistical Inference of Cell-type Proportions Estimated from Bulk Expression Data

A general kernel boosting framework integrating pathways for predictive modeling based on genomic data

Variance Estimation and Confidence Intervals from High-dimensional Genome-wide Association Studies Through Misspecified Mixed Model Analysis

A Hidden Markov Model Based Unsupervised Algorithm for Sleep/Wake Identification Using Actigraphy

A set of efficient methods to generate high-dimensional binary data with specified correlation structures

Inference of Dynamic Graph Changes for Functional Connectome

A Bayesian Semiparametric Factor Analysis Model for Subtype Identification

A Dirichlet Process Mixture Model for Clustering Longitudinal Gene Expression Data

Distance-Correlation based Gene Set Analysis in Longitudinal Studies

Joint Models for Time-to-Event Data and Longitudinal Biomarkers of High Dimension

Ultrahigh Dimensional Feature Selection via Kernel Canonical Correlation Analysis

A Markov random field-based approach to characterizing human brain development using spatial-temporal transcriptome data

NBLDA: Negative Binomial Linear Discriminant Analysis for RNA-Seq Data

On Joint Estimation of Gaussian Graphical Models for Spatial and Temporal Data

Posterior Contraction Rates of the Phylogenetic Indian Buffet Processes

Change Point Analysis of Histone Modifications Reveals Epigenetic Blocks Linking to Physical Domains

Detection boundary and Higher Criticism approach for rare and weak genetic effects

Estimation for ultra-high dimensional factor model: a pivotal variable detection based approach

GPA: A statistical approach to prioritizing GWAS results by integrating pleiotropy information and annotation data

High-dimensional genome-wide association study and misspecified mixed model analysis

Low-Rank Modeling and Its Applications in Image Analysis

A Graph Theoretic Approach to Utilizing Protein Structure to Identify Non-Random Somatic Mutations

A Penalized Multi-trait Mixed Model for Association Mapping in Pedigree-based GWAS

A Spatial Simulation Approach to Account for Protein Structure When Identifying Non-Random Somatic Mutations

Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model

Improving genetic risk prediction by leveraging pleiotropy

Utilizing Protein Structure to Identify Non-Random Somatic Mutations

Bayesian hierarchical modeling for signaling pathway inference from single cell interventional data

Asymptotic efficiency and finite-sample properties of the generalized profiling estimation of parameters in ordinary differential equations