Researcher profile

Guanyu Hu

Guanyu Hu contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
12 works
0 followers
8 topics
4 close collaborators

Published work

12 published item(s)

preprint · 2026 · arXiv

Fair Domain Generalization: An Information-Theoretic View

Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.

preprint · 2022 · arXiv

Multidimensional heterogeneity learning for count value tensor data with applications to field goal attempt analysis of NBA players

We propose a multidimensional tensor clustering approach for studying how professional basketball players' shooting patterns vary over court locations and game time. Unlike most existing methods, which either study only continuous-valued tensors or must assume the same cluster structure along different tensor directions, we propose a Bayesian nonparametric model that handles count-valued tensors and projects the heterogeneity among players onto tensor dimensions while allowing cluster structures to differ across directions. Our method is fully probabilistic and hence allows simultaneous inference on both the number of clusters and the cluster configurations. We present an efficient posterior sampling method and establish the large-sample convergence properties of the posterior distribution. Simulation studies demonstrate excellent empirical performance of the proposed method. Finally, an application to shot chart data collected from 191 NBA players during the 2017-2018 regular season reveals several interesting insights for basketball analytics.

preprint · 2021 · arXiv

Bayesian Spatial Homogeneity Pursuit for Survival Data with an Application to the SEER Respiratory Cancer Data

In this work, we propose a new Bayesian spatial homogeneity pursuit method for survival data under the proportional hazards model to detect spatially clustered patterns in baseline hazard and regression coefficients. Specifically, the regression coefficients and baseline hazard are assumed to exhibit spatially homogeneous patterns. To capture such homogeneity, we develop a geographically weighted Chinese restaurant process prior to simultaneously estimate coefficients and baseline hazards and their uncertainty measures. An efficient Markov chain Monte Carlo (MCMC) algorithm is designed for the proposed method. Its performance is evaluated using simulated data, and the method is further applied to a real data analysis of respiratory cancer in the state of Louisiana.

preprint · 2020 · arXiv

A comparison of Bayesian accelerated failure time models with spatially varying coefficients

The accelerated failure time (AFT) model is a commonly used tool in analyzing survival data. In public health studies, data is often collected from medical service providers in different locations. Survival rates from different locations often present geographically varying patterns. In this paper, we focus on the accelerated failure time model with spatially varying coefficients. We compare three types of priors for spatially varying coefficients. A model selection criterion, the logarithm of the pseudo-marginal likelihood (LPML), is developed to assess the fit of the AFT model with different priors. Extensive simulation studies are carried out to examine the empirical performance of the proposed methods. Finally, we apply our model to SEER data on prostate cancer in Louisiana and demonstrate the existence of spatially varying effects on survival rates from prostate cancer data.
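
As a rough illustration of the LPML criterion mentioned in this abstract, the sketch below uses the standard Monte Carlo estimator based on conditional predictive ordinates (CPO), where each CPO is the harmonic mean of the per-observation likelihood over posterior draws. The function names and the `loglik_draws` layout are assumptions made for this example; the paper's own estimator for the AFT setting may differ in detail.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def lpml(loglik_draws):
    """LPML from loglik_draws[m][i] = log f(y_i | theta_m) for posterior draw m.

    CPO_i is the harmonic mean over draws of f(y_i | theta_m), so
    log CPO_i = -log_mean_exp(-log f(y_i | theta_m)); LPML sums log CPO_i.
    """
    n_obs = len(loglik_draws[0])
    total = 0.0
    for i in range(n_obs):
        negated = [-draw[i] for draw in loglik_draws]
        total += -log_mean_exp(negated)
    return total
```

A larger LPML indicates better leave-one-out predictive fit, which is how such a criterion would typically be used to rank competing priors.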

preprint · 2020 · arXiv

A Nonparametric Bayesian Item Response Modeling Approach for Clustering Items and Individuals Simultaneously

Item response theory (IRT) is a popular modeling paradigm for measuring subject latent traits and item properties according to discrete responses in tests or questionnaires. There has been very limited discussion of heterogeneity pattern detection for both items and individuals. In this paper, we introduce a nonparametric Bayesian approach for clustering items and individuals simultaneously under the Rasch model. Specifically, our proposed method is based on the mixture of finite mixtures (MFM) model. MFM obtains the number of clusters and the clustering configurations for both items and individuals simultaneously. The performance of parameter estimation and parameter clustering under the MFM Rasch model is evaluated by simulation studies, and a real data set is used to illustrate the MFM Rasch modeling.

preprint · 2020 · arXiv

Bayesian Hierarchical Spatial Regression Models for Spatial Data in the Presence of Missing Covariates with Applications

In many applications, survey data are collected from different survey centers in different regions. It happens that in some circumstances, response variables are completely observed while the covariates have missing values. In this paper, we propose a joint spatial regression model for the response variable and missing covariates via a sequence of one-dimensional conditional spatial regression models. We further construct a joint spatial model for missing covariate data mechanisms. The properties of the proposed models are examined and a Markov chain Monte Carlo sampling algorithm is used to sample from the posterior distribution. In addition, the Bayesian model comparison criteria, the modified Deviance Information Criterion (mDIC) and the modified Logarithm of the Pseudo-Marginal Likelihood (mLPML), are developed to assess the fit of spatial regression models for spatial data. Extensive simulation studies are carried out to examine empirical performance of the proposed methods. We further apply the proposed methodology to analyze a real data set from a Chinese Health and Nutrition Survey (CHNS) conducted in 2011.

preprint · 2020 · arXiv

Geographically Weighted Regression Analysis for Spatial Economics Data: a Bayesian Recourse

Geographically weighted regression (GWR) is a well-known statistical approach for exploring spatial non-stationarity of the regression relationship in spatial data analysis. In this paper, we discuss a Bayesian recourse of GWR. We cover Bayesian variable selection based on a spike-and-slab prior, bandwidth selection based on a range prior, and model assessment using a modified deviance information criterion and a modified logarithm of pseudo-marginal likelihood. The use of graph distance in modeling areal data is also introduced. Extensive simulation studies are carried out to examine the empirical performance of the proposed methods in scenarios with both small and large numbers of locations, and a comparison with the classical frequentist GWR is made. The performance of variable selection and estimation of the proposed methodology under different circumstances is satisfactory. We further apply the proposed methodology to the analysis of province-level macroeconomic data from 30 selected provinces in China. The estimation and variable selection results reveal insights about China's economy that are convincing and agree with previous studies and facts.

preprint · 2020 · arXiv

Heterogeneity Learning for SIRS model: an Application to the COVID-19

We propose a Bayesian heterogeneity learning approach for the Susceptible-Infected-Removal-Susceptible (SIRS) model that allows underlying clustering patterns for the transmission rate, recovery rate, and loss of immunity rate of the latest coronavirus (COVID-19) among different regions. Our proposed method provides simultaneous inference on parameter estimation and clustering information, which contains both the number of clusters and the cluster configurations. Specifically, our key idea is to formulate the SIRS model in a hierarchical form and assign mixture of finite mixtures priors for heterogeneity learning. The properties of the proposed models are examined and a Markov chain Monte Carlo sampling algorithm is used to sample from the posterior distribution. Extensive simulation studies are carried out to examine the empirical performance of the proposed methods. We further apply the proposed methodology to analyze state-level COVID-19 data in the U.S.
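
For readers unfamiliar with the three rates named in this abstract, the following is a minimal sketch of the textbook deterministic SIRS dynamics, stepped with forward Euler. It is not the paper's hierarchical Bayesian formulation; the parameter names `beta` (transmission), `gamma` (recovery) and `xi` (loss of immunity), and the helper names, are assumptions chosen for this example.

```python
def sirs_step(s, i, r, beta, gamma, xi, dt=1.0):
    """One forward-Euler step of the deterministic SIRS equations."""
    n = s + i + r
    new_infections = beta * s * i / n * dt  # S -> I
    recoveries = gamma * i * dt             # I -> R
    immunity_loss = xi * r * dt             # R -> S
    return (s - new_infections + immunity_loss,
            i + new_infections - recoveries,
            r + recoveries - immunity_loss)


def simulate(s0, i0, r0, beta, gamma, xi, steps, dt=1.0):
    """Return the list of (S, I, R) states, including the initial state."""
    path = [(s0, i0, r0)]
    for _ in range(steps):
        path.append(sirs_step(*path[-1], beta, gamma, xi, dt))
    return path
```

In a heterogeneity-learning setting, each region would carry its own `(beta, gamma, xi)` triple, and regions sharing a cluster would share those values.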

preprint · 2020 · arXiv

Heterogeneous Regression Models for Clusters of Spatial Dependent Data

In economic development, there are often regions that share similar economic characteristics, and economic models on such regions tend to have similar covariate effects. In this paper, we propose a Bayesian clustered regression for spatially dependent data in order to detect clusters in the covariate effects. Our proposed method is based on the Dirichlet process, which provides a probabilistic framework for simultaneous inference of the number of clusters and the clustering configurations. The use of our method is illustrated both in simulation studies and in an application to a housing cost dataset from Georgia.
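
To show why a Dirichlet process prior leaves the number of clusters free, here is a small sketch that draws a random partition from the Chinese restaurant process, the partition distribution a Dirichlet process induces. It illustrates the prior only, not the paper's spatial regression model or its posterior sampler; `sample_crp` and the concentration parameter name `alpha` are assumptions for this example.

```python
import random


def sample_crp(n, alpha, rng):
    """Draw cluster labels for n items from a Chinese restaurant process."""
    labels, sizes = [], []
    for _ in range(n):
        # An existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, sizes
```

Calling `sample_crp(50, 1.0, random.Random(0))` returns one partition of 50 items; repeated draws with different seeds give different numbers of clusters, and larger `alpha` tends to produce more of them.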

preprint · 2020 · arXiv

Most Likely Optimal Subsampled Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) requires evaluating the full-data likelihood at different parameter values iteratively and is often computationally infeasible for large data sets. In this paper, we propose to approximate the log-likelihood with subsamples taken according to nonuniform subsampling probabilities, and derive the most likely optimal (MLO) subsampling probabilities for better approximation. Compared with the existing subsampled MCMC algorithm with equal subsampling probabilities, our MLO subsampled MCMC has a higher estimation efficiency with the same subsampling ratio. We also derive a formula using the asymptotic distribution of the subsampled log-likelihood to determine the required subsample size in each MCMC iteration for a given level of precision. This formula is used to develop an adaptive version of the MLO subsampled MCMC algorithm. Numerical experiments demonstrate that the proposed method outperforms the uniform subsampled MCMC.
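
The idea of approximating a full-data log-likelihood from a weighted subsample can be sketched with the generic inverse-probability (Hansen-Hurwitz) estimator below. The abstract does not give the MLO probabilities, so `probs` here is whatever the caller supplies; the function names and the unit-variance normal model are assumptions for illustration only.

```python
import math
import random


def normal_loglik_terms(data, mu):
    """Per-observation log-density under N(mu, 1)."""
    return [-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in data]


def subsampled_loglik(terms, probs, r, rng):
    """Estimate sum(terms) from r draws taken with replacement.

    Index i is drawn with probability probs[i], and each drawn term is
    reweighted by 1 / probs[i], which keeps the estimate unbiased.
    """
    idx = rng.choices(range(len(terms)), weights=probs, k=r)
    return sum(terms[i] / probs[i] for i in idx) / r
```

With equal probabilities `1 / n` this reduces to the uniform subsampling baseline; nonuniform probabilities change only the variance of the estimate, which is the quantity an optimal subsampling scheme aims to reduce.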

preprint · 2020 · arXiv

Spatial homogeneity learning for spatially correlated functional data with application to COVID-19 Growth rate curves

We study the spatial heterogeneity effect on regional COVID-19 pandemic timing and severity by analyzing the COVID-19 growth rate curves in the United States. We propose a geographically detailed functional data grouping method equipped with a functional conditional autoregressive (CAR) prior to fully capture the spatial correlation in the pandemic curves. The spatial homogeneity pattern can then be detected by a geographically weighted Chinese restaurant process prior, which allows both locally spatially contiguous groups and globally discontiguous groups. We design an efficient Markov chain Monte Carlo (MCMC) algorithm to simultaneously infer the posterior distributions of the number of groups and the grouping configuration of spatial functional data. The superior numerical performance of the proposed method over competing methods is demonstrated using simulation studies and an application to state-level and county-level COVID-19 data in the United States.

preprint · 2019 · arXiv

Geographically Weighted Cox Regression for Prostate Cancer Survival Data in Louisiana

The Cox proportional hazards model is one of the most popular tools for analyzing time-to-event data in public health studies. When outcomes observed in clinical data from different regions yield a varying pattern correlated with location, it is often of great interest to investigate spatially varying effects of covariates. In this paper, we propose a geographically weighted Cox regression model for sparse spatial survival data. In addition, a stochastic neighborhood weighting scheme is introduced at the county level. Theoretical properties of the proposed geographically weighted estimators are examined in detail. A model selection scheme based on Takeuchi's model robust information criterion (TIC) is discussed. Extensive simulation studies are carried out to examine the empirical performance of the proposed methods. We further apply the proposed methodology to analyze real data on prostate cancer from the Surveillance, Epidemiology, and End Results cancer registry for the state of Louisiana.