Researcher profile

Hongzhe Li

Hongzhe Li contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
7 topics
4 close collaborators

Published work

8 published items

preprint, 2023, arXiv

HT-MMIOW: A Hypothesis Test approach for Microbiome Mediation using Inverse Odds Weighting

The human microbiome has an important role in determining health. Mediation analyses quantify the contribution of the microbiome in the causal path between exposure and disease; however, current mediation models cannot fully capture the high dimensional, correlated, and compositional nature of microbiome data and do not typically accommodate dichotomous outcomes. We propose a novel approach that uses inverse odds weighting to test for the mediating effect of the microbiome. We use simulation to demonstrate that our approach gains power for high dimensional mediators, and it is agnostic to the effect of interactions between the exposure and mediators. Our application to infant gut microbiome data from the New Hampshire Birth Cohort Study revealed a mediating effect of the 6-week infant gut microbiome on the relationship between maternal antibiotic use during pregnancy and incidence of childhood allergy by 5 years of age.
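
As a rough illustration of the weighting idea named in the abstract above, the sketch below computes generic inverse odds weights and uses them to split a total risk difference into direct and indirect parts. It is not the HT-MMIOW test itself, whose details the abstract does not give. The function names are hypothetical, and the exposure probabilities (the chance of being exposed given mediators and covariates) are assumed to come from a model fitted elsewhere.

```python
def inverse_odds_weights(exposure, prob_exposed):
    """Return 1/odds of exposure for exposed units and 1.0 for unexposed units.

    prob_exposed[i] is an assumed, externally fitted estimate of
    P(exposed | mediators, covariates) for unit i.
    """
    weights = []
    for a, p in zip(exposure, prob_exposed):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
        weights.append((1.0 - p) / p if a == 1 else 1.0)
    return weights


def weighted_risk_difference(exposure, outcome, weights):
    """Weighted mean outcome among the exposed minus that among the unexposed."""
    def weighted_mean(group):
        num = sum(w * y for a, y, w in zip(exposure, outcome, weights) if a == group)
        den = sum(w for a, w in zip(exposure, weights) if a == group)
        return num / den
    return weighted_mean(1) - weighted_mean(0)


# Toy data, invented purely for illustration.
exposure = [1, 1, 0, 0]
outcome = [1, 0, 0, 0]
prob_exposed = [0.8, 0.5, 0.5, 0.2]

total = weighted_risk_difference(exposure, outcome, [1.0] * len(exposure))
direct = weighted_risk_difference(
    exposure, outcome, inverse_odds_weights(exposure, prob_exposed)
)
indirect = total - direct  # the part attributed to the mediators
```

Because the mediators enter only through the fitted exposure model, this style of weighting avoids modelling each mediator separately, which is one reason it is attractive when the mediator is high dimensional.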

preprint, 2022, arXiv

Estimation and Inference with Proxy Data and its Genetic Applications

Existing high-dimensional statistical methods are largely established for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models where we only observe "proxy data", which include the marginal statistics and sample covariance matrix that are computed based on different sets of individuals. We develop a rate optimal method for estimation and inference for the regression coefficient vector and its linear functionals based on the proxy data. Moreover, we show the intrinsic limitations in the proxy-data based inference: the minimax optimal rate for estimation is slower than that in the conventional case where individual data are observed; the power for testing and multiple testing does not go to one as the signal strength goes to infinity. These interesting findings are illustrated through simulation studies and an analysis of a dataset concerning the genetic associations of hindlimb muscle weight in a mouse population.

preprint, 2022, arXiv

Mediation analysis with densities as mediators with an application to iCOMPARE trial

Physical activity has long been shown to be associated with biological and physiological performance and risk of diseases. It is of great interest to assess whether the effect of an exposure or intervention on an outcome is mediated through physical activity measured by modern wearable devices such as actigraphy. However, existing methods for mediation analysis focus almost exclusively on mediators that lie in Euclidean space and cannot be applied directly to actigraphy data on physical activity. Such data are best summarized in the form of a histogram or density. In this paper, we extend structural equation models (SEMs) to settings where a density is treated as the mediator to study the indirect mediation effect of physical activity on an outcome. We provide sufficient conditions for identifying the average causal effects of a density mediator and present methods for estimating the direct and mediating effects of a density on an outcome. We apply our method to the data set from the iCOMPARE trial that compares flexible duty-hour policies and standard duty-hour policies on interns' sleep-related outcomes to explore the mediation effect of physical activity on the causal path between flexible duty-hour policies and sleep-related outcomes.

preprint, 2022, arXiv

Truncated Rank-Based Tests for Two-Part Models with Excessive Zeros and Applications to Microbiome Data

High-throughput sequencing technology allows us to test the compositional difference of bacteria in different populations. One important feature of human microbiome data is that it often includes a large number of zeros. Such data can be treated as being generated from a two-part model that includes a zero point-mass. Motivated by analysis of such non-negative data with excessive zeros, we introduce several truncated rank-based two-group and multi-group tests, including a truncated rank-based Wilcoxon rank-sum test for two-group comparison and two truncated Kruskal-Wallis tests for multi-group comparison. We show both analytically through asymptotic relative efficiency analysis and by simulations that the proposed tests have higher power than the standard rank-based tests, especially when the proportion of zeros in the data is high. The tests can also be applied to repeated measurements of compositional data via simple within-subject permutations. In a simple before-and-after treatment experiment, the within-subject permutation is similar to the paired rank test. However, the proposed tests handle the excessive zeros, which leads to better power. We apply the tests to the analysis of a gut microbiome data set to compare the microbiome compositions of healthy and pediatric Crohn's disease patients and to assess the treatment effects on microbiome compositions. We identify several bacterial genera that are missed by the standard rank-based tests.

preprint, 2021, arXiv

Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach

Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population.

preprint, 2021, arXiv

Information content of high-order associations of the human gut microbiota network

The human gastrointestinal tract is an environment that hosts an ecosystem of microorganisms essential to human health. Vital biological processes emerge from fundamental inter- and intra-species molecular interactions that influence the assembly and composition of the gut microbiota ecology. Here we quantify the complexity of the ecological relationships within the human infant gut microbiota ecosystem as a function of the information contained in the nonlinear associations of a sequence of increasingly-specified maximum entropy representations of the system. Our paradigm frames the ecological state, in terms of the presence or absence of individual microbial ecological units that are identified by amplicon sequence variants (ASV) in the gut microenvironment, as a function of both the ecological states of its neighboring units and, in a departure from standard graphical model representations, the associations among the units within its neighborhood. We characterize the order of the system based on the relative quantity of statistical information encoded by high-order statistical associations of the infant gut microbiota.

preprint, 2020, arXiv

Optimal Permutation Recovery in Permuted Monotone Matrix Model

Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model $Y=\Theta\Pi+Z$, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, $\Theta$ is an unknown mean matrix with monotone entries for each row, $\Pi$ is a permutation matrix that permutes the columns of $\Theta$, and $Z$ is a noise matrix. This paper studies the problem of estimation/recovery of $\Pi$ given the observed noisy matrix $Y$. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall's tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between responders and non-responders among IBD patients after 8 weeks of treatment.
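
To make the model in the abstract above concrete, the following sketch simulates a permuted monotone matrix and recovers the column order by ranking column means. Ranking by column means is a deliberately simple stand-in (a projection with equal weights), not the best-linear-projection estimator proposed in the paper. All function names and the simulation settings are hypothetical.

```python
import random


def simulate(n_rows, n_cols, noise_sd, rng):
    """Simulate Y = Theta * Pi + Z, with each row of Theta strictly increasing."""
    theta = []
    for _row in range(n_rows):
        level, row = 0.0, []
        for _col in range(n_cols):
            level += rng.uniform(0.5, 1.5)  # positive increments keep the row monotone
            row.append(level)
        theta.append(row)
    perm = list(range(n_cols))
    rng.shuffle(perm)  # perm[j] is the Theta column shown at position j of Y
    y = [
        [row[perm[j]] + rng.gauss(0.0, noise_sd) for j in range(n_cols)]
        for row in theta
    ]
    return y, perm


def recover_order(y):
    """Estimate perm by ranking the columns of Y by their means."""
    n_rows, n_cols = len(y), len(y[0])
    col_means = [sum(y[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: col_means[j])
    estimate = [0] * n_cols
    for rank, j in enumerate(order):
        estimate[j] = rank
    return estimate


def normalized_kendall_tau(a, b):
    """Fraction of index pairs that the rankings a and b order differently."""
    n = len(a)
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)


rng = random.Random(0)
y_clean, perm_clean = simulate(5, 8, 0.0, rng)   # noise-free case
y_noisy, perm_noisy = simulate(5, 8, 1.0, rng)   # noisy case
exact_recovery = recover_order(y_clean) == perm_clean
partial_error = normalized_kendall_tau(recover_order(y_noisy), perm_noisy)
```

The two summaries at the end mirror the two criteria in the abstract: `exact_recovery` corresponds to the 0-1 loss, and `partial_error` to the normalized Kendall's tau distance.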

preprint, 2020, arXiv

Transfer Learning for High-dimensional Linear Regression: Prediction, Estimation, and Minimax Optimality

This paper considers the estimation and prediction of a high-dimensional linear regression in the setting of transfer learning, using samples from the target model as well as auxiliary samples from different but possibly related regression models. When the set of "informative" auxiliary samples is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. In the case that the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and reveal its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating the data from multiple different tissues as auxiliary samples.
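
One way to picture transfer with a known set of informative auxiliary samples is a two-step fit: a lasso on the pooled target and auxiliary data, followed by a second lasso on the target residuals to correct the pooled estimate. The abstract above does not spell out its procedure, so this two-step structure, the penalty levels, and every function name below are assumptions for illustration. This is not the data-driven Trans-Lasso procedure for the unknown-set case.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(x, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - x b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(x), len(x[0])
    beta = [0.0] * p
    residual = list(y)  # always equal to y - x b for the current b
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(x[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(x[i][j] * (residual[i] + x[i][j] * beta[j]) for i in range(n)) / n
            new_value = soft_threshold(rho, lam) / z
            step = new_value - beta[j]
            if step != 0.0:
                for i in range(n):
                    residual[i] -= x[i][j] * step
                beta[j] = new_value
    return beta


def two_step_transfer(x_target, y_target, x_aux, y_aux, lam_pool, lam_correct):
    """Pooled lasso fit, then a lasso correction fitted on target residuals only."""
    pooled = lasso(x_target + x_aux, y_target + y_aux, lam_pool)
    fitted = [sum(v * w for v, w in zip(row, pooled)) for row in x_target]
    resid = [yi - fi for yi, fi in zip(y_target, fitted)]
    correction = lasso(x_target, resid, lam_correct)
    return [w + d for w, d in zip(pooled, correction)]


# Tiny orthogonal design, invented so the result can be checked by hand.
x_target = [[1.0, 0.0], [0.0, 1.0]]
y_target = [3.0, 1.0]
x_aux = [[1.0, 0.0], [0.0, 1.0]]
y_aux = [3.0, 0.0]
beta_hat = two_step_transfer(x_target, y_target, x_aux, y_aux, 0.25, 0.1)
```

In this toy case the pooled fit is `[2.5, 0.0]` and the correction step moves it to roughly `[2.8, 0.8]`, closer to the target responses `[3.0, 1.0]` in both coordinates.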