Source author record

Xianyang Zhang

Xianyang Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Methodology, math.ST, Statistics Theory, Machine Learning

Catalog footprint

What is connected

12 works

4 topics

4 close collaborators

Preprint, 2026, arXiv

A Modern Theory for High-dimensional Cox Regression Models

The proportional hazards model has been extensively used in many fields such as biomedicine to estimate and perform statistical significance testing on the effects of covariates influencing the survival time of patients. The classical theory of maximum partial-likelihood estimation (MPLE) is used by most software packages to produce inference, e.g., the coxph function in R and the PHREG procedure in SAS. In this paper, we investigate the asymptotic behavior of the MPLE in the regime in which the number of parameters p is of the same order as the number of samples n. The main results are (i) existence of the MPLE undergoes a sharp 'phase transition'; (ii) the classical MPLE theory leads to invalid inference in the high-dimensional regime. We show that the asymptotic behavior of the MPLE is governed by a new asymptotic theory. These findings are further corroborated through numerical studies. The main technical tool in our proofs is the Convex Gaussian Min-max Theorem (CGMT), which has not been previously used in the analysis of partial likelihood. Our results thus extend the scope of CGMT and shed new light on the use of CGMT for examining the existence of MPLE and non-separable objective functions.
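
As a point of reference for the objective that the MPLE maximizes, the following is a minimal sketch of the negative log partial likelihood for the proportional hazards model, assuming no tied event times. The function name and the argument layout (`times`, `events`, `X`) are illustrative choices, not taken from the paper, and the sketch does not implement any of the paper's high-dimensional corrections.

```python
import math


def neg_log_partial_likelihood(beta, times, events, X):
    """Negative log partial likelihood for a Cox model (no tied event times).

    beta   : list of p coefficients
    times  : list of n observed times
    events : list of n indicators (1 = event observed, 0 = censored)
    X      : list of n covariate rows, each of length p
    """
    # Linear predictor x_i' beta for every subject.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects still under observation at time t_i.
        risk = [eta[j] for j in range(len(times)) if times[j] >= t_i]
        # Log-sum-exp, shifted by the maximum for numerical stability.
        m = max(risk)
        log_sum = m + math.log(sum(math.exp(e - m) for e in risk))
        total -= eta[i] - log_sum
    return total
```

With `beta` set to zero, every subject in a risk set is equally likely to fail, so each event contributes the log of its risk-set size. Minimizing this function over `beta` gives the MPLE when it exists.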

Preprint, 2026, arXiv

Fair Regression under Demographic Parity: A Unified Framework

We propose a unified framework for fair regression tasks formulated as risk minimization problems subject to a demographic parity constraint. Unlike many existing approaches that are limited to specific loss functions or rely on challenging non-convex optimization, our framework is applicable to a broad spectrum of regression tasks. Examples include linear regression with squared loss, binary classification with cross-entropy loss, quantile regression with pinball loss, and robust regression with Huber loss. We derive a novel characterization of the fair risk minimizer, which yields a computationally efficient estimation procedure for general loss functions. Theoretically, we establish the asymptotic consistency of the proposed estimator and derive its convergence rates under mild assumptions. We illustrate the method's versatility through detailed discussions of several common loss functions. Numerical results demonstrate that our approach effectively minimizes risk while satisfying fairness constraints across various regression settings.
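
The four loss functions named in the abstract can be written down in a few lines. The sketch below uses their standard textbook definitions. The function names, the default `tau` and `delta`, and the `empirical_risk` helper are illustrative assumptions, and the demographic parity constraint and the paper's estimator are not implemented here.

```python
import math


def squared_loss(y, pred):
    # Linear regression: squared error.
    return (y - pred) ** 2


def cross_entropy_loss(y, prob):
    # Binary classification: y in {0, 1}, prob strictly between 0 and 1.
    return -(y * math.log(prob) + (1 - y) * math.log(1 - prob))


def pinball_loss(y, pred, tau=0.5):
    # Quantile regression at level tau: asymmetric absolute error.
    diff = y - pred
    return tau * diff if diff >= 0 else (tau - 1) * diff


def huber_loss(y, pred, delta=1.0):
    # Robust regression: quadratic near zero, linear in the tails.
    r = abs(y - pred)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def empirical_risk(loss, ys, preds, **kwargs):
    # Average loss over a sample (the unconstrained risk).
    return sum(loss(y, p, **kwargs) for y, p in zip(ys, preds)) / len(ys)
```

A call such as `empirical_risk(pinball_loss, ys, preds, tau=0.9)` evaluates the unconstrained risk. A fair estimator in the abstract's sense would minimize such a risk subject to the parity constraint.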

Preprint, 2023, arXiv

High Dimensional Analysis of Variance in Multivariate Linear Regression

In this paper, we develop a systematic theory for high dimensional analysis of variance in multivariate linear regression, where the dimension and the number of coefficients can both grow with the sample size. We propose a new U-type test statistic to test linear hypotheses and establish a high dimensional Gaussian approximation result under fairly mild moment assumptions. Our general framework and theory can be applied to deal with the classical one-way multivariate ANOVA and the nonparametric one-way MANOVA in high dimensions. To implement the test procedure in practice, we introduce a sample-splitting based estimator of the second moment of the error covariance and discuss its properties. A simulation study shows that our proposed test outperforms some existing tests in various settings.

Preprint, 2022, arXiv

LinDA: linear models for differential abundance analysis of microbiome compositional data

Differential abundance analysis is at the core of statistical analysis of microbiome data. The compositional nature of microbiome sequencing data makes false positive control challenging. Here, we show that the compositional effects can be addressed by a simple, yet highly flexible and scalable, approach. The proposed method, LinDA, only requires fitting linear regression models on the centered log-ratio transformed data, and correcting the bias due to compositional effects. We show that LinDA enjoys asymptotic FDR control and can be extended to mixed-effect models for correlated microbiome data. Using simulations and real examples, we demonstrate the effectiveness of LinDA.
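
The centered log-ratio (CLR) transform mentioned in the abstract subtracts the mean log abundance of a sample from each taxon's log abundance. The sketch below shows only that transform. The `pseudocount` used to handle zero counts is an assumption of this example, and the regression fit and the bias correction that LinDA applies afterwards are not shown.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    counts      : non-negative counts for each taxon in a single sample
    pseudocount : added to every count so that zeros have a finite log
                  (an assumption of this sketch)
    """
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)  # log of the geometric mean
    return [v - center for v in logs]
```

The transformed values sum to zero within each sample. With `pseudocount=0` they are unchanged when all counts are multiplied by a common factor, which is why the transform suits compositional data.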

Preprint, 2021, arXiv

Kernel-Distance-Based Covariate Balancing

A common concern in observational studies focuses on properly evaluating the causal effect, which usually refers to the average treatment effect or the average treatment effect on the treated. In this paper, we propose a data preprocessing method, the Kernel-distance-based covariate balancing, for observational studies with binary treatments. This proposed method yields a set of unit weights for the treatment and control groups, respectively, such that the reweighted covariate distributions can satisfy a set of pre-specified balance conditions. This preprocessing methodology can effectively reduce confounding bias of subsequent estimation of causal effects. We demonstrate the implementation and performance of Kernel-distance-based covariate balancing with Monte Carlo simulation experiments and a real data analysis.

Preprint, 2020, arXiv

Covariate Adaptive False Discovery Rate Control with Applications to Omics-Wide Multiple Testing

Conventional multiple testing procedures often assume hypotheses for different features are exchangeable. However, in many scientific applications, additional covariate information regarding the patterns of signals and nulls is available. In this paper, we introduce an FDR control procedure for large-scale inference problems that can incorporate covariate information. We develop a fast algorithm to implement the proposed procedure and prove its asymptotic validity even when the underlying model is misspecified and the p-values are weakly dependent (e.g., strong mixing). Extensive simulations are conducted to study the finite sample performance of the proposed method and we demonstrate that the new approach improves over the state-of-the-art approaches by being flexible, robust, powerful and computationally efficient. We finally apply the method to several omics datasets arising from genomics studies with the aim to identify omics features associated with some clinical and biological phenotypes. We show that the method is overall the most powerful among competing methods, especially when the signal is sparse. The proposed Covariate Adaptive Multiple Testing procedure is implemented in the R package CAMT.
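
For contrast, the Benjamini-Hochberg step-up procedure is a familiar example of a conventional method that treats all hypotheses exchangeably and ignores covariates. The sketch below implements that baseline only. It is not the CAMT procedure, and the function name is an illustrative choice.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(pvalues)
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose ordered p-value is at most k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

A covariate-adaptive procedure departs from this baseline by letting the rejection threshold vary across hypotheses according to the covariates.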

Preprint, 2016, arXiv

Simultaneous Inference for High-dimensional Linear Models

This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies.

Preprint, 2015, arXiv

Testing High Dimensional Mean Under Sparsity

Motivated by the likelihood ratio test under the Gaussian assumption, we develop a maximum sum-of-squares test for conducting hypothesis testing on a high dimensional mean vector. The proposed test, which incorporates the dependence among the variables, is designed to ease the computational burden and to maximize the asymptotic power in the likelihood ratio test. A simulation-based approach is developed to approximate the sampling distribution of the test statistic. The validity of the testing procedure is justified under both the null and alternative hypotheses. We further extend the main results to the two sample problem without the equal covariance assumption. Numerical results suggest that the proposed test can be more powerful than some existing alternatives.

Preprint, 2015, arXiv

Two sample inference for the second-order property of temporally dependent functional data

Motivated by the need to statistically quantify the difference between two spatio-temporal datasets that arise in climate downscaling studies, we propose new tests to detect the differences of the covariance operators and their associated characteristics of two functional time series. Our two sample tests are constructed on the basis of functional principal component analysis and self-normalization, the latter of which is a new studentization technique recently developed for the inference of a univariate time series. Compared to the existing tests, our SN-based tests allow for weak dependence within each sample and are robust to the dependence between the two samples in the case of equal sample sizes. Asymptotic properties of the SN-based test statistics are derived under both the null and local alternatives. Through extensive simulations, our SN-based tests are shown to outperform existing alternatives in size and their powers are found to be respectable. The tests are then applied to the gridded climate model outputs and interpolated observations to detect the difference in their spatial dynamics.

Preprint, 2014, arXiv

Bootstrapping High Dimensional Time Series

This article studies bootstrap inference for high dimensional weakly dependent time series in a general framework of approximately linear statistics. The following high dimensional applications are covered: (1) uniform confidence band for mean vector; (2) specification testing on the second order property of time series such as white noise testing and bandedness testing of covariance matrix; (3) specification testing on the spectral property of time series. In theory, we first derive a Gaussian approximation result for the maximum of a sum of weakly dependent vectors, where the dimension of the vectors is allowed to be exponentially larger than the sample size. In particular, we illustrate an interesting interplay between dependence and dimensionality, and also discuss one type of "dimension free" dependence structure. We further propose a blockwise multiplier (wild) bootstrap that works for time series with unknown autocovariance structure. These distributional approximation errors, which are finite sample valid, decrease polynomially in sample size. A non-overlapping block bootstrap is also studied as a more flexible alternative. The above results are established under the general physical/functional dependence framework proposed in Wu (2005). Our work can be viewed as a substantive extension of Chernozhukov et al. (2013) to time series based on a variant of Stein's method developed therein.
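
A generic blockwise multiplier bootstrap for the maximum of a normalized sum can be sketched as follows. This is a simplified illustration under stated assumptions (non-overlapping blocks of equal length, standard normal multipliers, a trailing partial block dropped), not the exact procedure or tuning from the paper. The names `blockwise_multiplier_bootstrap`, `block_size`, and `empirical_quantile` are hypothetical.

```python
import math
import random


def blockwise_multiplier_bootstrap(X, block_size, n_boot=500, seed=0):
    """Bootstrap draws of max_j |n^{-1/2} * sum_i (X_ij - mean_j)|.

    X          : list of n rows, each a list of p coordinates (time-ordered)
    block_size : number of consecutive observations per block
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_blocks = n // block_size
    if n_blocks < 1:
        raise ValueError("block_size must not exceed the sample size")
    means = [sum(row[j] for row in X) / n for j in range(p)]
    # Sums of centered observations within each non-overlapping block.
    block_sums = []
    for k in range(n_blocks):
        rows = X[k * block_size:(k + 1) * block_size]
        block_sums.append([sum(r[j] - means[j] for r in rows) for j in range(p)])
    scale = math.sqrt(n_blocks * block_size)
    draws = []
    for _ in range(n_boot):
        # One multiplier per block preserves dependence inside the block.
        e = [rng.gauss(0.0, 1.0) for _ in range(n_blocks)]
        stat = max(
            abs(sum(e[k] * block_sums[k][j] for k in range(n_blocks)))
            for j in range(p)
        ) / scale
        draws.append(stat)
    return draws


def empirical_quantile(draws, level):
    # Order-statistic quantile of the bootstrap draws.
    s = sorted(draws)
    idx = min(len(s) - 1, max(0, math.ceil(level * len(s)) - 1))
    return s[idx]
```

Under these assumptions, a uniform confidence band for the mean vector takes the form sample mean plus or minus `q / sqrt(n)` in every coordinate, where `q` is, for example, `empirical_quantile(draws, 0.95)`.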

Preprint, 2014, arXiv

On the Coverage Bound Problem of Empirical Likelihood Methods For Time Series

The upper bounds on the coverage probabilities of the confidence regions based on blockwise empirical likelihood [Kitamura (1997)] and nonstandard expansive empirical likelihood [Nordman et al. (2013)] methods for time series data are investigated via studying the probability for the violation of the convex hull constraint. The large sample bounds are derived on the basis of the pivotal limit of the blockwise empirical log-likelihood ratio obtained under the fixed-b asymptotics, which has been recently shown to provide a more accurate approximation to the finite sample distribution than the conventional chi-square approximation. Our theoretical and numerical findings suggest that both the finite sample and large sample upper bounds for coverage probabilities are strictly less than one and the blockwise empirical likelihood confidence region can exhibit serious undercoverage when (i) the dimension of moment conditions is moderate or large; (ii) the time series dependence is positively strong; or (iii) the block size is large relative to sample size. A similar finite sample coverage problem occurs for the nonstandard expansive empirical likelihood. To alleviate the coverage bound problem, we propose to penalize both empirical likelihood methods by relaxing the convex hull constraint. Numerical simulations and data illustration demonstrate the effectiveness of our proposed remedies in terms of delivering confidence sets with more accurate coverage.

Preprint, 2013, arXiv

Fixed-smoothing asymptotics for time series

In this paper, we derive higher order Edgeworth expansions for the finite sample distributions of the subsampling-based t-statistic and the Wald statistic in the Gaussian location model under the so-called fixed-smoothing paradigm. In particular, we show that the error of asymptotic approximation is at the order of the reciprocal of the sample size and obtain explicit forms for the leading error terms in the expansions. The results are used to justify the second-order correctness of a new bootstrap method, the Gaussian dependent bootstrap, in the context of the Gaussian location model.
