Source author record

Gongjun Xu

Gongjun Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Methodology, math.ST, Statistics Theory, math.PR, Computation, Applications, cond-mat.mes-hall, Information Theory, Machine Learning, math.OC

Catalog footprint

What is connected

20 works

10 topics

4 close collaborators

preprint 2026 arXiv

Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a benign regularizer that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.

preprint 2026 arXiv

Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness

Latent variable models are popularly used to measure latent factors (e.g., abilities and personalities) from large-scale assessment data. Beyond understanding these latent factors, the covariate effect on responses controlling for latent factors is also of great scientific interest and has wide applications, such as evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased toward certain individual characteristics (e.g., gender and race), taking into account their latent abilities. However, the large sample sizes and test lengths pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete responses, nonlinear latent factor models are often assumed, adding further complexity. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions to address the identifiability issue. Based on the identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects. Furthermore, we derive estimation and inference results for latent factors and the factor loadings. We illustrate the finite sample performance of the proposed method through extensive numerical studies and an educational assessment dataset from the Programme for International Student Assessment (PISA).

preprint 2023 arXiv

Statistical Inference for Noisy Incomplete Binary Matrix

We consider statistical inference for a noisy incomplete binary (or 1-bit) matrix. Despite the importance of uncertainty quantification to matrix completion, most of the categorical matrix completion literature focuses on point estimation and prediction. This paper moves one step further toward statistical inference for binary matrix completion. Under a popular nonlinear factor analysis model, we obtain a point estimator and derive its asymptotic normality. Moreover, our analysis adopts a flexible missing-entry design that does not require a random sampling scheme as required by most of the existing asymptotic results for matrix completion. Under reasonable conditions, the proposed estimator is statistically efficient and optimal in the sense that the Cramér-Rao lower bound is achieved asymptotically for the model parameters. Two applications are considered, including (1) linking two forms of an educational test and (2) linking the roll call voting records from multiple years in the United States Senate. The first application enables the comparison between examinees who took different test forms, and the second application allows us to compare the liberal-conservativeness of senators who did not serve in the Senate at the same time.

preprint 2022 arXiv

Adaptive Tests for Bandedness of High-dimensional Covariance Matrices

Estimation of the high-dimensional banded covariance matrix is widely used in multivariate statistical analysis. To ensure the validity of estimation, we aim to test the hypothesis that the covariance matrix is banded with a certain bandwidth under the high-dimensional framework. Though several testing methods have been proposed in the literature, the existing tests are only powerful for some alternatives with certain sparsity levels, whereas they may not be powerful for alternatives with other sparsity structures. The goal of this paper is to propose a new test for the bandedness of a high-dimensional covariance matrix, which is powerful for alternatives with various sparsity levels. The proposed new test can also be used for testing the banded structure of covariance matrices of error vectors in high-dimensional factor models. Based on these statistics, a consistent bandwidth estimator is also introduced for a banded high-dimensional covariance matrix. Extensive simulation studies and an application to a prostate cancer dataset from protein mass spectroscopy are conducted for evaluating the effectiveness of the proposed adaptive tests and bandwidth estimator for the banded covariance matrix.

preprint 2022 arXiv

Partial-Mastery Cognitive Diagnosis Models

Cognitive diagnosis models (CDMs) are a family of discrete latent attribute models that serve as a statistical basis in educational and psychological cognitive diagnosis assessments. CDMs aim to achieve fine-grained inference on individuals' latent attributes, based on their observed responses to a set of designed diagnostic items. In the literature, CDMs usually assume that items require mastery of specific latent attributes and that each attribute is either fully mastered or not mastered by a given subject. We propose a new class of models, partial mastery CDMs (PM-CDMs), that generalizes CDMs by allowing for partial mastery levels for each attribute of interest. We demonstrate that PM-CDMs can be represented as restricted latent class models. Relying on the latent class representation, we propose a Bayesian approach for estimation. We present simulation studies to demonstrate parameter recovery, to investigate the impact of model misspecification with respect to partial mastery, and to develop diagnostic tools that could be used by practitioners to decide between CDMs and PM-CDMs. We use two examples of real test data -- the fraction subtraction and the English tests -- to demonstrate that employing PM-CDMs not only improves model fit, compared to CDMs, but also can make a substantial difference in conclusions about attribute mastery. We conclude that PM-CDMs can lead to more effective remediation programs by providing detailed individual-level information about skills learned and skills that still need study.

preprint 2022 arXiv

Regression Modeling for Recurrent Events Using R Package reReg

Recurrent event analyses have found a wide range of applications in biomedicine, public health, and engineering, among others, where study subjects may experience a sequence of events of interest during follow-up. The R package reReg (Chiou and Huang 2021) offers a comprehensive collection of practical and easy-to-use tools for regression analysis of recurrent events, possibly with the presence of an informative terminal event. The regression framework is a general scale-change model which encompasses the popular Cox-type model, the accelerated rate model, and the accelerated mean model as special cases. Informative censoring is accommodated through a subject-specific frailty with no need for parametric specification. Different regression models are allowed for the recurrent event process and the terminal event. Also included are visualization and simulation tools.

preprint 2021 arXiv

Hypothesis Testing for Hierarchical Structures in Cognitive Diagnosis Models

Cognitive Diagnosis Models (CDMs) are a special family of discrete latent variable models widely used in educational, psychological and social sciences. In many applications of CDMs, certain hierarchical structures among the latent attributes are assumed by researchers to characterize their dependence structure. Specifically, a directed acyclic graph is used to specify hierarchical constraints on the allowable configurations of the discrete latent attributes. In this paper, we consider the important yet unaddressed problem of testing the existence of latent hierarchical structures in CDMs. We first introduce the concept of testability of hierarchical structures in CDMs and present sufficient conditions. Then we study the asymptotic behaviors of the likelihood ratio test (LRT) statistic, which is widely used for testing nested models. Due to the irregularity of the problem, the asymptotic distribution of LRT becomes nonstandard and tends to provide unsatisfactory finite sample performance under practical conditions. We provide statistical insights on such failures, and propose to use parametric bootstrap to perform the testing. We also demonstrate the effectiveness and superiority of parametric bootstrap for testing the latent hierarchies over non-parametric bootstrap and the naïve Chi-squared test through comprehensive simulations and an educational assessment dataset.

preprint 2021 arXiv

Sequential Gibbs Sampling Algorithm for Cognitive Diagnosis Models with Many Attributes

Cognitive diagnosis models (CDMs) are useful statistical tools that provide rich information relevant for intervention and learning. As a popular approach to estimation and inference for CDMs, the Markov chain Monte Carlo (MCMC) algorithm is widely used in practice. However, when the number of attributes, $K$, is large, the existing MCMC algorithm may become time-consuming, because $O(2^K)$ calculations are usually needed in the process of MCMC sampling to get the conditional distribution for each attribute profile. To overcome this computational issue, motivated by Culpepper and Hudson (2018), we propose a computationally efficient sequential Gibbs sampling method, which needs $O(K)$ calculations to sample each attribute profile. We use simulation and real data examples to show the good finite-sample performance of the proposed sequential Gibbs sampling, and its advantage over existing methods.
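
The contrast between $O(2^K)$ and $O(K)$ work per attribute profile can be illustrated with a generic coordinate-wise Gibbs update. The sketch below is not the authors' algorithm: it assumes a DINA-type response model with known slip and guess parameters and an independent Bernoulli prior on each attribute, and the names `dina_loglik` and `gibbs_sweep` are hypothetical. Each sweep visits the $K$ attributes one at a time and compares only two candidate values per attribute, rather than enumerating all $2^K$ profiles.

```python
import math
import random


def dina_loglik(alpha, responses, q_matrix, slip, guess):
    """Log-likelihood of one subject's binary responses under an assumed DINA model."""
    total = 0.0
    for x, q_row, s, g in zip(responses, q_matrix, slip, guess):
        # Ideal response is 1 only if every attribute the item requires is mastered.
        eta = all(a >= q for a, q in zip(alpha, q_row))
        p_correct = (1.0 - s) if eta else g
        total += math.log(p_correct if x == 1 else 1.0 - p_correct)
    return total


def gibbs_sweep(alpha, responses, q_matrix, slip, guess, prior, rng):
    """Resample each of the K attributes in turn, conditioning on the others."""
    alpha = list(alpha)
    for k in range(len(alpha)):
        log_w = []
        for value in (0, 1):
            alpha[k] = value
            log_prior = math.log(prior[k] if value == 1 else 1.0 - prior[k])
            log_w.append(log_prior + dina_loglik(alpha, responses, q_matrix, slip, guess))
        # P(alpha_k = 1 | other attributes, data), computed in a numerically stable way.
        diff = log_w[1] - log_w[0]
        if diff >= 0:
            p_one = 1.0 / (1.0 + math.exp(-diff))
        else:
            p_one = math.exp(diff) / (1.0 + math.exp(diff))
        alpha[k] = 1 if rng.random() < p_one else 0
    return alpha


# Toy data (made up for illustration): 4 items, K = 3 attributes.
q_matrix = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
responses = [1, 1, 0, 1]
slip = [0.1] * 4
guess = [0.2] * 4
prior = [0.5] * 3

rng = random.Random(0)
alpha = [0, 0, 0]
for _ in range(200):
    alpha = gibbs_sweep(alpha, responses, q_matrix, slip, guess, prior, rng)
print(alpha)
```

A full sampler would also update the item parameters and keep the draws after burn-in; the sketch only shows the per-attribute update that avoids enumerating every profile.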

preprint 2020 arXiv

Asymptotically Independent U-Statistics in High-Dimensional Testing

Many high-dimensional hypothesis tests aim to globally examine marginal or low-dimensional features of a high-dimensional joint distribution, such as testing of mean vectors, covariance matrices and regression coefficients. This paper constructs a family of U-statistics as unbiased estimators of the $\ell_p$-norms of those features. We show that under the null hypothesis, the U-statistics of different finite orders are asymptotically independent and normally distributed. Moreover, they are also asymptotically independent with the maximum-type test statistic, whose limiting distribution is an extreme value distribution. Based on the asymptotic independence property, we propose an adaptive testing procedure which combines $p$-values computed from the U-statistics of different orders. We further establish power analysis results and show that the proposed adaptive procedure maintains high power against various alternatives.
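
One standard way to combine $p$-values from independent test statistics is the minimum $p$-value rule, sketched below. The abstract does not state which combination rule the paper uses, so this choice is an assumption, and the function name `min_p_combination` is hypothetical. The rule relies on the fact that if $m$ $p$-values are independent and uniform under the null, then $P(\min_i p_i \le t) = 1 - (1 - t)^m$; with asymptotically independent statistics the resulting $p$-value is valid only asymptotically.

```python
def min_p_combination(p_values):
    """Combine independent p-values using the minimum p-value rule.

    Assumes the p-values are independent and uniform on [0, 1] under the null,
    so that P(min p <= t) = 1 - (1 - t)**m for m p-values.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    m = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** m


# Hypothetical p-values from statistics of different orders plus a maximum-type test.
combined = min_p_combination([0.01, 0.4, 0.9])
print(combined)
```

The combined value is driven by the smallest individual $p$-value, which is why a procedure of this kind can retain power when only one of the component tests is sensitive to the alternative at hand.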

preprint 2020 arXiv

Observation of the polaronic character of excitons in a two-dimensional semiconducting magnet $\mathrm{CrI_3}$

Exciton dynamics can be strongly affected by lattice vibrations through electron-phonon coupling. This is rarely explored in two-dimensional magnetic semiconductors. Focusing on bilayer $\mathrm{CrI_3}$, we first show the presence of strong electron-phonon coupling through temperature-dependent photoluminescence and absorption spectroscopy. We then report the observation of periodic broad modes up to the 8th order in Raman spectra, attributed to the polaronic character of excitons. We establish that this polaronic character is dominated by the coupling between the charge-transfer exciton at 1.96 eV and a longitudinal optical phonon at 120.6 cm$^{-1}$. We further show that the emergence of long-range magnetic order enhances the electron-phonon coupling strength by about 50$\%$ and that the transition from layered antiferromagnetic to ferromagnetic order tunes the spectral intensity of the periodic broad modes, suggesting a strong coupling among the lattice, charge and spin in two-dimensional $\mathrm{CrI_3}$. Our study opens opportunities for tailoring light-matter interactions in two-dimensional magnetic semiconductors.

preprint 2020 arXiv

On the Phase Transition of Wilk's Phenomenon

Wilk's theorem, which offers universal chi-squared approximations for likelihood ratio tests, is widely used in many scientific hypothesis testing problems. For modern datasets with increasing dimension, researchers have found that the conventional Wilk's phenomenon of the likelihood ratio test statistic often fails. Although new approximations have been proposed in high-dimensional settings, a clear statistical guideline is still lacking on how to choose between the conventional and newly proposed approximations, especially for moderate-dimensional data. To address this issue, we develop the necessary and sufficient phase transition conditions for Wilk's phenomenon under popular tests on multivariate mean and covariance structures. Moreover, we provide an in-depth analysis of the accuracy of chi-squared approximations by deriving their asymptotic biases. These results may provide helpful insights into the use of chi-squared approximations in scientific practice.

preprint 2016 arXiv

Identifiability of restricted latent class models with binary responses

Statistical latent class models are widely used in social and psychological research, yet it is often difficult to establish the identifiability of the model parameters. In this paper we consider the identifiability issue of a family of restricted latent class models, where the restriction structures are needed to reflect pre-specified assumptions on the related assessment. We establish the identifiability results in the strict sense and specify which types of restriction structure would give the identifiability of the model parameters. The results not only guarantee the validity of many of the popularly used models, but also provide a guideline for the related experimental design, where in the current applications the design is usually experience-based and identifiability is not guaranteed. Theoretically, we develop a new technique to establish the identifiability result, which may be extended to other restricted latent class models.

preprint 2016 arXiv

Rare-event Analysis for Extremal Eigenvalues of white Wishart matrices

In this paper we consider the extreme behavior of the extremal eigenvalues of white Wishart matrices, which plays an important role in multivariate analysis. In particular, we focus on the case when the dimension of the feature p is much larger than or comparable to the number of observations n, a common situation in modern data analysis. We provide asymptotic approximations and bounds for the tail probabilities of the extremal eigenvalues. Moreover, we construct efficient Monte Carlo simulation algorithms to compute the tail probabilities. Simulation results show that our method has the best performance amongst known approximation approaches, and furthermore provides an efficient and accurate way for evaluating the tail probabilities in practice.

preprint 2014 arXiv

On the conditional distributions and the efficient simulations of exponential integrals of Gaussian random fields

In this paper, we consider the extreme behavior of a Gaussian random field $f(t)$ living on a compact set $T$. In particular, we are interested in tail events associated with the integral $\int_Te^{f(t)}\,dt$. We construct a (non-Gaussian) random field whose distribution can be explicitly stated. This field approximates the conditional Gaussian random field $f$ (given that $\int_Te^{f(t)}\,dt$ exceeds a large value) in total variation. Based on this approximation, we show that the tail event of $\int_Te^{f(t)}\,dt$ is asymptotically equivalent to the tail event of $\sup_Tγ(t)$ where $γ(t)$ is a Gaussian process and it is an affine function of $f(t)$ and its derivative field. In addition to the asymptotic description of the conditional field, we construct an efficient Monte Carlo estimator that runs in polynomial time of $\log b$ to compute the probability $P(\int_Te^{f(t)}\,dt>b)$ with a prescribed relative accuracy.

preprint 2013 arXiv

Bootstrapping a Change-Point Cox Model for Survival Data

This paper investigates the (in)-consistency of various bootstrap methods for making inference on a change-point in time in the Cox model with right censored survival data. A criterion is established for the consistency of any bootstrap method. It is shown that the usual nonparametric bootstrap is inconsistent for the maximum partial likelihood estimation of the change-point. A new model-based bootstrap approach is proposed and its consistency established. Simulation studies are carried out to assess the performance of various bootstrap schemes.

preprint 2013 arXiv

Model Based Bootstrap Methods for Interval Censored Data

We investigate the performance of model based bootstrap methods for constructing point-wise confidence intervals around the survival function with interval censored data. We show that bootstrapping from the nonparametric maximum likelihood estimator of the survival function is inconsistent for both the current status and case 2 interval censoring models. A model based smoothed bootstrap procedure is proposed and shown to be consistent. In addition, simulation studies are conducted to illustrate the (in)-consistency of the bootstrap methods. Our conclusions in the interval censoring model would extend more generally to estimators in regression models that exhibit non-standard rates of convergence.

preprint 2013 arXiv

Sequential Analysis of Cox Model under Response Dependent Allocation

Sellke and Siegmund (1983) developed the Brownian approximation to the Cox partial likelihood score as a process of calendar time, laying the foundation for group sequential analysis of survival studies. We extend their results to cover situations in which treatment allocations may depend on observed outcomes. The new development makes use of the entry time and calendar time along with the corresponding $σ$-filtrations to handle the natural information accumulation. Large sample properties are established under suitable regularity conditions.

preprint 2013 arXiv

Theory of self-learning $Q$-matrix

Cognitive assessment is a growing area in psychological and educational measurement, where tests are given to assess mastery/deficiency of attributes or skills. A key issue is the correct identification of attributes associated with items in a test. In this paper, we set up a mathematical framework under which theoretical properties may be discussed. We establish sufficient conditions to ensure that the attributes required by each item are learnable from the data.

preprint 2011 arXiv

Learning Item-Attribute Relationship in Q-Matrix Based Diagnostic Classification Models

The recent surge of interest in cognitive assessment has led to the development of novel statistical models for diagnostic classification. Central to many such models is the well-known Q-matrix, which specifies the item-attribute relationship. This paper proposes a principled estimation procedure for the Q-matrix and related model parameters. Desirable theoretical properties are established through large sample analysis. The proposed method also provides a platform under which important statistical issues, such as hypothesis testing and model selection, can be addressed.

preprint 2011 arXiv

Some Asymptotic Results of Gaussian Random Fields with Varying Mean Functions and the Associated Processes

In this paper, we derive tail approximations of integrals of exponential functions of Gaussian random fields with varying mean functions and approximations of the associated point processes. This study is motivated naturally by multiple applications such as hypothesis testing for spatial models, study of the distribution of Bayesian marginal likelihood and Bayes factor, and financial applications.
