Source author record

Anru Zhang

Anru Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.ST Statistics Theory Methodology Machine Learning Information Theory math.IT Applications Computation math.NA math.PR Numerical Analysis

Catalog footprint

What is connected

14works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

High-Dimensional Statistics: Reflections on Progress and Open Problems

Over the past two decades, the field of high-dimensional statistics has experienced substantial progress, driven largely by technological advances that have dramatically reduced the cost and effort for data collection and storage across a broad range of domains, including biology, medicine, astronomy, and the social and environmental sciences. Modern datasets are increasingly complex, often exhibiting rich dependency, heterogeneity, and other features that challenge traditional statistical methods. In response, high-dimensional statistics has evolved to address more sophisticated estimation and inference problems. This evolution has, in turn, fostered deep connections with and contributions to a wide range of research areas, including optimization, concentration of measure, random matrix theory, information theory, and theoretical computer science. Given the rapid pace of recent developments in high-dimensional statistics, our goal is to synthesize representative advances, highlight common themes and open problems, and point to important works that offer entry points into the field.

preprint2020arXiv

ISLET: Fast and Optimal Low-rank Tensor Regression via Importance Sketching

In this paper, we develop a novel procedure for low-rank tensor regression, namely \emph{\underline{I}mportance \underline{S}ketching \underline{L}ow-rank \underline{E}stimation for \underline{T}ensors} (ISLET). The central idea behind ISLET is \emph{importance sketching}, i.e., carefully designed sketches based on both the responses and low-dimensional structure of the parameter of interest. We show that the proposed method is sharply minimax optimal in terms of the mean-squared error under low-rank Tucker assumptions and under randomized Gaussian ensemble design. In addition, if a tensor is low-rank with group sparsity, our procedure also achieves minimax optimality. Further, we show through numerical study that ISLET achieves comparable or better mean-squared error performance to existing state-of-the-art methods while having substantial storage and run-time advantages including capabilities for parallel and distributed computing. In particular, our procedure performs reliable estimation with tensors of dimension $p = O(10^8)$ and is $1$ or $2$ orders of magnitude faster than baseline methods.

preprint2020arXiv

Rate-Optimal Perturbation Bounds for Singular Subspaces with Applications to High-Dimensional Statistics

Perturbation bounds for singular spaces, in particular Wedin's $\sin Θ$ theorem, are a fundamental tool in many fields including high-dimensional statistics, machine learning, and applied mathematics. In this paper, we establish separate perturbation bounds, measured in both spectral and Frobenius $\sin Θ$ distances, for the left and right singular subspaces. Lower bounds, which show that the individual perturbation bounds are rate-optimal, are also given. The new perturbation bounds are applicable to a wide range of problems. In this paper, we consider in detail applications to low-rank matrix denoising and singular space estimation, high-dimensional clustering, and canonical correlation analysis (CCA). In particular, separate matching upper and lower bounds are obtained for estimating the left and right singular spaces. To the best of our knowledge, this is the first result that gives different optimal rates for the left and right singular spaces under the same perturbation. In addition to these problems, applications to other high-dimensional problems such as community detection in bipartite networks, multidimensional scaling, and cross-covariance matrix estimation are also discussed.

preprint2020arXiv

Sparse and Low-rank Tensor Estimation via Cubic Sketchings

In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.

preprint2020arXiv

Tensor SVD: Statistical and Computational Limits

In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits for tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretical lower bound implies that it is impossible to have consistent estimation in general; with moderate SNR, we show that the non-convex maximum likelihood estimation provides optimal solution, but with NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, there are no polynomial-time algorithms performing consistently in general.

preprint2016arXiv

Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data

Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data.

preprint2016arXiv

Regression Analysis for Microbiome Compositional Data

One important problem in microbiome analysis is to identify the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa at different taxonomic levels. This paper considers regression analysis with such compositional data as covariates. In order to satisfy the subcompositional coherence of the results, linear models with a set of linear constraints on the regression coefficients are introduced. Such models allow regression analysis for subcompositions and include the log-contrast model for compositional covariates as a special case. A penalized estimation procedure for estimating the regression coefficients and for selecting variables under the linear constraints is developed. A method is also proposed to obtain de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals of the regression coefficients and can be used to obtain the $p$-values. Simulation results show the validity of the confidence intervals and smaller variances of the de-biased estimates when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.

preprint2015arXiv

Inference for High-dimensional Differential Correlation Matrices

Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.

preprint2015arXiv

Structured Matrix Completion with Applications to Genomic Data Integration

Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite sample under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extent of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival.

preprint2014arXiv

Instrumental Variables Estimation with Some Invalid Instruments and its Application to Mendelian Randomization

Instrumental variables have been widely used for estimating the causal effect between exposure and outcome. Conventional estimation methods require complete knowledge about all the instruments' validity; a valid instrument must not have a direct effect on the outcome and not be related to unmeasured confounders. Often, this is impractical as highlighted by Mendelian randomization studies where genetic markers are used as instruments and complete knowledge about instruments' validity is equivalent to complete knowledge about the involved genes' functions. In this paper, we propose a method for estimation of causal effects when this complete knowledge is absent. It is shown that causal effects are identified and can be estimated as long as less than $50$% of instruments are invalid, without knowing which of the instruments are invalid. We also introduce conditions for identification when the 50% threshold is violated. A fast penalized $\ell_1$ estimation method, called sisVIVE, is introduced for estimating the causal effect without knowing which instruments are valid, with theoretical guarantees on its performance. The proposed method is demonstrated on simulated data and a real Mendelian randomization study concerning the effect of body mass index on health-related quality of life index. An R package \emph{sisVIVE} is available online.

preprint2014arXiv

ROP: Matrix recovery via rank-one projections

Estimation of low-rank matrices is of significant interest in a range of contemporary applications. In this paper, we introduce a rank-one projection model for low-rank matrix recovery and propose a constrained nuclear norm minimization method for stable recovery of low-rank matrices in the noisy case. The procedure is adaptive to the rank and robust against small perturbations. Both upper and lower bounds for the estimation accuracy under the Frobenius norm loss are obtained. The proposed estimator is shown to be rate-optimal under certain conditions. The estimator is easy to implement via convex programming and performs well numerically. The techniques and main results developed in the paper also have implications to other related statistical problems. An application to estimation of spiked covariance matrices from one-dimensional random projections is considered. The results demonstrate that it is still possible to accurately estimate the covariance matrix of a high-dimensional distribution based only on one-dimensional projections.

preprint2013arXiv

Compressed Sensing and Affine Rank Minimization under Restricted Isometry

This paper establishes new restricted isometry conditions for compressed sensing and affine rank minimization. It is shown for compressed sensing that $δ_{k}^A+θ_{k,k}^A < 1$ guarantees the exact recovery of all $k$ sparse signals in the noiseless case through the constrained $\ell_1$ minimization. Furthermore, the upper bound 1 is sharp in the sense that for any $ε> 0$, the condition $δ_k^A + θ_{k, k}^A < 1+ε$ is not sufficient to guarantee such exact recovery using any recovery method. Similarly, for affine rank minimization, if $δ_{r}^\mathcal{M}+θ_{r,r}^\mathcal{M}< 1$ then all matrices with rank at most $r$ can be reconstructed exactly in the noiseless case via the constrained nuclear norm minimization; and for any $ε> 0$, $δ_r^\mathcal{M} +θ_{r,r}^\mathcal{M} < 1+ε$ does not ensure such exact recovery using any method. Moreover, in the noisy case the conditions $δ_{k}^A+θ_{k,k}^A < 1$ and $δ_{r}^\mathcal{M}+θ_{r,r}^\mathcal{M}< 1$ are also sufficient for the stable recovery of sparse signals and low-rank matrices respectively. Applications and extensions are also discussed.

preprint2013arXiv

Sharp RIP Bound for Sparse Signal and Low-Rank Matrix Recovery

This paper establishes a sharp condition on the restricted isometry property (RIP) for both the sparse signal recovery and low-rank matrix recovery. It is shown that if the measurement matrix $A$ satisfies the RIP condition $δ_k^A<1/3$, then all $k$-sparse signals $β$ can be recovered exactly via the constrained $\ell_1$ minimization based on $y=Aβ$. Similarly, if the linear map $\cal M$ satisfies the RIP condition $δ_r^{\cal M}<1/3$, then all matrices $X$ of rank at most $r$ can be recovered exactly via the constrained nuclear norm minimization based on $b={\cal M}(X)$. Furthermore, in both cases it is not possible to do so in general when the condition does not hold. In addition, noisy cases are considered and oracle inequalities are given under the sharp RIP condition.

preprint2013arXiv

Sparse Representation of a Polytope and Recovery of Sparse Signals and Low-rank Matrices

This paper considers compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary while leads to sharp results. It is shown that for any given constant $t\ge {4/3}$, in compressed sensing $δ_{tk}^A < \sqrt{(t-1)/t}$ guarantees the exact recovery of all $k$ sparse signals in the noiseless case through the constrained $\ell_1$ minimization, and similarly in affine rank minimization $δ_{tr}^\mathcal{M}< \sqrt{(t-1)/t}$ ensures the exact reconstruction of all matrices with rank at most $r$ in the noiseless case via the constrained nuclear norm minimization. Moreover, for any $ε>0$, $δ_{tk}^A<\sqrt{\frac{t-1}{t}}+ε$ is not sufficient to guarantee the exact recovery of all $k$-sparse signals for large $k$. Similar result also holds for matrix recovery. In addition, the conditions $δ_{tk}^A < \sqrt{(t-1)/t}$ and $δ_{tr}^\mathcal{M}< \sqrt{(t-1)/t}$ are also shown to be sufficient respectively for stable recovery of approximately sparse signals and low-rank matrices in the noisy case.

Anru Zhang

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

High-Dimensional Statistics: Reflections on Progress and Open Problems

ISLET: Fast and Optimal Low-rank Tensor Regression via Importance Sketching

Rate-Optimal Perturbation Bounds for Singular Subspaces with Applications to High-Dimensional Statistics

Sparse and Low-rank Tensor Estimation via Cubic Sketchings

Tensor SVD: Statistical and Computational Limits

Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data

Regression Analysis for Microbiome Compositional Data

Inference for High-dimensional Differential Correlation Matrices

Structured Matrix Completion with Applications to Genomic Data Integration

Instrumental Variables Estimation with Some Invalid Instruments and its Application to Mendelian Randomization

ROP: Matrix recovery via rank-one projections

Compressed Sensing and Affine Rank Minimization under Restricted Isometry

Sharp RIP Bound for Sparse Signal and Low-Rank Matrix Recovery

Sparse Representation of a Polytope and Recovery of Sparse Signals and Low-rank Matrices