Source author record

Jianfeng Yao

Jianfeng Yao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

math.ST Statistics Theory math.PR Methodology Applications Information Theory Machine Learning math.IT q-fin.ST

Catalog footprint: 19 works, 9 topics, 4 close collaborators

preprint, 2022, arXiv

An Eigenvalue Ratio Approach to Inferring Population Structure from Whole Genome Sequencing Data

Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method (Patterson, Price, and Reich, 2006) originally developed for array-based genotype data for computing and selecting top principal components that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants p is much larger than the sample size n in sequencing data such that the sample-to-marker ratio n/p is nearly zero, violating the assumption of the Tracy-Widom test used in their method. Second, their method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative principal components based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
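
The eigenvalue-ratio idea described above can be illustrated with a minimal sketch. This is not the ERStruct implementation: the function name `eigenvalue_ratios`, the toy data, and the crude "smallest ratio" selection rule are assumptions made for illustration, whereas the paper calibrates the ratio against a null distribution derived from random matrix theory.

```python
import numpy as np

def eigenvalue_ratios(X):
    """Eigenvalues (descending) of the n x n Gram matrix and ratios lambda_{i+1} / lambda_i."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)               # center each variant (column)
    gram = Xc @ Xc.T / p                  # n x n, cheap when p >> n
    eig = np.linalg.eigvalsh(gram)[::-1]  # eigvalsh returns ascending order
    eig = eig[:-1]                        # centering forces one (numerically) zero eigenvalue
    return eig, eig[1:] / eig[:-1]

# Toy data (hypothetical): k = 2 strong latent directions plus unit-variance noise.
rng = np.random.default_rng(0)
n, p, k = 60, 2000, 2
X = 3.0 * rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + rng.normal(size=(n, p))

eig, ratios = eigenvalue_ratios(X)
k_hat = int(np.argmin(ratios)) + 1        # crude rule: position of the largest relative drop
print(k_hat)
```

Working with the n x n Gram matrix rather than the p x p covariance matrix keeps the computation cheap in the regime the abstract targets, where p is far larger than n; both matrices share the same nonzero eigenvalues.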

preprint, 2022, arXiv

Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping

Much research effort has been devoted to explaining the success of deep learning. Random Matrix Theory (RMT) provides an emerging way to this end: spectral analysis of large random matrices involved in a trained deep neural network (DNN) such as weight matrices or Hessian matrices with respect to the stochastic gradient descent algorithm. To gain a more comprehensive understanding of weight matrices spectra, we conduct extensive experiments on weight matrices in different modules, e.g., layers, networks and data sets. Following the previous work of \cite{martin2018implicit}, we classify the spectra in the terminal stage into three main types: Light Tail (LT), Bulk Transition period (BT) and Heavy Tail (HT). These different types, especially HT, implicitly indicate some regularization in the DNNs. A main contribution of the paper is that we identify the difficulty of the classification problem as a driving factor for the appearance of heavy tails in weight matrices spectra: the higher the classification difficulty, the higher the chance for HT to appear. Moreover, the classification difficulty can be affected by the signal-to-noise ratio of the dataset, or by the complexity of the classification problem (complex features, large number of classes). Leveraging this finding, we further propose a spectral criterion to detect the appearance of heavy tails and use it to stop the training process early without testing data. Such early-stopped DNNs have the merit of avoiding overfitting and unnecessary extra training while preserving a comparable generalization ability. These findings are validated in several NNs, using Gaussian synthetic data and real data sets (MNIST and CIFAR10).

preprint, 2021, arXiv

On eigenvalues of a high-dimensional spatial-sign covariance matrix

This paper investigates limiting properties of eigenvalues of multivariate sample spatial-sign covariance matrices when both the number of variables and the sample size grow to infinity. The underlying p-variate populations are general enough to include the popular independent components model and the family of elliptical distributions. A first result of the paper establishes that the distribution of the eigenvalues converges to a deterministic limit that belongs to the family of generalized Marčenko-Pastur distributions. Furthermore, a new central limit theorem is established for a class of linear spectral statistics. We develop two applications of these results to robust statistics for a high-dimensional shape matrix. First, two statistics are proposed for testing sphericity. Next, a spectrum-corrected estimator using the sample spatial-sign covariance matrix is proposed. Simulation experiments show that in high dimension, the sample spatial-sign covariance matrix provides a valid and robust tool for mitigating the influence of outliers.
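
For readers unfamiliar with the object, a sample spatial-sign covariance matrix can be sketched as follows. The function name `spatial_sign_covariance`, the default center at the origin, and the 1/n normalization are assumptions for illustration; the abstract does not state the paper's location estimate or scaling.

```python
import numpy as np

def spatial_sign_covariance(X, center=None):
    """Average outer product of the unit-length (spatial-sign) versions of the rows of X."""
    n, p = X.shape
    if center is None:
        center = np.zeros(p)              # assumption: location known and equal to zero
    D = X - center
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    keep = norms[:, 0] > 0                # rows equal to the center have no direction
    S = D[keep] / norms[keep]             # project every observation onto the unit sphere
    return S.T @ S / S.shape[0]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
sscm = spatial_sign_covariance(X)

# Inflating one observation by a huge factor leaves the estimate unchanged.
X_out = X.copy()
X_out[0] *= 1e6
sscm_out = spatial_sign_covariance(X_out)
```

Because each observation contributes only its direction, a single extreme row cannot dominate the estimate, which is the robustness property the abstract refers to. The trace of the matrix is always 1 under this normalization.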

preprint, 2020, arXiv

Eigenvalue distributions of high-dimensional matrix processes driven by fractional Brownian motion

In this article, we study the high-dimensional behavior of empirical spectral distributions $\{L_N(t), t\in[0,T]\}$ for a class of $N\times N$ symmetric/Hermitian random matrices, whose entries are generated from the solution of a stochastic differential equation driven by fractional Brownian motion with Hurst parameter $H \in(1/2,1)$. For Wigner-type matrices, we obtain almost sure relative compactness of $\{L_N(t), t\in[0,T]\}_{N\in\mathbb N}$ in $C([0,T], \mathbf P(\mathbb R))$ following the approach in \cite{Anderson2010}; for Wishart-type matrices, we obtain tightness of $\{L_N(t), t\in[0,T]\}_{N\in\mathbb N}$ on $C([0,T], \mathbf P(\mathbb R))$ by the tightness criteria provided in Appendix \ref{subset:tightness argument}. The limit of $\{L_N(t), t\in[0,T]\}$ as $N\to \infty$ is also characterised.

preprint, 2020, arXiv

On Laplacian spectrum of dendrite trees

For dendrite graphs from biological experiments on mouse retinal ganglion cells, a paper by Nakatsukasa, Saito and Woei reveals a mysterious phase transition phenomenon in the spectra of the corresponding graph Laplacian matrices. While the bulk of the spectrum can be well understood by structures resembling starlike trees, mysteries about the spikes, that is, isolated eigenvalues outside the bulk spectrum, remain unexplained. In this paper, we bring new insights on these mysteries by considering a class of uniform trees. Exact relationships between the number of such spikes and the number of T-junctions are analyzed as a function of the number of vertices separating the T-junctions. Using these theoretical results, predictions are proposed for the number of spikes observed in real-life dendrite graphs. Interestingly enough, these predictions match the observed numbers of spikes well, thus confirming the practical meaningfulness of our theoretical results.

preprint, 2016, arXiv

Gaussian fluctuations for linear spectral statistics of large random covariance matrices

Consider an $N\times n$ matrix $Σ_n=\frac{1}{\sqrt{n}}R_n^{1/2}X_n$, where $R_n$ is a nonnegative definite Hermitian matrix and $X_n$ is a random matrix with i.i.d. real or complex standardized entries. The fluctuations of the linear statistics of the eigenvalues \[\operatorname {Trace}f \bigl(Σ_nΣ_n^*\bigr)=\sum_{i=1}^Nf(λ_i),\qquad (λ_i)\ \text{eigenvalues of}\ Σ_nΣ_n^*,\] are shown to be Gaussian, in the regime where both dimensions of matrix $Σ_n$ go to infinity at the same pace and in the case where $f$ is of class $C^3$, that is, has three continuous derivatives. The main improvements with respect to Bai and Silverstein's CLT [Ann. Probab. 32 (2004) 553-605] are twofold: First, we consider general entries with finite fourth moment, but whose fourth cumulant is nonnull, that is, whose fourth moment may differ from the moment of a (real or complex) Gaussian random variable. As a consequence, extra terms proportional to $ \vert \mathcal{V}\vert ^2=\bigl|\mathbb{E}\bigl(X_{11}^n\bigr) ^2\bigr|^2$ and $κ=\mathbb{E}\bigl \vert X_{11}^n\bigr \vert ^4-\vert {\mathcal{V}}\vert ^2-2$ appear in the limiting variance and in the limiting bias, which not only depend on the spectrum of matrix $R_n$ but also on its eigenvectors. Second, we relax the analyticity assumption over $f$ by representing the linear statistics with the help of Helffer-Sjöstrand's formula. The CLT is expressed in terms of vanishing Lévy-Prohorov distance between the linear statistics' distribution and a Gaussian probability distribution, the mean and the variance of which depend upon $N$ and $n$ and may not converge.
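
The linear spectral statistic defined in this abstract is straightforward to compute numerically. The sketch below only evaluates the statistic; the function name `linear_spectral_statistic`, the diagonal choice of $R_n$, and the Gaussian entries are assumptions for illustration, and nothing here reproduces the paper's centering or limiting variance.

```python
import numpy as np

def linear_spectral_statistic(R_sqrt, X, f):
    """Trace f(Sigma Sigma^*) = sum_i f(lambda_i) for Sigma = n^{-1/2} R^{1/2} X."""
    N, n = X.shape
    Sigma = R_sqrt @ X / np.sqrt(n)
    lam = np.linalg.eigvalsh(Sigma @ Sigma.conj().T)   # real eigenvalues of a Hermitian matrix
    return float(np.sum(f(lam)))

rng = np.random.default_rng(2)
N, n = 50, 100
R_sqrt = np.diag(np.sqrt(np.linspace(1.0, 2.0, N)))   # hypothetical R_n with spectrum in [1, 2]
X = rng.standard_normal((N, n))                        # standardized i.i.d. entries

stat_identity = linear_spectral_statistic(R_sqrt, X, lambda x: x)
stat_log = linear_spectral_statistic(R_sqrt, X, np.log1p)
```

With the identity function the statistic reduces to the squared Frobenius norm of $Σ_n$, which gives a quick sanity check on the implementation.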

preprint, 2016, arXiv

On the Surprising Explanatory Power of Higher Realized Moments in Practice

Realized moments of higher order computed from intraday returns have been introduced in recent years. The literature indicates that realized skewness is an important factor in explaining future asset returns. However, the literature mainly focuses on the whole market and on the monthly or weekly scale. In this paper, we conduct an extensive empirical analysis to investigate the forecasting abilities of realized skewness and realized kurtosis towards individual stocks' future return and variance at the daily scale. It is found that realized kurtosis possesses significant forecasting power for the stock's future variance. Meanwhile, realized skewness lacks explanatory power for the future daily return of individual stocks over a short horizon, in contrast with the existing literature.
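
As a point of reference, realized moments for one trading day can be computed as below. The abstract does not give the paper's exact definitions, so the normalization used here (a common convention in the realized-moments literature) and the function name `realized_moments` are assumptions.

```python
import numpy as np

def realized_moments(returns):
    """Realized variance, skewness and kurtosis from one day's intraday returns."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    rv = np.sum(r ** 2)                              # realized variance
    rskew = np.sqrt(n) * np.sum(r ** 3) / rv ** 1.5  # scaled third moment
    rkurt = n * np.sum(r ** 4) / rv ** 2             # scaled fourth moment
    return rv, rskew, rkurt

# Hypothetical day of 78 five-minute returns alternating between +0.1% and -0.1%.
r = 1e-3 * np.array([1.0, -1.0] * 39)
rv, rskew, rkurt = realized_moments(r)
```

For this symmetric, constant-magnitude toy series the realized skewness is 0 and the realized kurtosis is 1, the smallest value the kurtosis can take under this normalization.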

preprint, 2016, arXiv

Testing the Sphericity of a covariance matrix when the dimension is much larger than the sample size

This paper focuses on the prominent sphericity test when the dimension $p$ is much larger than the sample size $n$. The classical likelihood ratio test (LRT) is no longer applicable when $p\gg n$. Therefore a Quasi-LRT is proposed, and the asymptotic distribution of the test statistic under the null when $p/n\rightarrow\infty, n\rightarrow\infty$ is established in this paper. Meanwhile, John's test is found to possess the powerful {\it dimension-proof} property: it keeps exactly the same limiting distribution under the null for any $(n,p)$-asymptotic, i.e. $p/n\rightarrow c\in[0,\infty]$, $n\rightarrow\infty$. All asymptotic results are derived for general populations with finite fourth-order moments. Numerical experiments are implemented for comparison.
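
John's statistic, in its usual form $U = p^{-1}\operatorname{tr}[(S/(p^{-1}\operatorname{tr}S) - I_p)^2]$, can be evaluated without forming the $p\times p$ sample covariance matrix. The sketch below assumes mean-zero data and the uncentered covariance $S = X^{T}X/n$; the function name `john_statistic` is hypothetical, and the paper's normalization and limiting distribution are not reproduced here.

```python
import numpy as np

def john_statistic(X):
    """John's U computed from the n x n Gram matrix; rows of X are observations."""
    n, p = X.shape
    G = X @ X.T                                   # shares its nonzero spectrum with X^T X
    # U = p * tr(S^2) / tr(S)^2 - 1, and the 1/n scaling of S cancels in the ratio.
    return p * np.sum(G * G) / np.trace(G) ** 2 - 1.0

rng = np.random.default_rng(3)
n, p = 20, 500
X = rng.standard_normal((n, p))
U = john_statistic(X)

# Direct p x p evaluation, feasible here only because p is small.
S = X.T @ X / n
D = S / (np.trace(S) / p) - np.eye(p)
U_direct = np.trace(D @ D) / p
```

Expanding the square shows that $U = p\operatorname{tr}(S^2)/(\operatorname{tr}S)^2 - 1$, and both traces can be read off the Gram matrix, so the cost is driven by $n$ rather than $p$.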

preprint, 2015, arXiv

Forecasting High-Dimensional Realized Volatility Matrices Using A Factor Model

Modeling and forecasting covariance matrices of asset returns play a crucial role in finance. The availability of high frequency intraday data enables the modeling of the realized covariance matrix directly. However, most models in the literature suffer from the curse of dimensionality. To solve the problem, we propose a factor model with a diagonal CAW model for the factor realized covariance matrices. Asymptotic theory is derived for the estimated parameters. In an extensive empirical analysis, we find that the number of parameters can be reduced significantly. Furthermore, the proposed model maintains a comparable performance with a benchmark vector autoregressive model.

preprint, 2014, arXiv

CLT for large dimensional general Fisher matrices and its applications in high-dimensional data analysis

Random Fisher matrices arise naturally in multivariate statistical analysis, and understanding the properties of their eigenvalues is of primary importance for many hypothesis testing problems, such as testing the equality between two multivariate population covariance matrices or testing the independence between sub-groups of a multivariate random vector. This paper is concerned with the properties of a large-dimensional Fisher matrix when the dimension of the population is proportionally large compared to the sample size. Most existing works on Fisher matrices deal with a particular Fisher matrix where populations have i.i.d. components so that the population covariance matrices are all identity. In this paper, we consider general Fisher matrices with arbitrary population covariance matrices. The first main result of the paper establishes the limiting distribution of the eigenvalues of a Fisher matrix, while in a second main result we provide a central limit theorem for a wide class of functionals of its eigenvalues. Some applications of these results are also proposed for testing hypotheses on high-dimensional covariance matrices.

preprint, 2014, arXiv

Joint CLT for several random sesquilinear forms with applications to large-dimensional spiked population models

In this paper, we derive a joint central limit theorem for a random vector whose components are functions of random sesquilinear forms. This result is a natural extension of the existing central limit theory on random quadratic forms. We also provide applications in random matrix theory related to large-dimensional spiked population models. For the first application, we find the joint distribution of grouped extreme sample eigenvalues corresponding to the spikes. For the second application, under the assumption that the population covariance matrix is diagonal with $k$ (fixed) simple spikes, we derive the asymptotic joint distribution of the extreme sample eigenvalue and its corresponding sample eigenvector projection.

preprint, 2014, arXiv

On singular value distribution of large dimensional auto-covariance matrices

Let $(\varepsilon_j)_{j\geq 0}$ be a sequence of independent $p$-dimensional random vectors and $τ\geq1$ a given integer. From a sample $\varepsilon_1,\cdots,\varepsilon_{T+τ-1},\varepsilon_{T+τ}$ of the sequence, the so-called lag-$τ$ auto-covariance matrix is $C_τ=T^{-1}\sum_{j=1}^T\varepsilon_{τ+j}\varepsilon_{j}^t$. When the dimension $p$ is large compared to the sample size $T$, this paper establishes the limit of the singular value distribution of $C_τ$ assuming that $p$ and $T$ grow to infinity proportionally and the sequence satisfies a Lindeberg condition on fourth order moments. Compared to existing asymptotic results on sample covariance matrices developed in random matrix theory, the case of an auto-covariance matrix is much more involved due to the fact that the summands are dependent and the matrix $C_τ$ is not symmetric. Several new techniques are introduced for the derivation of the main theorem.
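
The auto-covariance matrix defined in this abstract and its singular values can be computed directly from the formula. In the sketch below the function name `lagged_autocovariance`, the Gaussian sample, and the chosen sizes are assumptions for illustration; row $j-1$ of the array `eps` holds $\varepsilon_j$.

```python
import numpy as np

def lagged_autocovariance(eps, tau):
    """C_tau = T^{-1} * sum_{j=1}^{T} eps_{tau+j} eps_j^t, for eps of shape (T + tau, p)."""
    T = eps.shape[0] - tau
    # Rows tau..T+tau-1 hold eps_{tau+j}; rows 0..T-1 hold eps_j, for j = 1..T.
    return eps[tau:].T @ eps[:T] / T

rng = np.random.default_rng(4)
T, p, tau = 100, 40, 1
eps = rng.standard_normal((T + tau, p))

C = lagged_autocovariance(eps, tau)
singular_values = np.linalg.svd(C, compute_uv=False)   # returned in descending order
```

Since $C_τ$ is not symmetric, its eigenvalues are generally complex, which is why the abstract studies singular values instead.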

preprint, 2013, arXiv

A local moment estimator of the spectrum of a large dimensional covariance matrix

This paper considers the problem of estimating the population spectral distribution from a sample covariance matrix in large dimensional situations. We generalize the contour-integral based method in Mestre (2008) and present a local moment estimation procedure. Compared with the original one, the new procedure can be applied successfully to models where the asymptotic clusters of sample eigenvalues generated by different population eigenvalues are not all separate. The proposed estimates are proved to be consistent. Numerical results illustrate the implementation of the estimation procedure and demonstrate its efficiency in various cases.

preprint, 2013, arXiv

A note on the CLT of the LSS for sample covariance matrix from a spiked population model

In this note, we establish an asymptotic expansion for the centering parameter appearing in the central limit theorems for linear spectral statistics of large-dimensional sample covariance matrices when the population has a spiked covariance structure. As an application, we provide an asymptotic power function for the corrected likelihood ratio statistic for testing the presence of spike eigenvalues in the population covariance matrix. This result generalizes an existing formula from the literature where only one simple spike exists.

preprint, 2013, arXiv

CLT for linear spectral statistics of random matrix $S^{-1}T$

This paper proposes a CLT for linear spectral statistics of the random matrix $S^{-1}T$ for a general non-negative definite and {\bf non-random} Hermitian matrix $T$.

preprint, 2013, arXiv

Estimation of the population spectral distribution from a large dimensional sample covariance matrix

This paper introduces a new method to estimate the spectral distribution of a population covariance matrix from high-dimensional data. The method is founded on a meaningful generalization of the seminal Marčenko-Pastur equation, originally defined in the complex plane, to the real line. Beyond its easy implementation and the established asymptotic consistency, the new estimator outperforms two existing estimators from the literature in almost all the situations tested in a simulation experiment. An application to the analysis of the correlation matrix of S&P stocks data is also given.

preprint, 2012, arXiv

Estimation of the Covariance Matrix of Large Dimensional Data

This paper deals with the problem of estimating the covariance matrix of a series of independent multivariate observations, in the case where the dimension of each observation is of the same order as the number of observations. Although such a regime is of interest for many current statistical signal processing and wireless communication issues, traditional methods fail to produce consistent estimators, and only recently have results relying on large random matrix theory been unveiled. In this paper, we develop the parametric framework proposed by Mestre, and consider a model where the covariance matrix to be estimated has a (known) finite number of eigenvalues, each with an unknown multiplicity. The main contributions of this work are essentially threefold with respect to existing results, and in particular to Mestre's work: to relax the (restrictive) separability assumption, to provide joint consistent estimates for the eigenvalues and their multiplicities, and to study the variance error by means of a central limit theorem.

preprint, 2011, arXiv

A note on a Marčenko-Pastur type theorem for time series

In this note we develop an extension of the Marčenko-Pastur theorem to time series models with temporal correlations. The limiting spectral distribution (LSD) of the sample covariance matrix is characterised by an explicit equation for its Stieltjes transform depending on the spectral density of the time series. A numerical algorithm is then given to compute the density functions of these LSDs.

preprint, 2011, arXiv

Fluctuations of an improved population eigenvalue estimator in sample covariance matrix models

This article provides a central limit theorem for a consistent estimator of population eigenvalues with large multiplicities based on sample covariance matrices. The focus is on limited sample size situations, whereby the number of available observations is known and comparable in magnitude to the observation dimension. An exact expression as well as an empirical, asymptotically accurate, approximation of the limiting variance is derived. Simulations are performed that corroborate the theoretical claims. A specific application to wireless sensor networks is developed.