Researcher profile

Jianfeng Yao

Jianfeng Yao contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified
Verification L1
Unclaimed author
5 works
0 followers
5 topics
4 close collaborators

Published work

5 published items

Preprint, 2022, arXiv

An Eigenvalue Ratio Approach to Inferring Population Structure from Whole Genome Sequencing Data

Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method (Patterson, Price, and Reich, 2006) originally developed for array-based genotype data for computing and selecting top principal components that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants p is much larger than the sample size n in sequencing data such that the sample-to-marker ratio n/p is nearly zero, violating the assumption of the Tracy-Widom test used in their method. Second, their method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative principal components based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
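
The eigenvalue-ratio statistic mentioned in this abstract can be illustrated with a minimal Python sketch. It only computes ratios of consecutive sorted eigenvalues and reports where the sharpest relative drop occurs. The function names, the direction of the ratio, and the "smallest ratio" rule are assumptions made for illustration; ERStruct itself compares the ratios against a null distribution derived from random matrix theory, which is not reproduced here.

```python
def eigenvalue_ratios(eigenvalues):
    """Return r_k = lambda_(k+1) / lambda_k for eigenvalues sorted in descending order."""
    lam = sorted(eigenvalues, reverse=True)
    if any(v <= 0 for v in lam):
        raise ValueError("expected strictly positive eigenvalues")
    return [lam[k + 1] / lam[k] for k in range(len(lam) - 1)]


def largest_gap_index(eigenvalues):
    """Hypothetical heuristic: count of leading eigenvalues before the sharpest relative drop."""
    ratios = eigenvalue_ratios(eigenvalues)
    # The smallest ratio marks the largest relative gap between neighbours.
    k = min(range(len(ratios)), key=ratios.__getitem__)
    return k + 1


# Toy spectrum (made-up numbers): three large eigenvalues followed by a flat bulk.
toy = [50.0, 30.0, 20.0, 1.2, 1.1, 1.0, 0.9]
print(eigenvalue_ratios(toy))
print(largest_gap_index(toy))
```

On the toy spectrum the heuristic returns 3, the number of eigenvalues that sit above the bulk. A formal test would replace the "smallest ratio" rule with critical values from the null distribution of the ratios.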

Preprint, 2022, arXiv

Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping

Much research effort has been devoted to explaining the success of deep learning. Random Matrix Theory (RMT) provides an emerging way to this end: spectral analysis of large random matrices involved in a trained deep neural network (DNN), such as weight matrices or Hessian matrices with respect to the stochastic gradient descent algorithm. To gain a more comprehensive understanding of weight matrices spectra, we conduct extensive experiments on weight matrices in different modules, e.g., layers, networks and data sets. Following the previous work of Martin and Mahoney (2018), we classify the spectra in the terminal stage into three main types: Light Tail (LT), Bulk Transition period (BT) and Heavy Tail (HT). These different types, especially HT, implicitly indicate some regularization in the DNNs. A main contribution of the paper is that we identify the difficulty of the classification problem as a driving factor for the appearance of heavy tails in weight matrices spectra. The higher the classification difficulty, the higher the chance for HT to appear. Moreover, the classification difficulty can be affected by the signal-to-noise ratio of the dataset, or by the complexity of the classification problem (complex features, large number of classes). Leveraging this finding, we further propose a spectral criterion to detect the appearance of heavy tails and use it to stop the training process early without testing data. Such early-stopped DNNs have the merit of avoiding overfitting and unnecessary extra training while preserving a comparable generalization ability. These findings are validated in several NNs, using Gaussian synthetic data and real data sets (MNIST and CIFAR10).

Preprint, 2021, arXiv

On eigenvalues of a high-dimensional spatial-sign covariance matrix

This paper investigates limiting properties of eigenvalues of multivariate sample spatial-sign covariance matrices when both the number of variables and the sample size grow to infinity. The underlying p-variate populations are general enough to include the popular independent components model and the family of elliptical distributions. A first result of the paper establishes that the distribution of the eigenvalues converges to a deterministic limit that belongs to the family of generalized Marčenko-Pastur distributions. Furthermore, a new central limit theorem is established for a class of linear spectral statistics. We develop two applications of these results to robust statistics for a high-dimensional shape matrix. First, two statistics are proposed for testing sphericity. Next, a spectrum-corrected estimator using the sample spatial-sign covariance matrix is proposed. Simulation experiments show that in high dimension, the sample spatial-sign covariance matrix provides a valid and robust tool for mitigating the influence of outliers.
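
For readers unfamiliar with the object studied here, the following minimal Python sketch computes a sample spatial-sign covariance matrix: each centered observation is scaled to unit length, and the outer products of these unit vectors are averaged. The function name and the toy data are hypothetical, and the center is taken as given, which is a simplifying assumption rather than the paper's setup.

```python
import math


def spatial_sign_covariance(samples, center):
    """Average of outer products of unit-length centered observations."""
    p = len(center)
    acc = [[0.0] * p for _ in range(p)]
    used = 0
    for x in samples:
        d = [xi - ci for xi, ci in zip(x, center)]
        norm = math.sqrt(sum(v * v for v in d))
        if norm == 0.0:
            continue  # an observation equal to the center has no direction
        s = [v / norm for v in d]
        for i in range(p):
            for j in range(p):
                acc[i][j] += s[i] * s[j]
        used += 1
    if used == 0:
        raise ValueError("no observation differs from the center")
    return [[v / used for v in row] for row in acc]


# Toy data (made up): the last point is an extreme outlier along the second axis.
data = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [0.0, -1000.0]]
S = spatial_sign_covariance(data, center=[0.0, 0.0])
print(S)
```

Because every observation contributes only its direction, the outlier at distance 1000 carries the same weight as the other points, and the resulting matrix always has trace 1. This is the robustness property the abstract refers to.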

Preprint, 2020, arXiv

Eigenvalue distributions of high-dimensional matrix processes driven by fractional Brownian motion

In this article, we study the high-dimensional behavior of empirical spectral distributions $\{L_N(t), t\in[0,T]\}$ for a class of $N\times N$ symmetric/Hermitian random matrices, whose entries are generated from the solution of a stochastic differential equation driven by fractional Brownian motion with Hurst parameter $H \in(1/2,1)$. For Wigner-type matrices, we obtain almost sure relative compactness of $\{L_N(t), t\in[0,T]\}_{N\in\mathbb N}$ in $C([0,T], \mathbf P(\mathbb R))$ following the approach of Anderson, Guionnet and Zeitouni (2010); for Wishart-type matrices, we obtain tightness of $\{L_N(t), t\in[0,T]\}_{N\in\mathbb N}$ on $C([0,T], \mathbf P(\mathbb R))$ by tightness criteria provided in the appendix. The limit of $\{L_N(t), t\in[0,T]\}$ as $N\to \infty$ is also characterised.

Preprint, 2020, arXiv

On Laplacian spectrum of dendrite trees

For dendrite graphs from biological experiments on mouse retinal ganglion cells, a paper by Nakatsukasa, Saito and Woei reveals a mysterious phase transition phenomenon in the spectra of the corresponding graph Laplacian matrices. While the bulk of the spectrum can be well understood by structures resembling starlike trees, the spikes, that is, isolated eigenvalues outside the bulk spectrum, remain unexplained. In this paper, we bring new insights on these mysteries by considering a class of uniform trees. Exact relationships between the number of such spikes and the number of T-junctions are analyzed as a function of the number of vertices separating the T-junctions. Using these theoretical results, predictions are proposed for the number of spikes observed in real-life dendrite graphs. Interestingly enough, these predictions match the observed numbers of spikes well, thus confirming the practical relevance of our theoretical results.