Source author record

Zheng Tracy Ke

Zheng Tracy Ke appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, math.ST, Statistics Theory, Machine Learning, Applications, Digital Libraries, Social and Information Networks

Catalog footprint

What is connected

15 works

7 topics

4 close collaborators

preprint, 2022, arXiv

A Comparison of Hamming Errors of Representative Variable Selection Methods

Lasso is a celebrated method for variable selection in linear models, but it faces challenges when the variables are moderately or strongly correlated. This motivates alternative approaches such as using a non-convex penalty, adding a ridge regularization, or conducting a post-Lasso thresholding. In this paper, we compare Lasso with 5 other methods: Elastic net, SCAD, forward selection, thresholded Lasso, and forward backward selection. We measure their performances theoretically by the expected Hamming error, assuming that the regression coefficients are iid drawn from a two-point mixture and that the Gram matrix is block-wise diagonal. By deriving the rates of convergence of Hamming errors and the phase diagrams, we obtain useful conclusions about the pros and cons of different methods.
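The Hamming error used as the yardstick above can be illustrated with a small sketch. It assumes the usual definition (the number of coordinates where the estimated support and the true support disagree, i.e. false positives plus false negatives); the function name and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def hamming_error(beta_hat, beta_true):
    """Count coordinates where the estimated and true supports disagree."""
    est_support = np.asarray(beta_hat) != 0
    true_support = np.asarray(beta_true) != 0
    # A mismatch is either a false positive or a false negative.
    return int(np.sum(est_support != true_support))

# Hypothetical toy vectors: the true coefficients take two values (0 or 2).
beta_true = np.array([0.0, 2.0, 0.0, 2.0, 0.0])
beta_hat = np.array([0.0, 1.7, 0.4, 0.0, 0.0])  # one false positive, one miss
err = hamming_error(beta_hat, beta_true)
```

The expected Hamming error studied in the abstract would average this count over the random coefficients and noise; the sketch only evaluates a single realization.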

preprint, 2022, arXiv

Allocation of COVID-19 Testing Budget on a Commute Network of Counties

Screening testing is an effective tool to control the early spread of an infectious disease such as COVID-19. When the total testing capacity is limited, we aim to optimally allocate testing resources among n counties. We build a (weighted) commute network on counties, with the weight between two counties a decreasing function of their traffic distance. We introduce a network-based disease model, in which the number of newly confirmed cases of each county depends on the numbers of hidden cases of all counties on the network. Our proposed testing allocation strategy first uses historical data to learn model parameters and then decides the testing rates for all counties by solving an optimization problem. We apply the method to the commute networks of Massachusetts, USA and Hubei, China and observe its advantages over testing allocation strategies that ignore the network structure. Our approach can also be extended to study the vaccine allocation problem.

preprint, 2022, arXiv

Co-citation and Co-authorship Networks of Statisticians

We collected and cleaned a large data set on publications in statistics. The data set consists of the coauthor relationships and citation relationships of 83,331 papers published in 36 representative journals in statistics, probability, and machine learning, spanning 41 years. The data set allows us to construct many different networks, and motivates a number of research problems about the research patterns and trends, research impacts, and network topology of the statistics community. In this paper we focus on (i) using the citation relationships to estimate the research interests of authors, and (ii) using the coauthor relationships to study the network topology. Using co-citation networks we constructed, we discover a "statistics triangle", reminiscent of the statistical philosophy triangle (Efron, 1998). We propose new approaches to constructing the "research map" of statisticians, as well as the "research trajectory" for a given author to visualize the evolution of his/her research interests. Using co-authorship networks we constructed, we discover a multi-layer community tree and produce a Sankey diagram to visualize the author migrations in different sub-areas. We also propose several new metrics for research diversity of individual authors. We find that "Bayes", "Biostatistics", and "Nonparametric" are three primary areas in statistics. We also identify 15 sub-areas, each of which can be viewed as a weighted average of the primary areas, and identify several underlying reasons for the formation of co-authorship communities. We also find that the research interests of statisticians have evolved significantly in the 41-year time window we studied: some areas (e.g., biostatistics and high-dimensional data analysis) have become increasingly popular.

preprint, 2022, arXiv

Optimal Estimation of the Number of Communities

In network analysis, how to estimate the number of communities $K$ is a fundamental problem. We consider a broad setting where we allow severe degree heterogeneity and a wide range of sparsity levels, and propose Stepwise Goodness-of-Fit (StGoF) as a new approach. This is a stepwise algorithm, where for $m = 1, 2, \ldots$, we alternately use a community detection step and a goodness-of-fit (GoF) step. We adapt SCORE \cite{SCORE} for community detection, and propose a new GoF metric. We show that at step $m$, the GoF metric diverges to $\infty$ in probability for all $m < K$ and converges to $N(0,1)$ if $m = K$. This gives rise to a consistent estimate for $K$. Also, we discover the right way to define the signal-to-noise ratio (SNR) for our problem and show that consistent estimates for $K$ do not exist if $\mathrm{SNR} \to 0$, and StGoF is uniformly consistent for $K$ if $\mathrm{SNR} \to \infty$. Therefore, StGoF achieves the optimal phase transition. Similar stepwise methods (e.g., \cite{wang2017likelihood, ma2018determining}) are known to face analytical challenges. We overcome the challenges by using a different stepwise scheme in StGoF and by deriving sharp results that were not available before. The key to our analysis is to show that SCORE has the Non-Splitting Property (NSP). Primarily due to a non-tractable rotation of eigenvectors dictated by the Davis-Kahan $\sin(θ)$ theorem, the NSP is non-trivial to prove and requires new techniques we develop.

preprint, 2022, arXiv

Power Enhancement and Phase Transitions for Global Testing of the Mixed Membership Stochastic Block Model

The mixed-membership stochastic block model (MMSBM) is a common model for social networks. Given an $n$-node symmetric network generated from a $K$-community MMSBM, we would like to test $K=1$ versus $K>1$. We first study the degree-based $χ^2$ test and the orthodox Signed Quadrilateral (oSQ) test. These two statistics estimate an order-2 polynomial and an order-4 polynomial of a "signal" matrix, respectively. We derive the asymptotic null distribution and power for both tests. However, for each test, there exists a parameter regime where its power is unsatisfactory. This motivates us to propose a power enhancement (PE) test to combine the strengths of both tests. We show that the PE test has a tractable null distribution and improves the power of both tests. To assess the optimality of PE, we consider a randomized setting, where the $n$ membership vectors are independently drawn from a distribution on the standard simplex. We show that the success of global testing is governed by a quantity $β_n(K,P,h)$, which depends on the community structure matrix $P$ and the mean vector $h$ of memberships. For each given $(K, P, h)$, a test is called $\textit{optimal}$ if it distinguishes the two hypotheses when $β_n(K, P,h)\to\infty$. A test is called $\textit{optimally adaptive}$ if it is optimal for all $(K, P, h)$. We show that the PE test is optimally adaptive, while many existing tests are only optimal for some particular $(K, P, h)$ and hence are not optimally adaptive.

preprint, 2022, arXiv

The SCORE normalization, especially for highly heterogeneous network and text data

SCORE was introduced as a spectral approach to network community detection. Since many networks have severe degree heterogeneity, the ordinary spectral clustering (OSC) approach to community detection may perform unsatisfactorily. SCORE alleviates the effect of degree heterogeneity by introducing a new normalization idea in the spectral domain and makes OSC more effective. SCORE is easy to use and computationally fast. It adapts easily to new directions and is seeing increasing interest in practice. In this paper, we review the basics of SCORE, the adaptation of SCORE to network mixed membership estimation and topic modeling, and the application of SCORE to real data, including two datasets on the publications of statisticians. We also review the theoretical 'ideology' underlying SCORE. We show that in the spectral domain, SCORE converts a simplicial cone to a simplex, and provides a simple and direct link between the simplex and network memberships. SCORE attains an exponential rate and a sharp phase transition in community detection, and achieves optimal rates in mixed membership estimation and topic modeling.
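The normalization idea can be sketched as follows. This is a minimal illustration, assuming the normalization divides the 2nd through $K$-th leading eigenvectors of the adjacency matrix entrywise by the first one; the function name and the toy parameters are hypothetical, and the example uses a noise-free expected adjacency matrix rather than real data.

```python
import numpy as np

def score_ratios(A, K):
    """Entrywise ratios of eigenvectors 2..K against the leading eigenvector."""
    vals, vecs = np.linalg.eigh(A)           # A is assumed symmetric
    order = np.argsort(-np.abs(vals))[:K]    # K largest eigenvalues in magnitude
    xi = vecs[:, order]
    return xi[:, 1:] / xi[:, [0]]            # shape (n, K - 1)

# Hypothetical toy network: 2 communities with heterogeneous degree parameters.
theta = np.array([0.2, 0.9, 0.5, 0.3, 1.0, 0.6])
labels = np.array([0, 0, 0, 1, 1, 1])
P = np.array([[1.0, 0.3], [0.3, 1.0]])
# Noise-free "expected" adjacency matrix of a degree-corrected block model.
Omega = np.outer(theta, theta) * P[labels][:, labels]

R = score_ratios(Omega, K=2)
```

In this noise-free toy case the degree parameters cancel in the ratios, so nodes in the same community share one ratio value regardless of their degrees. With a noisy adjacency matrix the ratios would only be approximately constant, and a clustering step such as k-means on the rows of `R` would be needed to recover the communities.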

preprint, 2022, arXiv

Using SVD for Topic Modeling

The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the corpus matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers.

preprint, 2021, arXiv

Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis

The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is the determination of the number of spiked eigenvalues, $K$. For estimation of $K$, most attention has focused on the use of top eigenvalues of the sample covariance matrix, and there is little investigation into proper ways of utilizing bulk eigenvalues to estimate $K$. We propose a principled approach to incorporating bulk eigenvalues in the estimation of $K$. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates us to propose a two-step method: the first step uses bulk eigenvalues to estimate parameters of this distribution, and the second step leverages these parameters to assist the estimation of $K$. The resulting estimator $\hat{K}$ aggregates information in a large number of bulk eigenvalues. We show the consistency of $\hat{K}$ under a standard spiked covariance model. We also propose a confidence interval estimate for $K$. Our extensive simulation studies show that the proposed method is robust and outperforms the existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray data set and the 1000 Genomes data set.

preprint, 2020, arXiv

Community Detection for Hypergraph Networks via Regularized Tensor Power Iteration

To date, social network analysis has been largely focused on pairwise interactions. The study of higher-order interactions, via a hypergraph network, brings in new insights. We study community detection in a hypergraph network. A popular approach is to project the hypergraph to a graph and then apply community detection methods for graph networks, but we show that this approach may cause unwanted information loss. We propose a new method for community detection that operates directly on the hypergraph. At the heart of our method is a regularized higher-order orthogonal iteration (reg-HOOI) algorithm that computes an approximate low-rank decomposition of the network adjacency tensor. Compared with existing tensor decomposition methods such as HOSVD and vanilla HOOI, reg-HOOI yields better performance, especially when the hypergraph is sparse. Given the output of tensor decomposition, we then generalize the community detection method SCORE (Jin, 2015) from graph networks to hypergraph networks. We call our new method Tensor-SCORE. In theory, we introduce a degree-corrected block model for hypergraphs (hDCBM), and show that Tensor-SCORE yields consistent community detection for a wide range of network sparsity and degree heterogeneity. As a byproduct, we derive the rates of convergence on estimating the principal subspace by reg-HOOI, with different initializations, including the two new initialization methods we propose, a diagonal-removed HOSVD and a randomized graph projection. Applying our method to several real hypergraph networks yields encouraging results, which suggests that exploring higher-order interactions provides additional information not seen in graph representations.

preprint, 2020, arXiv

Measurement error models: from nonparametric methods to deep neural networks

The success of deep learning has inspired recent interest in applying neural networks to statistical inference. In this paper, we investigate the use of deep neural networks for nonparametric regression with measurement errors. We propose an efficient neural network design for estimating measurement error models, in which we use a fully connected feed-forward neural network (FNN) to approximate the regression function $f(x)$, a normalizing flow to approximate the prior distribution of $X$, and an inference network to approximate the posterior distribution of $X$. Our method utilizes recent advances in variational inference for deep neural networks, such as the importance weighted autoencoder, the doubly reparametrized gradient estimator, and non-linear independent components estimation. We conduct an extensive numerical study to compare the neural network approach with classical nonparametric methods and observe that the neural network approach is more flexible in accommodating different classes of regression functions and performs better than or comparably to the best available method in nearly all settings.

preprint, 2016, arXiv

A Geometrical Approach to Topic Model Estimation

In probabilistic topic models, the quantity of interest (a low-rank matrix consisting of topic vectors) is hidden in the text corpus matrix, masked by noise, and the Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However, the connection between this low-rank matrix and the singular vectors of the text corpus matrix is usually complicated and hard to spell out, so using SVD to learn topic models is challenging. In this paper, we overcome the challenge by revealing a surprising insight: there is a low-dimensional simplex structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix, and which allows us to conveniently reconstruct the former using the latter. Such an insight motivates a new SVD approach to learning topic models, which we analyze with delicate random matrix theory to derive the rate of convergence. We support our methods and theory numerically, using both simulated data and real data.

preprint, 2016, arXiv

Phase Transitions for High Dimensional Clustering and Related Problems

Consider a two-class clustering problem where we observe $X_i = \ell_i μ + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $1 \leq i \leq n$. The feature vector $μ \in R^p$ is unknown but is presumably sparse. The class labels $\ell_i \in \{-1, 1\}$ are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the Region of Impossibility and Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments. We also extend the study on statistical limits for clustering to that for signal recovery and that for hypothesis testing. We compare the statistical limits for the three problems and expose some interesting insights. We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t > 0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^2$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering. We find an interesting phase transition for IF-PCA. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.
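The IF-PCA rule described above (keep the columns of $X$ whose $L^2$-norm exceeds $t$, then apply classical PCA) can be sketched as follows. This is a minimal illustration, assuming that "classical PCA" for two classes means taking the signs of the leading left singular vector of the retained columns; the function name, the threshold value and the simulated data are hypothetical and chosen only to make the example easy to follow.

```python
import numpy as np

def if_pca(X, t):
    """Keep columns with L2-norm above t, then cluster by the leading left singular vector."""
    col_norms = np.linalg.norm(X, axis=0)
    keep = np.flatnonzero(col_norms > t)        # indices of retained features
    U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    labels = np.where(U[:, 0] >= 0, 1, -1)      # sign-based two-class labels
    return labels, keep

# Hypothetical simulation from the model X_i = ell_i * mu + Z_i with a sparse mu.
rng = np.random.default_rng(0)
n, p, num_useful = 40, 200, 10
ell = rng.choice([-1, 1], size=n)
mu = np.zeros(p)
mu[:num_useful] = 3.0                           # strong signals, for illustration only
X = np.outer(ell, mu) + rng.standard_normal((n, p))

labels, keep = if_pca(X, t=12.0)
# Cluster labels are only identified up to a global sign flip.
agreement = max(np.mean(labels == ell), np.mean(labels == -ell))
```

The signals here are deliberately strong, so the selection step isolates the useful features easily. The rare and weak regime studied in the abstract is much harder, and the choice of $t$ there is part of the analysis rather than a fixed constant.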

preprint, 2015, arXiv

QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization

We propose a novel Rayleigh quotient based sparse quadratic dimension reduction method, named QUADRO (Quadratic Dimension Reduction via Rayleigh Optimization), for analyzing high-dimensional data. Unlike in the linear setting where Rayleigh quotient optimization coincides with classification, these two problems are very different under nonlinear settings. In this paper, we clarify this difference and show that Rayleigh quotient optimization may be of independent scientific interest. One major challenge of Rayleigh quotient optimization is that the variance of quadratic statistics involves all fourth cross-moments of predictors, which are infeasible to compute for high-dimensional applications and may accumulate too many stochastic errors. This issue is resolved by considering a family of elliptical models. Moreover, for heavy-tail distributions, robust estimates of mean vectors and covariance matrices are employed to guarantee uniform convergence in estimating nonpolynomially many parameters, even though only the fourth moments are assumed. Methodologically, QUADRO is based on elliptical models which allow us to formulate the Rayleigh quotient maximization as a convex optimization problem. Computationally, we propose an efficient linearized augmented Lagrangian method to solve the constrained optimization problem. Theoretically, we provide explicit rates of convergence in terms of Rayleigh quotient under both Gaussian and general elliptical models. Thorough numerical results on both synthetic and real datasets are also provided to back up our theoretical results.

preprint, 2014, arXiv

Covariate assisted screening and estimation

Consider a linear model $Y=Xβ+z$, where $X=X_{n,p}$ and $z\sim N(0,I_n)$. The vector $β$ is unknown but is sparse in the sense that most of its coordinates are $0$. The main interest is to separate its nonzero coordinates from the zero ones (i.e., variable selection). Motivated by examples in long-memory time series (Fan and Yao [Nonlinear Time Series: Nonparametric and Parametric Methods (2003) Springer]) and the change-point problem (Bhattacharya [In Change-Point Problems (South Hadley, MA, 1992) (1994) 28-56 IMS]), we are primarily interested in the case where the Gram matrix $G=X'X$ is nonsparse but sparsifiable by a finite order linear filter. We focus on the regime where signals are both rare and weak so that successful variable selection is very challenging but is still possible. We approach this problem by a new procedure called the covariate assisted screening and estimation (CASE). CASE first uses a linear filtering to reduce the original setting to a new regression model where the corresponding Gram (covariance) matrix is sparse. The new covariance matrix induces a sparse graph, which guides us to conduct multivariate screening without visiting all the submodels. By interacting with the signal sparsity, the graph enables us to decompose the original problem into many separated small-size subproblems (if only we know where they are!). Linear filtering also induces a so-called problem of information leakage, which can be overcome by the newly introduced patching technique. Together, these give rise to CASE, which is a two-stage screen and clean [Fan and Song Ann. Statist. 38 (2010) 3567-3604; Wasserman and Roeder Ann. Statist. 37 (2009) 2178-2201] procedure, where we first identify candidates of these submodels by patching and screening, and then re-examine each candidate to remove false positives.

preprint, 2014, arXiv

Discussion: "A significance test for the lasso"

Discussion of "A significance test for the lasso" by Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].
