Source author record

Karl Rohe

Karl Rohe appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.ST, Statistics Theory, Machine Learning, Methodology, Social and Information Networks, Applications, Computational Complexity, Digital Libraries, math.PR

Catalog footprint

What is connected

15 works

9 topics

4 close collaborators


preprint, 2020, arXiv

Targeted sampling from massive block model graphs with personalized PageRank

The paper provides statistical theory and intuition for personalized PageRank (called "PPR"): a popular technique that samples a small community from a massive network. We study a setting where the entire network is expensive to obtain thoroughly or to maintain, but we can start from a seed node of interest and "crawl" the network to find other nodes through their connections. By crawling the graph in a designed way, the PPR vector can be approximated without querying the entire massive graph, making it an alternative to snowball sampling. Using the degree-corrected stochastic block model, we study whether the PPR vector can select nodes that belong to the same block as the seed node. We provide a simple and interpretable form for the PPR vector, highlighting its biases towards high degree nodes outside the target block. We examine a simple adjustment based on node degrees and establish consistency results for PPR clustering that allows for directed graphs. These results are enabled by recent technical advances showing the elementwise convergence of eigenvectors. We illustrate the method with the massive Twitter friendship graph, which we crawl by using the Twitter application programming interface. We find that the adjusted and unadjusted PPR techniques are complementary approaches, where the adjustment makes the results particularly localized around the seed node, and that the bias adjustment greatly benefits from degree regularization.
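As a minimal sketch of the idea in this abstract, the Python below approximates a PPR vector by power iteration on a toy graph and then divides each score by the node's degree. The toy graph, the function names, the teleportation constant `alpha = 0.15`, and the exact form of the degree adjustment are illustrative assumptions rather than details taken from the paper. The paper's crawling-based approximation is not reproduced here.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Approximate the personalized PageRank vector by power iteration.

    adj   : dict mapping each node to a non-empty list of out-neighbours
    seed  : node that the random walk teleports back to
    alpha : teleportation probability (assumed value, not from the paper)
    """
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha                      # teleport back to the seed
        for u, nbrs in adj.items():
            share = (1.0 - alpha) * p[u] / len(nbrs)
            for v in nbrs:                      # spread the rest along edges
                nxt[v] += share
        p = nxt
    return p


def degree_adjusted(ppr, adj):
    """Divide each PPR score by the node's degree (hypothetical adjustment)."""
    return {v: ppr[v] / len(adj[v]) for v in ppr}


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

ppr = personalized_pagerank(adj, seed=0)
adjusted = degree_adjusted(ppr, adj)

# Rank nodes by each score and keep the three highest.
top_ppr = sorted(ppr, key=ppr.get, reverse=True)[:3]
top_adjusted = sorted(adjusted, key=adjusted.get, reverse=True)[:3]
print(sorted(top_ppr), sorted(top_adjusted))
```

On this toy graph both rankings place the three nodes of the seed's triangle first; the graph is far too small to show the high-degree bias that the abstract discusses.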

preprint, 2020, arXiv

Vintage Factor Analysis with Varimax Performs Statistical Inference

Psychologists developed Multiple Factor Analysis to decompose multivariate data into a small number of interpretable factors without any a priori knowledge about those factors. In this form of factor analysis, the Varimax "factor rotation" is a key step to make the factors interpretable. Charles Spearman and many others objected to factor rotations because the factors seem to be rotationally invariant. These objections are still reported in all contemporary multivariate statistics textbooks. This is an enigma because this vintage form of factor analysis has survived and is widely popular because, empirically, the factor rotation often makes the factors easier to interpret. We argue that the rotation makes the factors easier to interpret because, in fact, the Varimax factor rotation performs statistical inference. We show that Principal Components Analysis (PCA) with the Varimax rotation provides a unified spectral estimation strategy for a broad class of modern factor models, including the Stochastic Blockmodel and a natural variation of Latent Dirichlet Allocation (i.e., "topic modeling"). In addition, we show that Thurstone's widely employed sparsity diagnostics implicitly assess a key "leptokurtic" condition that makes the rotation statistically identifiable in these models. Taken together, this shows that the know-how of Vintage Factor Analysis performs statistical inference, reversing nearly a century of statistical thinking on the topic. With a sparse eigensolver, PCA with Varimax is both fast and stable. Combined with Thurstone's straightforward diagnostics, this vintage approach is suitable for a wide array of modern applications.

preprint, 2017, arXiv

Generalized least squares can overcome the critical threshold in respondent-driven sampling

In order to sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from the sampled observations. Simulations on empirical social networks show that the feasible GLS (fGLS) estimators can have drastically smaller error and rarely increase the error. A diagnostic plot helps to identify where fGLS will aid estimation. The fGLS estimators continue to outperform standard estimators even when they are built from a misspecified model and when there is preferential recruitment.

preprint, 2016, arXiv

Asymptotic Theory for Estimating the Singular Vectors and Values of a Partially-observed Low Rank Matrix with Noise

Matrix completion algorithms recover a low rank matrix from a small fraction of the entries, each entry contaminated with additive errors. In practice, the singular vectors and singular values of the low rank matrix play a pivotal role for statistical analyses and inferences. This paper proposes estimators of these quantities and studies their asymptotic behavior. Under the setting where the dimensions of the matrix increase to infinity and the probability of observing each entry is identical, Theorem 4.1 gives the rate of convergence for the estimated singular vectors; Theorem 4.3 gives a multivariate central limit theorem for the estimated singular values. Even though the estimators use only a partially observed matrix, they achieve the same rates of convergence as the fully observed case. These estimators combine to form a consistent estimator of the full low rank matrix that is computed with a non-iterative algorithm. In the cases studied in this paper, this estimator achieves the minimax lower bound in Koltchinskii et al. (2011). The numerical experiments corroborate our theoretical results.

preprint, 2016, arXiv

Central limit theorems for network driven sampling

Respondent-Driven Sampling is a popular technique for sampling hidden populations. This paper models Respondent-Driven Sampling as a Markov process indexed by a tree. Our main results show that the Volz-Heckathorn estimator is asymptotically normal below a critical threshold. The key technical difficulties stem from (i) the dependence between samples and (ii) the tree structure which characterizes the dependence. The theorems allow the growth rate of the tree to exceed one and suggest that this growth rate should not be too large. To illustrate the usefulness of these results beyond their obvious use, an example shows that in certain cases the sample average is preferable to inverse probability weighting. We provide a test statistic to distinguish between these two cases.
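The contrast between the sample average and inverse probability weighting can be made concrete with a small sketch. It assumes the usual form of the Volz-Heckathorn estimator, in which each response is weighted by the reciprocal of the respondent's reported degree. The function names and the toy data are hypothetical, and the paper's test statistic is not reproduced.

```python
def sample_average(values):
    """Unweighted mean of the sampled responses."""
    return sum(values) / len(values)


def volz_heckathorn(values, degrees):
    """Inverse-degree weighted mean (assumed form of the estimator)."""
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)


# Hypothetical sample: a binary trait and each respondent's reported degree.
values = [1, 0, 1, 0]
degrees = [1, 2, 4, 4]

print(sample_average(values), volz_heckathorn(values, degrees))
```

The two estimates differ whenever the trait is associated with degree, as it is in this toy sample, where the low-degree respondent with the trait receives the largest weight.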

preprint, 2016, arXiv

Intelligent Initialization and Adaptive Thresholding for Iterative Matrix Completion; Some Statistical and Algorithmic Theory for Adaptive-Impute

Over the past decade, various matrix completion algorithms have been developed. Thresholded singular value decomposition (SVD) is a popular technique in implementing many of them. A sizable number of studies have shown its theoretical and empirical excellence, but choosing the right threshold level remains a key empirical difficulty. This paper proposes a novel matrix completion algorithm which iterates thresholded SVD with theoretically justified and data-dependent values of the thresholding parameters. The estimate of the proposed algorithm enjoys the minimax error rate and shows outstanding empirical performance. The thresholding scheme that we use can be viewed as a solution to a non-convex optimization problem whose theoretical convergence guarantees are known to be poorly understood. We investigate this problem by introducing a simpler algorithm, generalized-softImpute, analyzing its convergence behavior, and connecting it to the proposed algorithm.

preprint, 2015, arXiv

Co-clustering for directed graphs: the Stochastic co-Blockmodel and spectral algorithm Di-Sim

Directed graphs have asymmetric connections, yet the current graph clustering methodologies cannot identify the potentially global structure of these asymmetries. We give a spectral algorithm called di-sim that builds on a dual measure of similarity that corresponds to how a node (i) sends and (ii) receives edges. Using di-sim, we analyze the global asymmetries in the networks of Enron emails, political blogs, and the C. elegans neural connectome. In each example, a small subset of nodes have persistent asymmetries; these nodes send edges to one cluster, but receive edges from another cluster. Previous approaches would have assigned these asymmetric nodes to only one cluster, failing to identify their sending/receiving asymmetries. Regularization and "projection" are two steps of di-sim that are essential for spectral clustering algorithms to work in practice. The theoretical results show that these steps make the algorithm weakly consistent under the degree corrected Stochastic co-Blockmodel, a model that generalizes the Stochastic Blockmodel to allow for both (i) degree heterogeneity and (ii) the global asymmetries that we intend to detect. The theoretical results make no assumptions on the smallest degree nodes. Instead, the theorem requires that the average degree grows sufficiently fast and that the weak consistency only applies to the subset of the nodes with sufficiently large leverage scores. The results also apply to bipartite graphs.

preprint, 2014, arXiv

A note relating ridge regression and OLS p-values to preconditioned sparse penalized regression

When the design matrix has orthonormal columns, "soft thresholding" the ordinary least squares (OLS) solution produces the Lasso solution [Tibshirani, 1996]. If one uses the Puffer preconditioned Lasso [Jia and Rohe, 2012], then this result generalizes from orthonormal designs to full rank designs (Theorem 1). Theorem 2 refines the Puffer preconditioner to make the Lasso select the same model as removing the elements of the OLS solution with the largest p-values. Using a generalized Puffer preconditioner, Theorem 3 relates ridge regression to the preconditioned Lasso; this result is for the high dimensional setting, p > n. Where the standard Lasso is akin to forward selection [Efron et al., 2004], Theorems 1, 2, and 3 suggest that the preconditioned Lasso is more akin to backward elimination. These results hold for sparse penalties beyond l1; for a broad class of sparse and non-convex techniques (e.g. SCAD and MC+), the results hold for all local minima.
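The orthonormal-design fact in the first sentence can be illustrated with a short sketch. It assumes the Lasso objective is scaled as (1/2)||y - Xb||^2 + lam * ||b||_1, so that the threshold equals lam; under other scalings the threshold changes by a constant factor. The function names and the toy data are hypothetical, and the Puffer preconditioner itself is not implemented here.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, setting it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(X, y, lam):
    """Lasso solution when the columns of X are orthonormal.

    With X^T X = I, the OLS solution is X^T y, and the Lasso solution
    (under the assumed objective scaling) soft-thresholds each coordinate.
    X is a list of rows; y is a list of responses.
    """
    n, p = len(X), len(X[0])
    ols = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return [soft_threshold(b, lam) for b in ols]


# Toy design whose two columns are the first two standard basis vectors.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
y = [3.0, -0.5, 7.0]

print(lasso_orthonormal(X, y, lam=1.0))
```

Here the OLS coefficients are 3.0 and -0.5, so with `lam = 1.0` the first is shrunk and the second is set exactly to zero, which is the selection behaviour the abstract refers to.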

preprint, 2014, arXiv

Discussion of "Estimating the historical and future probabilities of large terrorist events" by Aaron Clauset and Ryan Woodard

Discussion of "Estimating the historical and future probabilities of large terrorist events" by Aaron Clauset and Ryan Woodard [arXiv:1209.0089].

preprint, 2013, arXiv

Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. (2012) and Amini et al. (2012) proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the "star shape" in the eigenvectors--a common feature of empirical networks--can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models.

preprint, 2013, arXiv

The blessing of transitivity in sparse and stochastic networks

The interaction between transitivity and sparsity, two common features in empirical networks, implies that there are local regions of large sparse networks that are dense. We call this the blessing of transitivity and it has consequences for both modeling and inference. Extant research suggests that statistical inference for the Stochastic Blockmodel is more difficult when the edges are sparse. However, this conclusion is confounded by the fact that the asymptotic limit in all of the previous studies is not merely sparse, but also non-transitive. To retain transitivity, the blocks cannot grow faster than the expected degree. Thus, in sparse models, the blocks must remain asymptotically small. Previous algorithmic research demonstrates that small "local" clusters are more amenable to computation, visualization, and interpretation when compared to "global" graph partitions. This paper provides the first statistical results that demonstrate how these small transitive clusters are also more amenable to statistical estimation. Theorem 2 shows that a "local" clustering algorithm can, with high probability, detect a transitive stochastic block of a fixed size (e.g. 30 nodes) embedded in a large graph. The only constraint on the ambient graph is that it is large and sparse--it could be generated at random or by an adversary--suggesting a theoretical explanation for the robust empirical performance of local clustering algorithms.

preprint, 2013, arXiv

The Highest Dimensional Stochastic Blockmodel with a Regularized Estimator

In the high dimensional Stochastic Blockmodel for a random network, the number of clusters (or blocks) K grows with the number of nodes N. Two previous studies have examined the statistical estimation performance of spectral clustering and the maximum likelihood estimator under the high dimensional model; neither of these results allows K to grow faster than N^{1/2}. We study a model where, ignoring log terms, K can grow proportionally to N. Since the number of clusters must be smaller than the number of nodes, no reasonable model allows K to grow faster; thus, our asymptotic results are the "highest" dimensional. To push the asymptotic setting to this extreme, we make additional assumptions that are motivated by empirical observations in physical anthropology (Dunbar, 1992), and an in-depth study of massive empirical networks (Leskovec et al., 2008). Furthermore, we develop a regularized maximum likelihood estimator that leverages these insights and we prove that, under certain conditions, the proportion of nodes that the regularized estimator misclusters converges to zero. This is the first paper to explicitly introduce and demonstrate the advantages of statistical regularization in a parametric form for network analysis.

preprint, 2012, arXiv

Preconditioning to comply with the Irrepresentable Condition

Preconditioning is a technique from numerical linear algebra that can accelerate algorithms to solve systems of equations. In this paper, we demonstrate how preconditioning can circumvent a stringent assumption for sign consistency in sparse linear regression. Given $X \in R^{n \times p}$ and $Y \in R^n$ that satisfy the standard regression equation, this paper demonstrates that even if the design matrix $X$ does not satisfy the irrepresentable condition for the Lasso, the design matrix $F X$ often does, where $F \in R^{n\times n}$ is a preconditioning matrix defined in this paper. By computing the Lasso on $(F X, F Y)$, instead of on $(X, Y)$, the necessary assumptions on $X$ become much less stringent. Our preconditioner $F$ ensures that the singular values of the design matrix are either zero or one. When $n\ge p$, the columns of $F X$ are orthogonal and the preconditioner always circumvents the stringent assumptions. When $p\ge n$, $F$ projects the design matrix onto the Stiefel manifold; the rows of $F X$ are orthogonal. We give both theoretical results and simulation results to show that, in the high dimensional case, the preconditioner helps to circumvent the stringent assumptions, improving the statistical performance of a broad class of model selection techniques in linear regression. Simulation results are particularly promising.
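The claim that the preconditioner sets every nonzero singular value to one can be checked with a short derivation. This is a sketch under stated assumptions: $X$ has full rank $r = \min(n, p)$, its thin singular value decomposition is $X = U D V^{\top}$, and the preconditioner takes the form $F = U D^{-1} U^{\top}$. The symbols $U$, $D$, $V$ and this form of $F$ do not appear in the abstract and are introduced here for illustration.

```latex
\begin{aligned}
X &= U D V^{\top}, \qquad U^{\top} U = I_r, \quad V^{\top} V = I_r, \quad D \succ 0 \text{ diagonal},\\
F &= U D^{-1} U^{\top},\\
F X &= U D^{-1} U^{\top} U D V^{\top} = U V^{\top},\\
n \ge p:\quad (F X)^{\top} (F X) &= V U^{\top} U V^{\top} = V V^{\top} = I_p
  \quad \text{(columns of } F X \text{ are orthonormal)},\\
p \ge n:\quad (F X)(F X)^{\top} &= U V^{\top} V U^{\top} = U U^{\top} = I_n
  \quad \text{(rows of } F X \text{ are orthonormal)}.
\end{aligned}
```

Since $F X = U V^{\top}$, its $r$ nonzero singular values all equal one, which agrees with the two cases described in the abstract.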

preprint, 2011, arXiv

Spectral clustering and the high-dimensional stochastic blockmodel

Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The stochastic blockmodel [Social Networks 5 (1983) 109--137] is a social network model with well-defined communities; each node is a member of one community. For a network generated from the Stochastic Blockmodel, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the stochastic blockmodel, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original.

preprint, 2010, arXiv

The Lasso under Heteroscedasticity

The performance of the Lasso is well understood under the assumptions of the standard linear model with homoscedastic noise. However, in several applications, the standard model does not describe the important features of the data. This paper examines how the Lasso performs on a non-standard model that is motivated by medical imaging applications. In these applications, the variance of the noise scales linearly with the expectation of the observation. Like all heteroscedastic models, the noise terms in this Poisson-like model are not independent of the design matrix. More specifically, this paper studies the sign consistency of the Lasso under a sparse Poisson-like model. In addition to studying sufficient conditions for the sign consistency of the Lasso estimate, this paper also gives necessary conditions for sign consistency. Both sets of conditions are comparable to results for the homoscedastic model, showing that when a measure of the signal to noise ratio is large, the Lasso performs well on both Poisson-like data and homoscedastic data. Simulations reveal that the Lasso performs equally well in terms of model selection performance on both Poisson-like data and homoscedastic data (with properly scaled noise variance), across a range of parameterizations. Taken as a whole, these results suggest that the Lasso is robust to the Poisson-like heteroscedastic noise.
