Source author record

Vince Lyzinski

Vince Lyzinski appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning; math.PR; math.OC; Methodology; Computation; math.CO; math.ST; Statistics Theory; Applications; Neurons and Cognition; Social and Information Networks; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Information Retrieval; math.NA; math.NT; Neural and Evolutionary Computing; Numerical Analysis

Catalog footprint

What is connected

25 works

18 topics

4 close collaborators


Preprint, 2022, arXiv

Adversarial contamination of networks in the setting of vertex nomination: a new trimming method

As graph data becomes more ubiquitous, the need for robust inferential graph algorithms to operate in these complex data domains is crucial. In many cases of interest, inference is further complicated by the presence of adversarial data contamination. The effect of the adversary is frequently to change the data distribution in ways that negatively affect statistical and algorithmic performance. We study this phenomenon in the context of vertex nomination, a semi-supervised information retrieval task for network data. Here, a common suite of methods relies on spectral graph embeddings, which have been shown to provide both good algorithmic performance and flexible settings in which regularization techniques can be implemented to help mitigate the effect of an adversary. Many current regularization methods rely on direct network trimming to effectively excise the adversarial contamination, although this direct trimming often gives rise to complicated dependency structures in the resulting graph. We propose a new trimming method that operates in model space which can address both block structure contamination and white noise contamination (contamination whose distribution is unknown). This model trimming is more amenable to theoretical analysis while also demonstrating superior performance in a number of simulations, compared to direct trimming.

Preprint, 2022, arXiv

Vertex nomination between graphs via spectral embedding and quadratic programming

Given a network and a subset of interesting vertices whose identities are only partially known, the vertex nomination problem seeks to rank the remaining vertices in such a way that the interesting vertices are ranked at the top of the list. An important variant of this problem is vertex nomination in the multi-graph setting. Given two graphs $G_1, G_2$ with common vertices and a vertex of interest $x \in G_1$, we wish to rank the vertices of $G_2$ such that the vertices most similar to $x$ are ranked at the top of the list. The current paper addresses this problem and proposes a method that first applies adjacency spectral graph embedding to embed the graphs into a common Euclidean space, and then solves a penalized linear assignment problem to obtain the nomination lists. Since the spectral embeddings of the graphs are unique only up to orthogonal transformations, we present two approaches to eliminate this potential non-identifiability. One approach is based on orthogonal Procrustes and is applicable when there are enough vertices with known correspondence between the two graphs. Another approach uses adaptive point set registration and is applicable when there are few or no vertices with known correspondence. We show that our nomination scheme leads to accurate nomination under a generative model for pairs of random graphs that are approximately low-rank and possibly with pairwise edge correlations. We illustrate our algorithm's performance through simulation studies on synthetic data as well as analysis of a high-school friendship network and analysis of transition rates between web pages on the Bing search engine.

Preprint, 2020, arXiv

Alignment Strength and Correlation for Graphs

When two graphs have a correlated Bernoulli distribution, we prove that the alignment strength of their natural bijection strongly converges to a novel measure of graph correlation $\rho_T$ that neatly combines intergraph with intragraph distribution parameters. Within broad families of the random graph parameter settings, we illustrate that exact graph matching runtime and also matchability are both functions of $\rho_T$, with thresholding behavior starkly illustrated in matchability.

Preprint, 2020, arXiv

Maximum Likelihood Estimation and Graph Matching in Errorfully Observed Networks

Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and show that in this model framework, the solution to the graph matching problem is a maximum likelihood estimator. Necessary and sufficient conditions for consistency of this MLE are presented, as well as a relaxed notion of consistency in which a negligible fraction of the vertices need not be matched correctly. The results are used to study matchability in several families of random graphs, including edge independent models, random regular graphs and small-world networks. We also use these results to introduce measures of matching feasibility, and experimentally validate the results on simulated and real-world networks.
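The objective named in this abstract, the number of induced edge disagreements under a vertex correspondence, can be sketched in a few lines. The following is an illustration only, not the authors' code: it assumes undirected simple graphs given as symmetric 0/1 adjacency matrices (nested lists), and the function names `edge_disagreements` and `brute_force_match` are hypothetical. The exhaustive search is factorial in the number of vertices, so it is usable only on toy inputs.

```python
from itertools import permutations


def edge_disagreements(A, B, perm):
    """Count vertex pairs {i, j} whose adjacency in A differs from the
    adjacency of {perm[i], perm[j]} in B (undirected, no self-loops)."""
    n = len(A)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if A[i][j] != B[perm[i]][perm[j]]
    )


def brute_force_match(A, B):
    """Return a permutation minimizing edge disagreements by trying all n!
    correspondences. Illustrative only; not a practical matching algorithm."""
    n = len(A)
    return min(permutations(range(n)), key=lambda p: edge_disagreements(A, B, p))


# Toy example: two 3-vertex paths whose centers carry different labels.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]  # path 0 - 1 - 2, center is vertex 1
B = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]  # path 1 - 0 - 2, center is vertex 0

best = brute_force_match(A, B)
print(best, edge_disagreements(A, B, best))
```

In the toy example the identity correspondence yields two disagreements, while any correspondence sending the center of the first path to the center of the second yields none.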

Preprint, 2020, arXiv

Numerical tolerance for spectral decompositions of random matrices

We precisely quantify the impact of statistical error in the quality of a numerical approximation to a random matrix eigendecomposition, and under mild conditions, we use this to introduce an optimal numerical tolerance for residual error in spectral decompositions of random matrices. We demonstrate that terminating an eigendecomposition algorithm when the numerical error and statistical error are of the same order results in computational savings with no loss of accuracy. We also repair a flaw in a ubiquitous termination condition, one in wide employ in several computational linear algebra implementations. We illustrate the practical consequences of our stopping criterion with an analysis of simulated and real networks. Our theoretical results and real-data examples establish that the tradeoff between statistical and numerical error is of significant import for data science.

Preprint, 2020, arXiv

Vertex Nomination, Consistent Estimation, and Adversarial Modification

Given a pair of graphs $G_1$ and $G_2$ and a vertex set of interest in $G_1$, the vertex nomination (VN) problem seeks to find the corresponding vertices of interest in $G_2$ (if they exist) and produce a rank list of the vertices in $G_2$, with the corresponding vertices of interest in $G_2$ concentrating, ideally, at the top of the rank list. In this paper, we define and derive the analogue of Bayes optimality for VN with multiple vertices of interest, and we define the notion of maximal consistency classes in vertex nomination. This theory forms the foundation for a novel VN adversarial contamination model, and we demonstrate with real and simulated data that there are VN schemes that perform effectively in the uncontaminated setting, and adversarial network contamination adversely impacts the performance of our VN scheme. We further define a network regularization method for mitigating the impact of the adversarial contamination, and we demonstrate the effectiveness of regularization in both real and synthetic data.

Preprint, 2020, arXiv

Vertex nomination: The canonical sampling and the extended spectral nomination schemes

Suppose that one particular block in a stochastic block model is of interest, but block labels are only observed for a few of the vertices in the network. Utilizing a graph realized from the model and the observed block labels, the vertex nomination task is to order the vertices with unobserved block labels into a ranked nomination list with the goal of having an abundance of interesting vertices near the top of the list. There are vertex nomination schemes in the literature, including the optimally precise canonical nomination scheme $\mathcal{L}^C$ and the consistent spectral partitioning nomination scheme $\mathcal{L}^P$. While the canonical nomination scheme $\mathcal{L}^C$ is provably optimally precise, it is computationally intractable, being impractical to implement even on modestly sized graphs. With this in mind, an approximation of the canonical scheme, denoted the canonical sampling nomination scheme $\mathcal{L}^{CS}$, is introduced; $\mathcal{L}^{CS}$ relies on a scalable, Markov chain Monte Carlo-based approximation of $\mathcal{L}^{C}$, and converges to $\mathcal{L}^{C}$ as the amount of sampling goes to infinity. The spectral partitioning nomination scheme is also extended to the extended spectral partitioning nomination scheme, $\mathcal{L}^{EP}$, which introduces a novel semisupervised clustering framework to improve upon the precision of $\mathcal{L}^P$. Real-data and simulation experiments are employed to illustrate the precision of these vertex nomination schemes, as well as their empirical computational complexity. Keywords: vertex nomination, Markov chain Monte Carlo, spectral partitioning, Mclust. MSC[2010]: 60J22, 65C40, 62H30, 62H25.

Preprint, 2016, arXiv

Community Detection and Classification in Hierarchical Stochastic Blockmodels

We propose a robust, scalable, integrated methodology for community detection and community comparison in graphs. In our procedure, we first embed a graph into an appropriate Euclidean space to obtain a low-dimensional representation, and then cluster the vertices into communities. We next employ nonparametric graph inference techniques to identify structural similarity among these communities. These two steps are then applied recursively on the communities, allowing us to detect more fine-grained structure. We describe a hierarchical stochastic blockmodel (namely, a stochastic blockmodel with a natural hierarchical structure) and establish conditions under which our algorithm yields consistent estimates of model parameters and motifs, which we define to be stochastically similar groups of subgraphs. Finally, we demonstrate the effectiveness of our algorithm in both simulated and real data. Specifically, we address the problem of locating similar subcommunities in a partially reconstructed Drosophila connectome and in the social network Friendster.

Preprint, 2016, arXiv

Fast Embedding for JOFC Using the Raw Stress Criterion

The Joint Optimization of Fidelity and Commensurability (JOFC) manifold matching methodology embeds an omnibus dissimilarity matrix consisting of multiple dissimilarities on the same set of objects. One approach to this embedding optimizes the preservation of fidelity to each individual dissimilarity matrix together with commensurability of each given observation across modalities via iterative majorization of a raw stress error criterion by successive Guttman transforms. In this paper, we exploit the special structure inherent to JOFC to exactly and efficiently compute the successive Guttman transforms, and as a result we are able to greatly speed up the JOFC procedure for both in-sample and out-of-sample embedding. We demonstrate the scalability of our implementation on both real and simulated data examples.

Preprint, 2016, arXiv

On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching

Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM), including a vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove that maximum likelihood (ML)-based vertex nomination is consistent, in the sense that the performance of the ML-based scheme asymptotically matches that of the Bayes optimal scheme. We prove theorems of this form both when model parameters are known and unknown. Additionally, we introduce and prove consistency of a related, more scalable restricted-focus ML vertex nomination scheme. Finally, we incorporate vertex and edge features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.

Preprint, 2016, arXiv

Scalable Out-of-Sample Extension of Graph Embeddings Using Deep Neural Networks

Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.

Preprint, 2016, arXiv

Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs

Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks (PageRank, eigensolving, and non-negative matrix factorization) and show that our SEM implementations significantly advance the state of the art.

Preprint, 2015, arXiv

A Joint Graph Inference Case Study: the C.elegans Chemical and Electrical Connectomes

We investigate joint graph inference for the chemical and electrical connectomes of the Caenorhabditis elegans roundworm. The C. elegans connectomes consist of 253 non-isolated neurons with known functional attributes, and there are two types of synaptic connectomes, resulting in a pair of graphs. We formulate our joint graph inference from the perspectives of seeded graph matching and joint vertex classification. Our results suggest that connectomic inference should proceed in the joint space of the two connectomes, which has significant neuroscientific implications.

Preprint, 2015, arXiv

A nonparametric two-sample hypothesis testing problem for random dot product graphs

We consider the problem of testing whether two finite-dimensional random dot product graphs have generating latent positions that are independently drawn from the same distribution, or distributions that are related via scaling or projection. We propose a test statistic that is a kernel-based function of the adjacency spectral embedding for each graph. We obtain a limiting distribution for our test statistic under the null and we show that our test procedure is consistent across a broad range of alternatives.

Preprint, 2015, arXiv

A semiparametric two-sample hypothesis testing problem for random dot product graphs

Two-sample hypothesis testing for random graphs arises naturally in neuroscience, social networks, and machine learning. In this paper, we consider a semiparametric problem of two-sample hypothesis testing for a class of latent position random graphs. We formulate a notion of consistency in this context and propose a valid test for the hypothesis that two finite-dimensional random dot product graphs on a common vertex set have the same generating latent positions or have generating latent positions that are scaled or diagonal transformations of one another. Our test statistic is a function of a spectral decomposition of the adjacency matrix for each graph and our test procedure is consistent across a broad range of alternatives. We apply our test procedure to real biological data: in a test-retest data set of neural connectome graphs, we are able to distinguish between scans from different subjects; and in the C. elegans connectome, we are able to distinguish between chemical and electrical networks. The latter example is a concrete demonstration that our test can have power even for small sample sizes. We conclude by discussing the relationship between our test procedure and generalized likelihood ratio tests.

Preprint, 2015, arXiv

Graph Matching: Relax at Your Own Risk

Graph matching, that is, aligning a pair of graphs to minimize their edge disagreements, has received widespread attention from both theoretical and applied communities over the past several decades, including combinatorics, computer vision, and connectomics. Its attention can be partially attributed to its computational difficulty. Although many heuristics have previously been proposed in the literature to approximately solve graph matching, very few have any theoretical support for their performance. A common technique is to relax the discrete problem to a continuous problem, therefore enabling practitioners to bring gradient-descent-type algorithms to bear. We prove that an indefinite relaxation (when solved exactly) almost always discovers the optimal permutation, while a common convex relaxation almost always fails to discover the optimal permutation. These theoretical results suggest that initializing the indefinite algorithm with the convex optimum might yield improved practical performance. Indeed, experimental results illuminate and corroborate these theoretical findings, demonstrating that excellent results are achieved in both benchmark and real data problems by amalgamating the two approaches.

Preprint, 2015, arXiv

Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding

Vertex clustering in a stochastic blockmodel graph has wide applicability and has been the subject of extensive research. In this paper, we provide a short proof that the adjacency spectral embedding can be used to obtain perfect clustering for the stochastic blockmodel and the degree-corrected stochastic blockmodel. We also show an analogous result for the more general random dot product graph model.

Preprint, 2015, arXiv

Spectral Clustering for Divide-and-Conquer Graph Matching

We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.

Preprint, 2015, arXiv

Strong Stationary Duality for Diffusion Processes

We develop the theory of strong stationary duality for diffusion processes on compact intervals. We analytically derive the generator and boundary behavior of the dual process and recover a central tenet of the classical Markov chain theory in the diffusion setting by linking the separation distance in the primal diffusion to the absorption time in the dual diffusion. We also exhibit our strong stationary dual as the natural limiting process of the strong stationary dual sequence of a well chosen sequence of approximating birth-and-death Markov chains, allowing for simultaneous numerical simulations of our primal and dual diffusion processes. Lastly, we show how our new definition of diffusion duality allows the spectral theory of cutoff phenomena to extend naturally from birth-and-death Markov chains to the present diffusion context.

Preprint, 2014, arXiv

Fast Approximate Quadratic Programming for Large (Brain) Graph Matching

Quadratic assignment problems (QAPs) arise in a wide variety of domains, ranging from operations research to graph theory to computer vision to neuroscience. In the age of big data, graph-valued data is becoming more prominent, and with it, a desire to run algorithms on ever larger graphs. Because QAP is NP-hard, exact algorithms are intractable. Approximate algorithms necessarily employ an accuracy/efficiency trade-off. We developed a fast approximate quadratic assignment algorithm (FAQ). FAQ finds a local optimum in (worst case) time cubic in the number of vertices, similar to other approximate QAP algorithms. We demonstrate empirically that our algorithm is faster and achieves a lower objective value on over 80% of the suite of QAP benchmarks, compared with the previous state-of-the-art. Applying the algorithms to our motivating example, matching C. elegans connectomes (brain-graphs), we find that FAQ achieves the optimal performance in record time, whereas none of the others even find the optimum.

Preprint, 2014, arXiv

Seeded graph matching for correlated Erdős-Rényi graphs

Graph matching is an important problem in machine learning and pattern recognition. Herein, we present theoretical and practical results on the consistency of graph matching for estimating a latent alignment function between the vertex sets of two graphs, as well as subsequent algorithmic implications when the latent alignment is partially observed. In the correlated Erdős-Rényi graph setting, we prove that graph matching provides a strongly consistent estimate of the latent alignment in the presence of even modest correlation. We then investigate a tractable, restricted-focus version of graph matching, which is only concerned with adjacency involving vertices in a partial observation of the latent alignment; we prove that a logarithmic number of vertices whose alignment is known is sufficient for this restricted-focus version of graph matching to yield a strongly consistent estimate of the latent alignment of the remaining vertices. We show how Frank-Wolfe methodology for approximate graph matching, when there is a partially observed latent alignment, inherently incorporates this restricted focus graph matching. Lastly, we illustrate the relationship between seeded graph matching and restricted-focus graph matching by means of an illuminating example from human connectomics.

Preprint, 2013, arXiv

A central limit theorem for scaled eigenvectors of random dot product graphs

We prove a central limit theorem for the components of the largest eigenvectors of the adjacency matrix of a finite-dimensional random dot product graph whose true latent positions are unknown. In particular, we follow the methodology outlined in Sussman et al. (2012) to construct consistent estimates for the latent positions, and we show that the appropriately scaled differences between the estimated and true latent positions converge to a mixture of Gaussian random variables. As a corollary, we obtain a central limit theorem for the first eigenvector of the adjacency matrix of an Erdős-Rényi random graph.

Preprint, 2013, arXiv

Logarithmic Representability of Integers as k-Sums

A set $A = A_{k,n} \subseteq [n] \cup \{0\}$ is said to be an additive $k$-basis if each element in $\{0, 1, \ldots, kn\}$ can be written as a $k$-sum of elements of $A$ in at least one way. Seeking multiple representations as $k$-sums, and given any function $\phi(n)$ with $\lim_{n \to \infty} \phi(n) = \infty$, we say that $A$ is a truncated $\phi(n)$-representative $k$-basis for $[n]$ if for each $j \in [\alpha n, (k - \alpha) n]$ the number of ways that $j$ can be represented as a $k$-sum of elements of $A_{k,n}$ is $\Theta(\phi(n))$. In this paper, we follow tradition and focus on the case $\phi(n) = \log n$, and show that a randomly selected set in an appropriate probability space is a truncated log-representative basis with probability that tends to one as $n$ tends to infinity. This result is a finite version of a result proved by Erdős (1956) and extended by Erdős and Tetali (1990).
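The definitions in this abstract can be made concrete with a small counting sketch. It is an illustration under a stated assumption, not taken from the paper: representations are counted as unordered $k$-sums with repetition allowed (multisets of size $k$), which the abstract does not specify. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def k_sum_representation_counts(A, k):
    """Map each integer j to the number of multisets of k elements of A
    summing to j. Assumption: unordered k-sums, repetition allowed."""
    counts = Counter()
    for combo in combinations_with_replacement(sorted(set(A)), k):
        counts[sum(combo)] += 1
    return counts


def is_additive_k_basis(A, k, n):
    """Check that every j in {0, 1, ..., k*n} has at least one k-sum
    representation by elements of A."""
    counts = k_sum_representation_counts(A, k)
    return all(counts[j] > 0 for j in range(k * n + 1))


counts = k_sum_representation_counts({0, 1, 2}, 2)
print(dict(counts))                        # representation counts for sums 0..4
print(is_additive_k_basis({0, 1, 2}, 2, 2))
print(is_additive_k_basis({0, 2}, 2, 2))
```

For the set {0, 1, 2} with k = 2, the sum 2 has two representations (0 + 2 and 1 + 1) and every other sum in {0, ..., 4} has exactly one, so the set is an additive 2-basis; dropping the element 1 leaves the odd sums unrepresented.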

Preprint, 2012, arXiv

Hitting times and interlacing eigenvalues: a stochastic approach using intertwinings

We develop a systematic matrix-analytic approach, based on intertwinings of Markov semigroups, for proving theorems about hitting-time distributions for finite-state Markov chains, an approach that (sometimes) deepens understanding of the theorems by providing corresponding sample-path-by-sample-path stochastic constructions. We employ our approach to give new proofs and constructions for two theorems due to Mark Brown, theorems giving two quite different representations of hitting-time distributions for finite-state Markov chains started in stationarity. The proof, and corresponding construction, for one of the two theorems elucidates an intriguing connection between hitting-time distributions and the interlacing eigenvalues theorem for bordered symmetric matrices.

Preprint, 2012, arXiv

Sharp Threshold Asymptotics for the Emergence of Additive Bases

A subset $A$ of $\{0, 1, \ldots, n\}$ is said to be a 2-additive basis for $\{1, 2, \ldots, n\}$ if each $j \in \{1, 2, \ldots, n\}$ can be written as $j = x + y$ with $x, y \in A$, $x \le y$. If we pick each integer in $\{0, 1, \ldots, n\}$ independently with probability $p = p_n$ tending to 0, thus getting a random set $A$, what is the probability that we have obtained a 2-additive basis? We address this question when the target sum-set is $[(1 - \alpha) n, (1 + \alpha) n]$ (or equivalently $[\alpha n, (2 - \alpha) n]$) for some $0 < \alpha < 1$. Under either model, the Stein-Chen method of Poisson approximation is used, in conjunction with Janson's inequalities, to tease out a very sharp threshold for the emergence of a 2-additive basis. Generalizations to $k$-additive bases are then given.
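The random experiment described in this abstract can be simulated directly. The sketch below is an illustration, not the paper's method: it checks the untruncated definition (targets $\{1, \ldots, n\}$) rather than the truncated sum-sets the paper analyzes, and the function names, trial count, and seed are hypothetical choices.

```python
import random


def is_two_additive_basis(A, n):
    """True if every j in {1, ..., n} equals x + y for some x, y in A with x <= y."""
    elements = sorted(set(A))
    sums = {x + y for i, x in enumerate(elements) for y in elements[i:]}
    return all(j in sums for j in range(1, n + 1))


def random_subset(n, p, rng):
    """Include each integer in {0, ..., n} independently with probability p."""
    return {i for i in range(n + 1) if rng.random() < p}


def estimate_basis_probability(n, p, trials=200, seed=0):
    """Monte Carlo estimate of P(random set is a 2-additive basis for {1, ..., n})."""
    rng = random.Random(seed)
    hits = sum(is_two_additive_basis(random_subset(n, p, rng), n) for _ in range(trials))
    return hits / trials


print(is_two_additive_basis({0, 1, 2, 3}, 6))  # sums cover 1..6
print(is_two_additive_basis({0, 1, 3}, 6))     # 5 cannot be formed
print(estimate_basis_probability(50, 0.6))
```

Sweeping `p` for a fixed `n` gives a rough empirical picture of how the basis probability moves from near 0 to near 1, which is the kind of transition the threshold result quantifies.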
