Researcher profile

Minh Tang

Minh Tang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2022arXiv

Adversarial contamination of networks in the setting of vertex nomination: a new trimming method

As graph data becomes more ubiquitous, the need for robust inferential graph algorithms to operate in these complex data domains is crucial. In many cases of interest, inference is further complicated by the presence of adversarial data contamination. The effect of the adversary is frequently to change the data distribution in ways that negatively affect statistical and algorithmic performance. We study this phenomenon in the context of vertex nomination, a semi-supervised information retrieval task for network data. Here, a common suite of methods relies on spectral graph embeddings, which have been shown to provide both good algorithmic performance and flexible settings in which regularization techniques can be implemented to help mitigate the effect of an adversary. Many current regularization methods rely on direct network trimming to effectively excise the adversarial contamination, although this direct trimming often gives rise to complicated dependency structures in the resulting graph. We propose a new trimming method that operates in model space which can address both block structure contamination and white noise contamination (contamination whose distribution is unknown). This model trimming is more amenable to theoretical analysis while also demonstrating superior performance in a number of simulations, compared to direct trimming.

preprint2022arXiv

Hypothesis Testing for Equality of Latent Positions in Random Graphs

We consider the hypothesis testing problem that two vertices $i$ and $j$ of a generalized random dot product graph have the same latent positions, possibly up to scaling. Special cases of this hypothesis test include testing whether two vertices in a stochastic block model or degree-corrected stochastic block model graph have the same block membership vectors, or testing whether two vertices in a popularity adjusted block model have the same community assignment. We propose several test statistics based on the empirical Mahalanobis distances between the $i$th and $j$th rows of either the adjacency or the normalized Laplacian spectral embedding of the graph. We show that, under mild conditions, these test statistics have limiting chi-square distributions under both the null and local alternative hypothesis, and we derived explicit expressions for the non-centrality parameters under the local alternative. Using these limit results, we address the model selection problems including choosing between the standard stochastic block model and its degree-corrected variant, and choosing between the ER model and stochastic block model. The effectiveness of our proposed tests are illustrated via both simulation studies and real data applications.

preprint2022arXiv

Popularity Adjusted Block Models are Generalized Random Dot Product Graphs

We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.

preprint2022arXiv

Vertex nomination between graphs via spectral embedding and quadratic programming

Given a network and a subset of interesting vertices whose identities are only partially known, the vertex nomination problem seeks to rank the remaining vertices in such a way that the interesting vertices are ranked at the top of the list. An important variant of this problem is vertex nomination in the multi-graphs setting. Given two graphs $G_1, G_2$ with common vertices and a vertex of interest $x \in G_1$, we wish to rank the vertices of $G_2$ such that the vertices most similar to $x$ are ranked at the top of the list. The current paper addresses this problem and proposes a method that first applies adjacency spectral graph embedding to embed the graphs into a common Euclidean space, and then solves a penalized linear assignment problem to obtain the nomination lists. Since the spectral embedding of the graphs are only unique up to orthogonal transformations, we present two approaches to eliminate this potential non-identifiability. One approach is based on orthogonal Procrustes and is applicable when there are enough vertices with known correspondence between the two graphs. Another approach uses adaptive point set registration and is applicable when there are few or no vertices with known correspondence. We show that our nomination scheme leads to accurate nomination under a generative model for pairs of random graphs that are approximately low-rank and possibly with pairwise edge correlations. We illustrate our algorithm's performance through simulation studies on synthetic data as well as analysis of a high-school friendship network and analysis of transition rates between web pages on the Bing search engine.

preprint2021arXiv

Supervised Dimensionality Reduction for Big Data

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees.We introduce an approach, XOX, to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest ver-sion, "Linear Optimal Low-rank" projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations in the XOX framework lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of >150 million features, and several genomics datasets with>500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.

preprint2020arXiv

Numerical tolerance for spectral decompositions of random matrices

We precisely quantify the impact of statistical error in the quality of a numerical approximation to a random matrix eigendecomposition, and under mild conditions, we use this to introduce an optimal numerical tolerance for residual error in spectral decompositions of random matrices. We demonstrate that terminating an eigendecomposition algorithm when the numerical error and statistical error are of the same order results in computational savings with no loss of accuracy. We also repair a flaw in a ubiquitous termination condition, one in wide employ in several computational linear algebra implementations. We illustrate the practical consequences of our stopping criterion with an analysis of simulated and real networks. Our theoretical results and real-data examples establish that the tradeoff between statistical and numerical error is of significant import for data science.

preprint2020arXiv

On estimation and inference in latent structure random graphs

We define a latent structure model (LSM) random graph as a random dot product graph (RDPG) in which the latent position distribution incorporates both probabilistic and geometric constraints, delineated by a family of underlying distributions on some fixed Euclidean space, and a structural support submanifold from which the latent positions for the graph are drawn. For a one-dimensional latent structure model with known structural support, we show how spectral estimates of the latent positions of an RDPG can be used for efficient estimation of the paramaters of the LSM. We describe how to estimate or learn the structural support in cases where it is unknown, with an illustrative focus on graphs with latent positions along the Hardy-Weinberg curve. Finally, we use the latent structure model formulation to test bilateral homology in the Drosophila connectome.

preprint2020arXiv

On Two Distinct Sources of Nonidentifiability in Latent Position Random Graph Models

Two separate and distinct sources of nonidentifiability arise naturally in the context of latent position random graph models, though neither are unique to this setting. In this paper we define and examine these two nonidentifiabilities, dubbed subspace nonidentifiability and model-based nonidentifiability, in the context of random graph inference. We give examples where each type of nonidentifiability comes into play, and we show how in certain settings one need worry about one or the other type of nonidentifiability. Then, we characterize the limit for model-based nonidentifiability both with and without subspace nonidentifiability. We further obtain additional limiting results for covariances and $U$-statistics of stochastic block models and generalized random dot product graphs.

preprint2020arXiv

Two-sample Testing on Latent Distance Graphs With Unknown Link Functions

We propose a valid and consistent test for the hypothesis that two latent distance random graphs on the same vertex set have the same generating latent positions, up to some unidentifiable similarity transformations. Our test statistic is based on first estimating the edge probabilities matrices by truncating the singular value decompositions of the averaged adjacency matrices in each population and then computing a Spearman rank correlation coefficient between these estimates. Experimental results on simulated data indicate that the test procedure has power even when there is only one sample from each population, provided that the number of vertices is not too small. Application on a dataset of neural connectome graphs showed that we can distinguish between scans from different age groups while application on a dataset of epileptogenic recordings showed that we can discriminate between seizure and non-seizure events.