Source author record

Tamara G. Kolda

Tamara G. Kolda appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Social and Information Networks; math.NA; physics.soc-ph; Numerical Analysis; Machine Learning; Distributed, Parallel, and Cluster Computing; physics.data-an; Data Structures and Algorithms; Discrete Mathematics; Mathematical Software; Artificial Intelligence; Computation; Computational Engineering, Finance, and Science; Methodology

Catalog footprint

What is connected

29 works

14 topics

4 close collaborators

preprint · 2022 · arXiv

Practical Leverage-Based Sampling for Low-Rank Tensor Decomposition

The low-rank canonical polyadic tensor decomposition is useful in data analysis and can be computed by solving a sequence of overdetermined least squares subproblems. Motivated by consideration of sparse tensors, we propose sketching each subproblem using leverage scores to select a subset of the rows, with probabilistic guarantees on the solution accuracy. We randomly sample rows proportional to leverage score upper bounds that can be efficiently computed using the special Khatri-Rao subproblem structure inherent in tensor decomposition. Crucially, for a $(d+1)$-way tensor, the number of rows in the sketched system is $O(r^d/ε)$ for a decomposition of rank $r$ and $ε$-accuracy in the least squares solve, independent of both the size and the number of nonzeros in the tensor. Along the way, we provide a practical solution to the generic matrix sketching problem of sampling overabundance for high-leverage-score rows, proposing to include such rows deterministically and combine repeated samples in the sketched system; we conjecture that this can lead to improved theoretical bounds. Numerical results on real-world large-scale tensors show the method is significantly faster than deterministic methods at nearly the same level of accuracy.
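
The sampling idea in this abstract can be written out with the standard leverage-score definitions. This is a sketch under assumed notation, not the paper's exact statement: the symbols $A$, $Q$, $Z$, $A_k$, $\ell_i$ and $p_i$ are introduced here for illustration only.

```latex
% Leverage score of row i of a tall matrix A, where Q is an orthonormal
% basis for the column space of A
\ell_i(A) = \lVert Q(i,:) \rVert_2^2, \qquad \sum_i \ell_i(A) = \operatorname{rank}(A).

% Khatri-Rao structure: for Z = A_d \odot \cdots \odot A_1 and a row index
% i = (i_1, \dots, i_d), the column space of Z lies inside that of the
% Kronecker product of the factors, so
\ell_i(Z) \le \prod_{k=1}^{d} \ell_{i_k}(A_k).

% Sampling row i with probability proportional to this upper bound
p_i = \frac{\prod_{k=1}^{d} \ell_{i_k}(A_k)}{\prod_{k=1}^{d} \operatorname{rank}(A_k)}.
```

When every factor has full column rank $r$, the bounds sum to $r^d$, which is consistent with the $r^d$ factor in the row count quoted above. The bound only needs the leverage scores of the individual factor matrices, so the full Khatri-Rao product never has to be formed.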

preprint · 2022 · arXiv

Randomized Algorithms for Scientific Computing (RASC)

Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.

preprint · 2022 · arXiv

Tensor Moments of Gaussian Mixture Models: Theory and Applications

Gaussian mixture models (GMMs) are fundamental tools in statistical and data sciences. We study the moments of multivariate Gaussians and GMMs. The $d$-th moment of an $n$-dimensional random variable is a symmetric $d$-way tensor of size $n^d$, so working with moments naively is assumed to be prohibitively expensive for $d>2$ and larger values of $n$. In this work, we develop theory and numerical methods for implicit computations with moment tensors of GMMs, reducing the computational and storage costs to $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$, respectively, for general covariance matrices, and to $\mathcal{O}(n)$ and $\mathcal{O}(n)$, respectively, for diagonal ones. We derive concise analytic expressions for the moments in terms of symmetrized tensor products, relying on the correspondence between symmetric tensors and homogeneous polynomials, and combinatorial identities involving Bell polynomials. The primary application of this theory is to estimating GMM parameters (means and covariances) from a set of observations, when formulated as a moment-matching optimization problem. If there is a known and common covariance matrix, we also show it is possible to debias the data observations, in which case the problem of estimating the unknown means reduces to symmetric CP tensor decomposition. Numerical results validate and illustrate the numerical efficiency of our approaches. This work potentially opens the door to the competitiveness of the method of moments as compared to expectation maximization methods for parameter estimation of GMMs.

preprint · 2019 · arXiv

Generalized Canonical Polyadic Tensor Decomposition

Tensor decomposition is a fundamental unsupervised machine learning method in data science, with applications including network analysis and sensor data processing. This work develops a generalized canonical polyadic (GCP) low-rank tensor decomposition that allows other loss functions besides squared error. For instance, we can use logistic loss or Kullback-Leibler divergence, enabling tensor decomposition for binary or count data. We present a variety of statistically motivated loss functions for various scenarios. We provide a generalized framework for computing gradients and handling missing data that enables the use of standard optimization methods for fitting the model. We demonstrate the flexibility of GCP on several real-world examples including interactions in a social network, neural activity in a mouse, and monthly rainfall measurements in India.
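
To illustrate what swapping the loss function means in practice, here is a minimal Python sketch of three elementwise losses of the kind the abstract mentions (squared error, logistic loss for binary data, and a Kullback-Leibler style loss for counts), summed over observed entries only. The function names, the dictionary-based tensor storage, and the `eps` safeguard are assumptions made for this example and are not taken from the paper or any GCP software.

```python
import math

# Elementwise losses f(x, m) for a data entry x and a model entry m.
# All names and storage choices below are illustrative assumptions.

def gaussian_loss(x, m):
    """Squared error, the usual CP objective."""
    return (x - m) ** 2

def bernoulli_logit_loss(x, m):
    """Logistic loss for binary x in {0, 1}; m is interpreted as log-odds."""
    # log(1 + exp(m)) evaluated in a numerically stable way
    softplus = max(m, 0.0) + math.log1p(math.exp(-abs(m)))
    return softplus - x * m

def poisson_loss(x, m, eps=1e-10):
    """KL-style loss for a count x >= 0 and a model rate m > 0 (constants dropped)."""
    return m - x * math.log(m + eps)

def total_loss(loss, data, model, observed=None):
    """Sum an elementwise loss over observed entries; missing entries are skipped."""
    total = 0.0
    for idx, x in data.items():
        if observed is None or idx in observed:
            total += loss(x, model[idx])
    return total

# Tiny 3-way example stored as {index tuple: value}
data = {(0, 0, 0): 1.0, (0, 1, 0): 0.0, (1, 1, 1): 3.0}
model = {(0, 0, 0): 0.5, (0, 1, 0): 0.2, (1, 1, 1): 2.5}
observed = {(0, 0, 0), (1, 1, 1)}  # (0, 1, 0) is treated as missing
print(total_loss(gaussian_loss, data, model, observed))
```

Because the objective is just a sum of elementwise terms over the observed indices, changing the data type only requires changing `loss`; the rest of the fitting procedure can stay the same.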

preprint · 2019 · arXiv

TuckerMPI: A Parallel C++/MPI Software Package for Large-scale Data Compression via the Tucker Tensor Decomposition

Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression rates between 100-200,000$\times$ which equates to 99-99.999% compression (depending on the desired accuracy) in substantially less time than it would take to even read the same dataset from a parallel filesystem. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics.

preprint · 2016 · arXiv

Parallel Tensor Compression for Large-Scale Scientific Data

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
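
The storage figure quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes 8 bytes per double-precision entry and binary terabytes; the helper name `dense_tensor_bytes` is made up for the example.

```python
def dense_tensor_bytes(dims, bytes_per_entry=8):
    """Storage for a dense tensor with the given mode sizes (8 bytes = double precision)."""
    total = bytes_per_entry
    for d in dims:
        total *= d
    return total

# 512 x 512 x 512 spatial grid, 64 variables per grid point, 128 time steps
dims = (512, 512, 512, 64, 128)
size_bytes = dense_tensor_bytes(dims)
size_tib = size_bytes / 2**40               # binary terabytes
compressed_gib = size_bytes / 5000 / 2**30  # at the quoted best-case ratio of 5000

print(size_bytes, size_tib, compressed_gib)
```

The five mode sizes multiply out to 2^40 entries, so at 8 bytes each the dense tensor occupies 2^43 bytes, which is the 8 TB in the abstract. At a compression ratio of 5000 the same data would need roughly 1.6 GiB.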

preprint · 2015 · arXiv

Symmetric Orthogonal Tensor Decomposition is Trivial

We consider the problem of decomposing a real-valued symmetric tensor as the sum of outer products of real-valued, pairwise orthogonal vectors. Such decompositions do not generally exist, but we show that some symmetric tensor decomposition problems can be converted to orthogonal problems following the whitening procedure proposed by Anandkumar et al. (2012). If an orthogonal decomposition of an $m$-way $n$-dimensional symmetric tensor exists, we propose a novel method to compute it that reduces to an $n \times n$ symmetric matrix eigenproblem. We provide numerical results demonstrating the effectiveness of the method.

preprint · 2014 · arXiv

Accelerating Community Detection by Using K-core Subgraphs

Community detection is expensive, and the cost generally depends at least linearly on the number of vertices in the graph. We propose working with a reduced graph that has many fewer nodes but nonetheless captures key community structure. The K-core of a graph is the largest subgraph within which each node has at least K connections. We propose a framework that accelerates community detection by applying an expensive algorithm (modularity optimization, the Louvain method, spectral clustering, etc.) to the K-core and then using an inexpensive heuristic (such as local modularity maximization) to infer community labels for the remaining nodes. Our experiments demonstrate that the proposed framework can reduce the running time by more than 80% while preserving the quality of the solutions. Recent theoretical investigations provide support for using the K-core as a reduced representation.
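
The K-core defined above can be computed by repeatedly peeling off nodes with fewer than K remaining neighbors. The following Python sketch shows that standard peeling procedure; it is not claimed to be the implementation used in the paper, and the adjacency-dictionary input format and the function name `k_core` are assumptions for illustration.

```python
from collections import deque

def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue  # a node can be queued more than once
        removed.add(v)
        # Removing v lowers the remaining degree of each surviving neighbor.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return set(adj) - removed

# A triangle (0, 1, 2) with one pendant node (3) attached to node 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(k_core(adj, 2))
```

In the framework the abstract describes, the expensive community detection algorithm would then run only on the subgraph induced by the returned nodes, with a cheaper heuristic labeling the peeled nodes afterwards.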

preprint · 2014 · arXiv

An Adaptive Shifted Power Method for Computing Generalized Tensor Eigenpairs

Several tensor eigenpair definitions have been put forth in the past decade, but these can all be unified under the generalized tensor eigenpair framework introduced by Chang, Pearson, and Zhang (2009). Given mth-order, n-dimensional real-valued symmetric tensors A and B, the goal is to find $λ\in R$ and $x \in R^n$, $x \neq 0$, such that $Ax^{m-1} = λBx^{m-1}$. Different choices for B yield different versions of the tensor eigenvalue problem. We present our generalized eigenproblem adaptive power (GEAP) method for solving the problem, which is an extension of the shifted symmetric higher-order power method (SS-HOPM) for finding Z-eigenpairs. A major drawback of SS-HOPM was that its performance depended on choosing an appropriate shift, but our GEAP method also includes an adaptive method for choosing the shift automatically.

preprint · 2014 · arXiv

Directed closure measures for networks with reciprocity

The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the transitivity, i.e., the fraction of paths of length $2$ that participate in triangles. Real-world networks are often directed, and it can be difficult to "measure" this network structure meaningfully. We propose a collection of directed closure values for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.

preprint · 2014 · arXiv

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation are in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of $O\left(m!\right)$ and computational requirements by a factor of $O\left((m+1)!/2^m\right)$, where $m$ is the order of the tensor. However, as the analysis shows, care must be taken in choosing the correct block size to ensure these storage and computational benefits are achieved (particularly for low-order tensors). An implementation demonstrates that storage is greatly reduced and the complexity introduced by storing and computing with tensors by blocks is manageable. Preliminary results demonstrate that computational time is also reduced. The paper concludes with a discussion of how insights in this paper point to opportunities for generalizing recent advances in the domain of linear algebra libraries to the field of multi-linear computation.

preprint · 2014 · arXiv

Using Triangles to Improve Community Detection in Directed Networks

In a graph, a community may be loosely defined as a group of nodes that are more closely connected to one another than to the rest of the graph. While there are a variety of metrics that can be used to specify the quality of a given community, one common theme is that flows tend to stay within communities. Hence, we expect cycles to play an important role in community detection. For undirected graphs, the importance of triangles -- an undirected 3-cycle -- has been known for a long time and can be used to improve community detection. In directed graphs, the situation is more nuanced. The smallest cycle is simply two nodes with a reciprocal connection, and using information about reciprocation has proven to improve community detection. Our new idea is based on the four types of directed triangles that contain cycles. To identify communities in directed networks, then, we propose an undirected edge-weighting scheme based on the type of the directed triangles in which edges are involved. We also propose a new metric on quality of the communities that is based on the number of 3-cycles that are split across communities. To demonstrate the impact of our new weighting, we use the standard METIS graph partitioning tool to determine communities and show experimentally that the resulting communities result in fewer 3-cycles being cut. The magnitude of the effect varies between a 10 and 50% reduction, and we also find evidence that this weighting scheme improves a task where plausible ground-truth communities are known.

preprint · 2014 · arXiv

Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately-sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
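
As a rough illustration of the wedge sampling idea, the Python sketch below estimates the global clustering coefficient (the fraction of wedges, i.e. paths of length 2, that are closed into triangles) from a uniform sample of wedges. It is a simplified sketch, not the authors' code: the adjacency-dictionary input, the function name, and the fixed seed are assumptions for the example.

```python
import random

def wedge_sample_transitivity(adj, num_samples, seed=0):
    """Estimate the fraction of closed wedges in an undirected graph.

    adj: dict mapping each node to the set of its neighbors (no self-loops).
    """
    rng = random.Random(seed)
    # A node of degree d is the center of d * (d - 1) / 2 wedges.
    centers = [v for v in adj if len(adj[v]) >= 2]
    if not centers:
        return 0.0
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=weights, k=num_samples):
        # Choosing a center in proportion to its wedge count and then a
        # uniform pair of its neighbors yields a uniformly random wedge.
        a, b = rng.sample(sorted(adj[v]), 2)
        if b in adj[a]:
            closed += 1
    return closed / num_samples

# A triangle (0, 1, 2) with one pendant node (3): 3 of its 5 wedges are closed
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(wedge_sample_transitivity(adj, 20000))
```

The cost depends on the number of sampled wedges rather than on the number of triangles in the graph, which is what makes the approach attractive for large graphs; the accuracy of the estimate improves as `num_samples` grows.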

preprint · 2013 · arXiv

A Scalable Generative Graph Model with Community Structure

Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.

preprint · 2013 · arXiv

A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal

Degree distributions are arguably the most important property of real world networks. The classic edge configuration model or Chung-Lu model can generate an undirected graph with any desired degree distribution. This serves as a good null model to compare algorithms or perform experimental studies. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the real-world are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (one-way edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the Chung-Lu model that matches the in-, out-, and reciprocal-degree distributions of the real graphs. Our algorithm is scalable and requires $O(m)$ random numbers to generate a graph with $m$ edges. We perform a series of experiments on real datasets and compare with existing graph models.
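
For orientation, the sketch below shows the simpler undirected Chung-Lu style generator that the abstract uses as its starting point, not the directed model with reciprocal edges that the paper proposes. Each of the m edges is created by drawing two endpoints independently with probability proportional to their target degrees, so the number of random draws is linear in m. The function name and the choice to discard self-loops and repeated edges are assumptions for this example.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def fast_chung_lu(degrees, seed=0):
    """Sample an undirected graph whose expected degrees approximate `degrees`.

    Draws m = sum(degrees) // 2 edges, using two random numbers per edge.
    Self-loops and repeated edges are discarded, so realized degrees can
    fall slightly below the targets.
    """
    rng = random.Random(seed)
    total = sum(degrees)
    m = total // 2
    cumulative = list(accumulate(degrees))

    def draw():
        # Node i is returned with probability degrees[i] / total.
        return bisect_right(cumulative, rng.random() * total)

    edges = set()
    for _ in range(m):
        u, v = draw(), draw()
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return edges

# 200 nodes, each with target degree 10
edges = fast_chung_lu([10] * 200, seed=1)
mean_degree = 2 * len(edges) / 200
print(len(edges), mean_degree)
```

Extending this to the directed setting of the paper would require separate target sequences for in-, out-, and reciprocal degrees; that extension is not attempted here.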

preprint · 2013 · arXiv

An In-Depth Analysis of Stochastic Kronecker Graphs

Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.

preprint · 2013 · arXiv

Counting Triangles in Massive Graphs with MapReduce

Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 seconds per million edges plus overhead (approximately 225 seconds total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.

preprint · 2013 · arXiv

Dynamics of Trust Reciprocation in Heterogenous MMOG Networks

Understanding the dynamics of reciprocation is of great interest in sociology and computational social science. The recent growth of Massively Multi-player Online Games (MMOGs) has provided unprecedented access to large-scale data which enables us to study such complex human behavior in a more systematic manner. In this paper, we consider three different networks in the EverQuest2 game: chat, trade, and trust. The chat network has the highest level of reciprocation (33%) because there are essentially no barriers to it. The trade network has a lower rate of reciprocation (27%) because it has the obvious barrier of requiring more goods or money for exchange; moreover, there is no clear benefit to returning a trade link except in terms of social connections. The trust network has the lowest reciprocation (14%) because this equates to sharing certain within-game assets such as weapons, and so there is a high barrier for such connections because they require faith in the players that are granted such high access. In general, we observe that reciprocation rate is inversely related to the barrier level in these networks. We also note that reciprocation has connections across the heterogeneous networks. Our experiments indicate that players make use of the medium-barrier reciprocations to strengthen a relationship. We hypothesize that lower-barrier interactions are an important component to predicting higher-barrier ones. We verify our hypothesis using predictive models for trust reciprocations using features from trade interactions. Using the number of trades (both before and after the initial trust link) boosts our ability to predict if the trust will be reciprocated up to 11% with respect to the AUC.

preprint · 2012 · arXiv

Degree Relations of Triangles in Real-world Networks and Models

Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of degree-labeled triangles -- specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1% of degrees participate in the vast majority of triangles in heterogeneous graphs. We also ask whether current graph models reproduce the types of triangles that are observed in real data and show that most models fail to accurately capture these salient features.

preprint · 2012 · arXiv

On Tensors, Sparsity, and Nonnegative Factorizations

Tensors have found application in a variety of fields, ranging from chemometrics to signal processing and beyond. In this paper, we consider the problem of multilinear modeling of sparse count data. Our goal is to develop a descriptive tensor factorization model of such data, along with appropriate algorithms and theory. To do so, we propose that the random variation is best described via a Poisson distribution, which better describes the zeros observed in the data as compared to the typical assumption of a Gaussian distribution. Under a Poisson assumption, we fit a model to observed data using the negative log-likelihood score. We present a new algorithm for Poisson tensor factorization called CANDECOMP-PARAFAC Alternating Poisson Regression (CP-APR) that is based on a majorization-minimization approach. It can be shown that CP-APR is a generalization of the Lee-Seung multiplicative updates. We show how to prevent the algorithm from converging to non-KKT points and prove convergence of CP-APR under mild conditions. We also explain how to implement CP-APR for large-scale sparse tensors and present results on several data sets, both real and simulated.

preprint · 2012 · arXiv

Triadic Measures on Graphs: The Power of Wedge Sampling

Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We propose a new method based on wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.

preprint · 2011 · arXiv

All-at-once Optimization for Coupled Matrix and Tensor Factorizations

Joint analysis of data from multiple sources has the potential to improve our understanding of the underlying structures in complex data sets. For instance, in restaurant recommendation systems, recommendations can be based on rating histories of customers. In addition to rating histories, customers' social networks (e.g., Facebook friendships) and restaurant categories information (e.g., Thai or Italian) can also be used to make better recommendations. The task of fusing data, however, is challenging since data sets can be incomplete and heterogeneous, i.e., data consist of both matrices, e.g., the person by person social network matrix or the restaurant by category matrix, and higher-order tensors, e.g., the "ratings" tensor of the form restaurant by meal by person. In this paper, we are particularly interested in fusing data sets with the goal of capturing their underlying latent structures. We formulate this problem as a coupled matrix and tensor factorization (CMTF) problem where heterogeneous data sets are modeled by fitting outer-product models to higher-order tensors and matrices in a coupled manner. Unlike traditional approaches solving this problem using alternating algorithms, we propose an all-at-once optimization approach called CMTF-OPT (CMTF-OPTimization), which is a gradient-based optimization approach for joint analysis of matrices and higher-order tensors. We also extend the algorithm to handle coupled incomplete data sets. Using numerical experiments, we demonstrate that the proposed all-at-once approach is more accurate than the alternating least squares approach.

preprint · 2011 · arXiv

COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more.

preprint · 2011 · arXiv

Community structure and scale-free collections of Erdös-Rényi graphs

Community structure plays a significant role in the analysis of social networks and similar graphs, yet this structure is little understood and not well captured by most models. We formally define a community to be a subgraph that is internally highly connected and has no deeper substructure. We use tools of combinatorics to show that any such community must contain a dense Erdös-Rényi (ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavy-tailed degree distribution and community structure must contain a scale free collection of dense ER subgraphs. These theoretical observations corroborate well with empirical evidence. From this, we propose the Block Two-Level Erdös-Rényi (BTER) model, and demonstrate that it accurately captures the observable properties of many real-world social networks.

preprint · 2011 · arXiv

Shifted Power Method for Computing Tensor Eigenpairs

Recent work on eigenvalues and eigenvectors for tensors of order m >= 3 has been motivated by applications in blind source separation, magnetic resonance imaging, molecular conformation, and more. In this paper, we consider methods for computing real symmetric-tensor eigenpairs of the form Ax^{m-1} = λx subject to ||x||=1, which is closely related to optimal rank-1 approximation of a symmetric tensor. Our contribution is a shifted symmetric higher-order power method (SS-HOPM), which we show is guaranteed to converge to a tensor eigenpair. SS-HOPM can be viewed as a generalization of the power iteration method for matrices or of the symmetric higher-order power method. Additionally, using fixed point analysis, we can characterize exactly which eigenpairs can and cannot be found by the method. Numerical examples are presented, including examples from an extension of the method to finding complex eigenpairs.
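
A shifted power iteration of the kind described here can be sketched in a few lines of plain Python for a symmetric 3-way tensor (m = 3). This is a minimal illustration under stated assumptions, not the authors' implementation: the shift `alpha` is fixed by hand, the tensor is stored as nested lists, and the stopping rule on the change in λ is a choice made for this example.

```python
import math

def apply_tensor(A, x):
    """Return the vector A x^{m-1} for a symmetric 3-way tensor A (nested lists)."""
    n = len(x)
    return [sum(A[i][j][k] * x[j] * x[k] for j in range(n) for k in range(n))
            for i in range(n)]

def ss_hopm(A, x0, alpha=1.0, tol=1e-14, max_iter=1000):
    """Shifted power iteration: x <- normalize(A x^{m-1} + alpha * x), with m = 3."""
    norm = math.sqrt(sum(v * v for v in x0))
    x = [v / norm for v in x0]
    lam = sum(a * v for a, v in zip(apply_tensor(A, x), x))  # lambda = A x^m
    for _ in range(max_iter):
        y = [a + alpha * v for a, v in zip(apply_tensor(A, x), x)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        new_lam = sum(a * v for a, v in zip(apply_tensor(A, x), x))
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, x

# Toy tensor 2 * e1^3 + 1 * e2^3, whose eigenpairs include (2, e1) and (1, e2)
A = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
A[0][0][0] = 2.0
A[1][1][1] = 1.0
lam, x = ss_hopm(A, [0.8, 0.6])
print(lam, x)
```

From the starting point used here the iteration settles on the eigenpair with λ = 2 and x along the first coordinate axis. Which eigenpair is reached depends on the starting vector and the shift, which matches the abstract's remark that only certain eigenpairs can be found by the method.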

preprint 2011 arXiv

The Similarity between Stochastic Kronecker and Chung-Lu Graph Models

The analysis of massive graphs is becoming an important part of scientific and industrial research. This has led to the construction of a large variety of graph models, each with its own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model. We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.
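
To make the CL model concrete, here is a minimal sketch of one common way to sample a Chung-Lu style graph by independent edge insertions. It assumes that each of m = sum(d)/2 edges picks both endpoints independently, choosing vertex i with probability d_i / sum(d), so that the expected degree of vertex i equals d_i. The function name and the example degree sequence are hypothetical, and the abstract does not spell out the sampling procedure, so this is an illustration rather than the paper's exact generator.

```python
import random


def chung_lu_edges(degrees, rng):
    """Draw m = sum(degrees) // 2 edges with endpoints chosen proportionally to degree.

    Self-loops and repeated edges can occur and are left in place in this sketch.
    """
    total = sum(degrees)
    m = total // 2
    vertices = list(range(len(degrees)))
    sources = rng.choices(vertices, weights=degrees, k=m)
    targets = rng.choices(vertices, weights=degrees, k=m)
    return list(zip(sources, targets))


def realized_degrees(edges, n):
    """Count how often each vertex appears as an endpoint."""
    counts = [0] * n
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


rng = random.Random(0)              # fixed seed so the example is repeatable
desired = [500, 300, 100, 100]      # hypothetical target degree sequence
edges = chung_lu_edges(desired, rng)
observed = realized_degrees(edges, len(desired))
```

Because the only input is the desired degree sequence, fitting this model to a real graph amounts to reading off its degrees, which matches the "trivially simple fitting procedure" mentioned above.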

preprint 2010 arXiv

Making Tensor Factorizations Robust to Non-Gaussian Noise

Tensors are multi-way arrays, and the Candecomp/Parafac (CP) tensor factorization has found application in many different domains. The CP model is typically fit using a least squares objective function, which is a maximum likelihood estimate under the assumption of i.i.d. Gaussian noise. We demonstrate that this loss function can actually be highly sensitive to non-Gaussian noise. Therefore, we propose a loss function based on the 1-norm because it can accommodate both Gaussian and grossly non-Gaussian perturbations. We also present an alternating majorization-minimization algorithm for fitting a CP model using our proposed loss function.
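
The sensitivity of a least squares fit to gross outliers can be seen in a scalar analogue, sketched below. Fitting a single constant c to data under squared loss gives the mean, while the 1-norm loss is minimized by the median. This is only an illustration of the two loss functions on made-up numbers; it is not the CP fitting algorithm from the paper.

```python
import statistics


def squared_loss(data, c):
    """Least squares objective for fitting the constant c."""
    return sum((v - c) ** 2 for v in data)


def absolute_loss(data, c):
    """1-norm objective for fitting the constant c."""
    return sum(abs(v - c) for v in data)


# Four well-behaved measurements and one gross outlier (made-up values).
data = [1.0, 1.1, 0.9, 1.0, 100.0]

ls_fit = statistics.mean(data)    # minimizer of the squared loss, pulled toward 100
l1_fit = statistics.median(data)  # minimizer of the 1-norm loss, stays near 1
```

The single outlier drags the least squares estimate far from the bulk of the data, whereas the 1-norm estimate is unaffected.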

preprint 2010 arXiv

Scalable Tensor Factorizations for Incomplete Data

The problem of incomplete data (i.e., data with missing or unknown values) in multi-way arrays is ubiquitous in biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, communication networks, etc. We consider the problem of how to factorize data sets with missing values with the goal of capturing the underlying latent structure of the data and possibly reconstructing missing values (i.e., tensor completion). We focus on one of the most well-known tensor factorizations that captures multi-linear structure, CANDECOMP/PARAFAC (CP). In the presence of missing data, CP can be formulated as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) that uses a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factorize tensors with noise and up to 99% missing data. A unique aspect of our approach is that it scales to sparse large-scale data, e.g., 1000 x 1000 x 1000 with five million known entries (0.5% dense). We further demonstrate the usefulness of CP-WOPT on two real-world applications: a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes and the problem of modeling computer network traffic where data may be absent due to the expense of the data collection process.
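
The weighted least squares formulation over known entries can be sketched as follows for a three-way tensor. The sketch assumes the objective f = 1/2 * sum over known (i, j, k) of (model_ijk - x_ijk)^2, where model_ijk = sum_r A[i][r] * B[j][r] * C[k][r], and returns the gradient that a first-order optimizer would consume. The function names and the dictionary-of-known-entries layout are choices made for this example and are not taken from the CP-WOPT software.

```python
def cp_entry(A, B, C, i, j, k):
    """Value of the rank-R CP model at position (i, j, k)."""
    return sum(a * b * c for a, b, c in zip(A[i], B[j], C[k]))


def weighted_objective_and_gradient(known, A, B, C):
    """Objective and gradient using only the known entries.

    known maps (i, j, k) -> observed value; missing entries are simply absent,
    so the cost scales with the number of known entries, not the tensor size.
    """
    rank = len(A[0])
    f = 0.0
    grad_A = [[0.0] * rank for _ in A]
    grad_B = [[0.0] * rank for _ in B]
    grad_C = [[0.0] * rank for _ in C]
    for (i, j, k), value in known.items():
        resid = cp_entry(A, B, C, i, j, k) - value
        f += 0.5 * resid * resid
        for r in range(rank):
            grad_A[i][r] += resid * B[j][r] * C[k][r]
            grad_B[j][r] += resid * A[i][r] * C[k][r]
            grad_C[k][r] += resid * A[i][r] * B[j][r]
    return f, (grad_A, grad_B, grad_C)


# Small made-up factor matrices (2 x 3 x 2 tensor, rank 2) and three known entries.
A = [[1.0, 2.0], [0.5, -1.0]]
B = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
C = [[1.0, 1.0], [2.0, -1.0]]
positions = [(0, 0, 0), (1, 2, 1), (0, 1, 1)]
known = {p: cp_entry(A, B, C, *p) + 1.0 for p in positions}  # data offset by 1

f, (grad_A, grad_B, grad_C) = weighted_objective_and_gradient(known, A, B, C)
```

The objective and gradient pair can be handed to any gradient-based optimizer; the choice of optimizer is left open here.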

preprint 2010 arXiv

Temporal Link Prediction using Matrix and Tensor Factorizations

The data in many disciplines such as social networks, web analysis, etc. is link-based, and the link structure can be exploited for many different data mining tasks. In this paper, we consider the problem of temporal link prediction: Given link data for times 1 through T, can we predict the links at time T+1? If our data has underlying periodic structure, can we predict out even further in time, i.e., links at time T+2, T+3, etc.? We focus on bipartite graphs that evolve over time and on matrix- and tensor-based methods for predicting future links. We present a weight-based method for collapsing multi-year data into a single matrix. We show how the well-known Katz method for link prediction can be extended to bipartite graphs and, moreover, approximated in a scalable way using a truncated singular value decomposition. Using a CANDECOMP/PARAFAC tensor decomposition of the data, we illustrate the usefulness of exploiting the natural three-dimensional structure of temporal link data. Through several numerical experiments, we demonstrate that both matrix- and tensor-based techniques are effective for temporal link prediction despite the inherent difficulty of the problem. Additionally, we show that tensor-based techniques are particularly effective for temporal data with varying periodic patterns.
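
Collapsing time slices into a single matrix with weights can be sketched as below. The abstract does not state the weighting formula, so this example assumes a geometric decay in which the slice at time t receives weight (1 - theta)^(T - t); the parameter name theta, the function name and the toy data are all hypothetical.

```python
def collapse_weighted(slices, theta):
    """Combine T link matrices into one, down-weighting older slices.

    slices[t - 1] is the matrix for time t (t = 1..T); theta in [0, 1) is an
    assumed decay parameter, and theta = 0 reduces to a plain sum over time.
    """
    T = len(slices)
    rows, cols = len(slices[0]), len(slices[0][0])
    collapsed = [[0.0] * cols for _ in range(rows)]
    for t, Z in enumerate(slices, start=1):
        weight = (1.0 - theta) ** (T - t)
        for i in range(rows):
            for j in range(cols):
                collapsed[i][j] += weight * Z[i][j]
    return collapsed


# Three yearly 1 x 2 link matrices (made-up data): the most recent year counts most.
slices = [[[1, 0]], [[1, 1]], [[0, 1]]]
recent_heavy = collapse_weighted(slices, theta=0.5)
plain_sum = collapse_weighted(slices, theta=0.0)
```

The collapsed matrix can then be passed to a matrix-based predictor, while a tensor-based method would instead keep the slices as a third dimension.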

29 published items

Practical Leverage-Based Sampling for Low-Rank Tensor Decomposition

Randomized Algorithms for Scientific Computing (RASC)

Tensor Moments of Gaussian Mixture Models: Theory and Applications

Generalized Canonical Polyadic Tensor Decomposition

TuckerMPI: A Parallel C++/MPI Software Package for Large-scale Data Compression via the Tucker Tensor Decomposition

Parallel Tensor Compression for Large-Scale Scientific Data

Symmetric Orthogonal Tensor Decomposition is Trivial

Accelerating Community Detection by Using K-core Subgraphs

An Adaptive Shifted Power Method for Computing Generalized Tensor Eigenpairs

Directed closure measures for networks with reciprocity

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

Using Triangles to Improve Community Detection in Directed Networks

Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

A Scalable Generative Graph Model with Community Structure

A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal

An In-Depth Analysis of Stochastic Kronecker Graphs

Counting Triangles in Massive Graphs with MapReduce

Dynamics of Trust Reciprocation in Heterogenous MMOG Networks

Degree Relations of Triangles in Real-world Networks and Models

On Tensors, Sparsity, and Nonnegative Factorizations

Triadic Measures on Graphs: The Power of Wedge Sampling

All-at-once Optimization for Coupled Matrix and Tensor Factorizations

COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

Community structure and scale-free collections of Erdös-Rényi graphs

Shifted Power Method for Computing Tensor Eigenpairs

The Similarity between Stochastic Kronecker and Chung-Lu Graph Models

Making Tensor Factorizations Robust to Non-Gaussian Noise

Scalable Tensor Factorizations for Incomplete Data

Temporal Link Prediction using Matrix and Tensor Factorizations