Source author record

Edgar Solomonik

Edgar Solomonik appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.NA Numerical Analysis Distributed, Parallel, and Cluster Computing Data Structures and Algorithms Machine Learning physics.comp-ph quant-ph cond-mat.str-el Discrete Mathematics Mathematical Software

Catalog footprint

What is connected

11works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Alternating Mahalanobis Distance Minimization for Stable and Accurate CP Decomposition

CP decomposition (CPD) is prevalent in chemometrics, signal processing, data mining and many more fields. While many algorithms have been proposed to compute the CPD, alternating least squares (ALS) remains one of the most widely used algorithm for computing the decomposition. Recent works have introduced the notion of eigenvalues and singular values of a tensor and explored applications of eigenvectors and singular vectors in areas like signal processing, data analytics and in various other fields. We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work. Computing these critical points in an alternating manner motivates an alternating optimization algorithm which corresponds to alternating least squares algorithm in the matrix case. However, for tensors with order greater than equal to $3$, it minimizes an objective function which is different from the commonly used least squares loss. Alternating optimization of this new objective leads to simple updates to the factor matrices with the same asymptotic computational cost as ALS. We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank and verify it experimentally. We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors. This perspective allows us to generalize our approach to interpolate between updates corresponding to the ALS and the new algorithm to manage the tradeoff between stability and fitness of the decomposition. Our experimental results show that for approximating synthetic and real-world tensors, this algorithm and its variants converge to a better conditioned decomposition with comparable and sometimes better fitness as compared to the ALS algorithm.

preprint2022arXiv

Cost-efficient Gaussian Tensor Network Embeddings for Tensor-structured Inputs

This work discusses tensor network embeddings, which are random matrices ($S$) with tensor network structure. These embeddings have been used to perform dimensionality reduction of tensor network structured inputs $x$ and accelerate applications such as tensor decomposition and kernel regression. Existing works have designed embeddings for inputs $x$ with specific structures, such that the computational cost for calculating $Sx$ is efficient. We provide a systematic way to design tensor network embeddings consisting of Gaussian random tensors, such that for inputs with more general tensor network structures, both the sketch size (row size of $S$) and the sketching computational cost are low. We analyze general tensor network embeddings that can be reduced to a sequence of sketching matrices. We provide a sufficient condition to quantify the accuracy of such embeddings and derive sketching asymptotic cost lower bounds using embeddings that satisfy this condition and have a sketch size lower than any input dimension. We then provide an algorithm to efficiently sketch input data using such embeddings. The sketch size of the embedding used in the algorithm has a linear dependence on the number of sketching dimensions of the input. Assuming tensor contractions are performed with classical dense matrix multiplication algorithms, this algorithm achieves asymptotic cost within a factor of $O(\sqrt{m})$ of our cost lower bound, where $m$ is the sketch size. Further, when each tensor in the input has a dimension that needs to be sketched, this algorithm yields the optimal sketching asymptotic cost. We apply our sketching analysis to inexact tensor decomposition optimization algorithms. We provide a sketching algorithm for CP decomposition that is asymptotically faster than existing work in multiple regimes, and show optimality of an existing algorithm for tensor train rounding.

preprint2021arXiv

Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths

The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration's performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel's statistical profile. This strategy is effective in the presence of frequently-recurring computation and communication kernels, which is characteristic to algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.

preprint2020arXiv

Comparison of Accuracy and Scalability of Gauss-Newton and Alternating Least Squares for CP Decomposition

Alternating least squares is the most widely used algorithm for CP tensor decomposition. However, alternating least squares may exhibit slow or no convergence, especially when high accuracy is required. An alternative approach is to regard CP decomposition as a nonlinear least squares problem and employ Newton-like methods. Direct solution of linear systems involving an approximated Hessian is generally expensive. However, recent advancements have shown that use of an implicit representation of the linear system makes these methods competitive with alternating least squares. We provide the first parallel implementation of a Gauss-Newton method for CP decomposition, which iteratively solves linear least squares problems at each Gauss-Newton step. In particular, we leverage a formulation that employs tensor contractions for implicit matrix-vector products within the conjugate gradient method. The use of tensor contractions enables us to employ the Cyclops library for distributed-memory tensor computations to parallelize the Gauss-Newton approach with a high-level Python implementation. In addition, we propose a regularization scheme for Gauss-Newton method to improve convergence properties without any additional cost. We study the convergence of variants of the Gauss-Newton method relative to ALS for finding exact CP decompositions as well as approximate decompositions of real-world tensors. We evaluate the performance of sequential and parallel versions of both approaches, and study the parallel scalability on the Stampede2 supercomputer.

preprint2020arXiv

Derivation and Analysis of Fast Bilinear Algorithms for Convolution

The prevalence of convolution in applications within signal processing, deep neural networks, and numerical solvers has motivated the development of numerous fast convolution algorithms. In many of these problems, convolution is performed on terabytes or petabytes of data, so even constant factors of improvement can significantly reduce the computation time. We leverage the formalism of bilinear algorithms to describe and analyze all of the most popular approaches. This unified lens permits us to study the relationship between different variants of convolution as well as to derive error bounds and analyze the cost of the various algorithms. We provide new derivations, which predominantly leverage matrix and tensor algebra, to describe the Winograd family of convolution algorithms as well as reductions between 1D and multidimensional convolution. We provide cost and error bounds as well as experimental numerical studies. Our experiments for two of these algorithms, the overlap-add approach and Winograd convolution algorithm with polynomials of degree greater than one, show that fast convolution algorithms can rival the accuracy of the fast Fourier transform (FFT) without using complex arithmetic. These algorithms can be used for convolution problems with multidimensional inputs or for filters larger than size of four, extending the state-of-the-art in Winograd-based convolution algorithms.

preprint2020arXiv

Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

The Density Matrix Renormalization Group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.

preprint2020arXiv

Efficient 2D Tensor Network Simulation of Quantum Systems

Simulation of quantum systems is challenging due to the exponential size of the state space. Tensor networks provide a systematically improvable approximation for quantum states. 2D tensor networks such as Projected Entangled Pair States (PEPS) are well-suited for key classes of physical systems and quantum circuits. However, direct contraction of PEPS networks has exponential cost, while approximate algorithms require computations with large tensors. We propose new scalable algorithms and software abstractions for PEPS-based methods, accelerating the bottleneck operation of contraction and refactorization of a tensor subnetwork. We employ randomized SVD with an implicit matrix to reduce cost and memory footprint asymptotically. Further, we develop a distributed-memory PEPS library and study accuracy and efficiency of alternative algorithms for PEPS contraction and evolution on the Stampede2 supercomputer. We also simulate a popular near-term quantum algorithm, the Variational Quantum Eigensolver (VQE), and benchmark Imaginary Time Evolution (ITE), which compute ground states of Hamiltonians.

preprint2020arXiv

On Stability of Tensor Networks and Canonical Forms

Tensor networks such as matrix product states (MPS) and projected entangled pair states (PEPS) are commonly used to approximate quantum systems. These networks are optimized in methods such as DMRG or evolved by local operators. We provide bounds on the conditioning of tensor network representations to sitewise perturbations. These bounds characterize the extent to which local approximation error in the tensor sites of a tensor network can be amplified to error in the tensor it represents. In known tensor network methods, canonical forms of tensor network are used to minimize such error amplification. However, canonical forms are difficult to obtain for many tensor networks of interest. We quantify the extent to which error can be amplified in general tensor networks, yielding estimates of the benefit of the use of canonical forms. For the MPS and PEPS tensor networks, we provide simple forms on the worst-case error amplification. Beyond theoretical error bounds, we experimentally study the dependence of the error on the size of the network for perturbed random MPS tensor networks.

preprint2020arXiv

Pareto-Efficient Quantum Circuit Simulation Using Tensor Contraction Deferral

With the current rate of progress in quantum computing technologies, systems with more than 50 qubits will soon become reality. Computing ideal quantum state amplitudes for circuits of such and larger sizes is a fundamental step to assess both the correctness, performance, and scaling behavior of quantum algorithms and the fidelities of quantum devices. However, resource requirements for such calculations on classical computers grow exponentially. We show that deferring tensor contractions can extend the boundaries of what can be computed on classical systems. To demonstrate this technique, we present results obtained from a calculation of the complete set of output amplitudes of a universal random circuit with depth 27 in a 2D lattice of $7 \times 7$ qubits, and an arbitrarily selected slice of $2^{37}$ amplitudes of a universal random circuit with depth 23 in a 2D lattice of $8 \times 7$ qubits. Combining our methodology with other decomposition approaches found in the literature, we show that we can simulate $7 \times 7$-qubit random circuits to arbitrary depth by leveraging secondary storage. These calculations were thought to be impossible due to resource requirements.

preprint2016arXiv

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store $c$ copies of the symmetric matrix, our algorithm requires $Θ(\sqrt{c})$ less interprocessor communication than previously known algorithms, for any $c\leq p^{1/3}$ when using $p$ processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to $O(\log p)$ thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.

preprint2015arXiv

Sparse Tensor Algebra as a Parallel Programming Model

Dense and sparse tensors allow the representation of most bulk data structures in computational science applications. We show that sparse tensor algebra can also be used to express many of the transformations on these datasets, especially those which are parallelizable. Tensor computations are a natural generalization of matrix and graph computations. We extend the usual basic operations of tensor summation and contraction to arbitrary functions, and further operations such as reductions and mapping. The expression of these transformations in a high-level sparse linear algebra domain specific language allows our framework to understand their properties at runtime to select the preferred communication-avoiding algorithm. To demonstrate the efficacy of our approach, we show how key graph algorithms as well as common numerical kernels can be succinctly expressed using our interface and provide performance results of a general library implementation.

Edgar Solomonik

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Alternating Mahalanobis Distance Minimization for Stable and Accurate CP Decomposition

Cost-efficient Gaussian Tensor Network Embeddings for Tensor-structured Inputs

Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths

Comparison of Accuracy and Scalability of Gauss-Newton and Alternating Least Squares for CP Decomposition

Derivation and Analysis of Fast Bilinear Algorithms for Convolution

Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

Efficient 2D Tensor Network Simulation of Quantum Systems

On Stability of Tensor Networks and Canonical Forms

Pareto-Efficient Quantum Circuit Simulation Using Tensor Contraction Deferral

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

Sparse Tensor Algebra as a Parallel Programming Model