Source author record

Olga Holtz

Olga Holtz appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms; Computational Complexity; math.NA; Numerical Analysis; math.CA; math.CO; math.RA; Distributed, Parallel, and Cluster Computing; math.HO; math.AC; math.CV; math.PR

Catalog footprint

What is connected

17 works

12 topics

4 close collaborators

preprint · 2022 · arXiv

Communication Bounds for Convolutional Neural Networks

Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor supplied algorithm.

preprint · 2020 · arXiv

Sparsifying the Operators of Fast Matrix Multiplication Algorithms

Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's bilinear operator. Unfortunately, the problem of finding optimal sparsifications is NP-Hard. We obtain three new methods to this end, and apply them to existing fast matrix multiplication algorithms, thus improving their leading coefficients. These methods have an exponential worst case running time, but run fast in practice and improve the performance of many fast matrix multiplication algorithms. Two of the methods are guaranteed to produce leading coefficients that, under some assumptions, are optimal.

preprint · 2015 · arXiv

Generalized Hurwitz matrices, generalized Euclidean algorithm, and forbidden sectors of the complex plane

Given a polynomial \[ f(x)=a_0x^n+a_1x^{n-1}+\cdots +a_n \] with positive coefficients $a_k$, and a positive integer $M\leq n$, we define an (infinite) generalized Hurwitz matrix $H_M(f):=(a_{Mj-i})_{i,j}$. We prove that the polynomial $f(z)$ does not vanish in the sector $$ \left\{z\in\mathbb{C}: |\arg (z)| < \frac{\pi}{M}\right\} $$ whenever the matrix $H_M$ is totally nonnegative. This result generalizes the classical Hurwitz theorem on stable polynomials ($M=2$), the Aissen-Edrei-Schoenberg-Whitney theorem on polynomials with negative real roots ($M=1$), and the Cowling-Thron theorem ($M=n$). In this connection, we also develop a generalization of the classical Euclidean algorithm, which is of independent interest.
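To make the definition $H_M(f)=(a_{Mj-i})_{i,j}$ concrete, here is a minimal sketch that builds a finite leading section of the matrix. The function name `generalized_hurwitz`, the 1-based index range for $i,j$, and the convention $a_k=0$ outside $0\le k\le n$ are assumptions made for illustration; the abstract does not fix them.

```python
from typing import List


def generalized_hurwitz(coeffs: List[float], M: int, size: int) -> List[List[float]]:
    """Leading size-by-size section of H_M(f) = (a_{M*j - i}), i, j = 1..size.

    coeffs[k] holds a_k for f(x) = a_0 x^n + ... + a_n.
    Assumed convention: a_k = 0 whenever k < 0 or k > n.
    """
    n = len(coeffs) - 1

    def a(k: int) -> float:
        return coeffs[k] if 0 <= k <= n else 0

    return [[a(M * j - i) for j in range(1, size + 1)]
            for i in range(1, size + 1)]


# f(x) = x^2 + 3x + 2 = (x + 1)(x + 2), so a_0 = 1, a_1 = 3, a_2 = 2.
print(generalized_hurwitz([1, 3, 2], M=2, size=2))  # [[3, 0], [1, 2]]
print(generalized_hurwitz([1, 3, 2], M=1, size=3))  # [[1, 3, 2], [0, 1, 3], [0, 0, 1]]
```

Under these conventions, $M=2$ reproduces the familiar Hurwitz matrix layout and $M=1$ gives an upper triangular Toeplitz matrix. Checking total nonnegativity would require testing all minors, which this sketch does not attempt.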

preprint · 2012 · arXiv

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.

preprint · 2012 · arXiv

Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication

Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example, to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal.

preprint · 2012 · arXiv

Matrices that commute with their derivative. On a letter from Schur to Wielandt

We examine when a matrix whose elements are differentiable functions in one variable commutes with its derivative. This problem was discussed in a letter from Issai Schur to Helmut Wielandt written in 1934, which we found in Wielandt's Nachlass. We present this letter and its translation into English. The topic was rediscovered later and partial results were proved. However, there are many subtle observations in Schur's letter which were not obtained in later years. Using an algebraic setting, we put these into perspective and extend them in several directions. We present in detail the relationship between several conditions mentioned in Schur's letter and we focus in particular on the characterization of matrices called Type 1 by Schur. We also present several examples that demonstrate Schur's observations.

preprint · 2012 · arXiv

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.

preprint · 2011 · arXiv

Graph Expansion and Communication Costs of Fast Matrix Multiplication

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega((\frac{n}{\sqrt M})^{\omega_0}\cdot M)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
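As a worked instance of the sequential bound stated above, substituting the two exponents named in the abstract gives the following. Constants hidden by the $\Omega$ are dropped, and the decimal values are rounded.

```latex
\[
\left(\frac{n}{\sqrt{M}}\right)^{\omega_0}\cdot M
  \;=\; \frac{n^{\omega_0}}{M^{\omega_0/2-1}},
\]
\[
\omega_0 = 3:\quad \frac{n^{3}}{\sqrt{M}},
\qquad
\omega_0 = \lg 7 \approx 2.807:\quad
\frac{n^{\lg 7}}{M^{(\lg 7)/2-1}} \approx \frac{n^{2.807}}{M^{0.404}}.
\]
```

So for the Strassen exponent the bound grows more slowly in $n$ than for the conventional algorithm, and it also shrinks more slowly as the fast memory $M$ grows.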

preprint · 2011 · arXiv

Structured matrices, continued fractions, and root localization of polynomials

We give a detailed account of various connections between several classes of objects: Hankel, Hurwitz, Toeplitz, Vandermonde and other structured matrices, Stieltjes and Jacobi-type continued fractions, Cauchy indices, moment problems, total positivity, and root localization of univariate polynomials. Along with a survey of many classical facts, we provide a number of new results.

preprint · 2011 · arXiv

Szegő's theorem for matrix orthogonal polynomials

We extend some classical theorems in the theory of orthogonal polynomials on the unit circle to the matrix case. In particular, we prove a matrix analogue of Szegő's theorem. As a by-product, we also obtain an elementary proof of the distance formula by Helson and Lowdenslager.

preprint · 2011 · arXiv

Zonotopal algebra

A wealth of geometric and combinatorial properties of a given linear endomorphism $X$ of $\mathbb{R}^N$ is captured in the study of its associated zonotope $Z(X)$, and, by duality, its associated hyperplane arrangement $\mathcal{H}(X)$. This well-known line of study is particularly interesting in case $n := \operatorname{rank} X \ll N$. We enhance this study to an algebraic level, and associate $X$ with three algebraic structures, referred to herein as external, central, and internal. Each algebraic structure is given in terms of a pair of homogeneous polynomial ideals in $n$ variables that are dual to each other: one encodes properties of the arrangement $\mathcal{H}(X)$, while the other encodes by duality properties of the zonotope $Z(X)$. The algebraic structures are defined purely in terms of the combinatorial structure of $X$, but are subsequently proved to be equally obtainable by applying suitable algebro-analytic operations to either of $Z(X)$ or $\mathcal{H}(X)$. The theory is universal in the sense that it requires no assumptions on the map $X$ (the only exception being that the algebro-analytic operations on $Z(X)$ yield sought-for results only in case $X$ is unimodular), and provides new tools that can be used in enumerative combinatorics, graph theory, representation theory, polytope geometry, and approximation theory.

preprint · 2010 · arXiv

Communication-optimal Parallel and Sequential Cholesky Decomposition

Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.

preprint · 2010 · arXiv

Hierarchical zonotopal spaces

Zonotopal algebra interweaves algebraic, geometric and combinatorial properties of a given linear map X. Of basic significance in this theory is the fact that the algebraic structures are derived from the geometry (via a non-linear procedure known as "the least map"), and that the statistics of the algebraic structures (e.g., the Hilbert series of various polynomial ideals) are combinatorial, i.e., computable using a simple discrete algorithm known as "the valuation function". On the other hand, the theory is somewhat rigid since it deals, for the given X, with exactly two pairs each of which is made of a nested sequence of three ideals: an external ideal (the smallest), a central ideal (the middle), and an internal ideal (the largest). In this paper we show that the fundamental principles of zonotopal algebra as described in the previous paragraph extend far beyond the setup of external, central and internal ideals by building a whole hierarchy of new combinatorially defined zonotopal spaces.

preprint · 2010 · arXiv

New coins from old, smoothly

Given a (known) function $f:[0,1] \to (0,1)$, we consider the problem of simulating a coin with probability of heads $f(p)$ by tossing a coin with unknown heads probability $p$, as well as a fair coin, $N$ times each, where $N$ may be random. The work of Keane and O'Brien (1994) implies that such a simulation scheme with the probability $\mathbb{P}_p(N<\infty)$ equal to 1 exists iff $f$ is continuous. Nacu and Peres (2005) proved that $f$ is real analytic in an open set $S \subset (0,1)$ iff such a simulation scheme exists with the probability $\mathbb{P}_p(N>n)$ decaying exponentially in $n$ for every $p \in S$. We prove that for $\alpha>0$ non-integer, $f$ is in the space $C^\alpha[0,1]$ if and only if a simulation scheme as above exists with $\mathbb{P}_p(N>n) \le C (\Delta_n(p))^\alpha$, where $\Delta_n(x) := \max \{\sqrt{x(1-x)/n},1/n \}$. The key to the proof is a new result in approximation theory: Let $\mathcal{B}_n$ be the cone of univariate polynomials with nonnegative Bernstein coefficients of degree $n$. We show that a function $f:[0,1] \to (0,1)$ is in $C^\alpha[0,1]$ if and only if $f$ has a series representation $\sum_{n=1}^\infty F_n$ with $F_n \in \mathcal{B}_n$ and $\sum_{k>n} F_k(x) \le C(\Delta_n(x))^\alpha$ for all $x \in [0,1]$ and $n \ge 1$. We also provide a counterexample to a theorem stated without proof by Lorentz (1963), who claimed that if some $\varphi_n \in \mathcal{B}_n$ satisfy $|f(x)-\varphi_n(x)| \le C (\Delta_n(x))^\alpha$ for all $x \in [0,1]$ and $n \ge 1$, then $f \in C^\alpha[0,1]$.
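The simulation problem can be illustrated with a deliberately simple target function. The sketch below is not the paper's construction: it assumes the hypothetical choice $f(p) = 1/4 + p/2$, which maps $[0,1]$ into $(0,1)$ and needs at most two tosses, and the names `new_coin` and `estimate` are made up for this example.

```python
import random


def new_coin(p_coin, fair_coin) -> bool:
    """One toss of a coin with heads probability f(p) = 1/4 + p/2.

    A fair toss picks the source of the result: the unknown p-coin
    (probability 1/2) or a second fair toss (probability 1/2), so
    P(heads) = p/2 + 1/4 without ever needing to know p.
    """
    if fair_coin():
        return p_coin()
    return fair_coin()


def estimate(p: float, trials: int = 100_000, seed: int = 0) -> float:
    """Empirical heads frequency of new_coin for a given hidden p."""
    rng = random.Random(seed)
    p_coin = lambda: rng.random() < p       # coin with heads probability p
    fair_coin = lambda: rng.random() < 0.5  # fair coin
    return sum(new_coin(p_coin, fair_coin) for _ in range(trials)) / trials


# With p = 0.3 the target is f(p) = 0.4; the estimate should land near it.
print(estimate(0.3))
```

Here the number of tosses is bounded, which is the easy case. The results quoted in the abstract concern how fast the tail of $N$ can decay for less smooth $f$, where no bounded scheme exists.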

preprint · 2009 · arXiv

Computational Complexity and Numerical Stability of Linear Problems

We survey classical and recent developments in numerical linear algebra, focusing on two issues: computational complexity, or arithmetic costs, and numerical stability, or performance under roundoff error. We present a brief account of the algebraic complexity theory as well as the general error analysis for matrix multiplication and related problems. We emphasize the central role played by the matrix multiplication problem and discuss historical and modern approaches to its solution.

preprint · 2009 · arXiv

Minimizing Communication in Linear Algebra

In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where $M$ is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
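To see what the bound $\Omega(\#\text{arithmetic operations}/\sqrt{M})$ means for conventional matrix multiplication, the sketch below counts words moved by a standard tiled (blocked) multiplication and compares the count with $n^3/\sqrt{M}$. This is a textbook tiling used for illustration, not the paper's proof technique. The cost model (one word per matrix entry loaded or stored, each $C$ tile accumulated in fast memory and written back once) and the sizes `n` and `M` are assumptions.

```python
import math


def tiled_matmul_words(n: int, b: int) -> int:
    """Words moved between slow and fast memory by b-by-b tiled matmul.

    Assumes b divides n and that three b-by-b tiles fit in fast memory.
    """
    assert n % b == 0
    blocks = n // b
    words = 0
    for _ in range(blocks):          # block row of C
        for _ in range(blocks):      # block column of C
            for _ in range(blocks):  # inner block index
                words += 2 * b * b   # load one tile of A and one tile of B
            words += b * b           # store the finished tile of C
    return words


n = 1024
M = 3 * 64 * 64                 # hypothetical fast memory: three 64x64 tiles
b = math.isqrt(M // 3)          # tile edge chosen so three tiles fit
moved = tiled_matmul_words(n, b)
scale = n**3 / math.sqrt(M)     # lower-bound scale, constants dropped
print(moved, moved / scale)     # the ratio is a modest constant
```

The count equals $2n^3/b + n^2$, so with $b$ proportional to $\sqrt{M}$ the tiled algorithm moves $O(n^3/\sqrt{M})$ words and matches the lower bound up to a constant factor.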

preprint · 2007 · arXiv

Fast linear algebra is stable

In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{\omega + \eta})$ operations for any $\eta > 0$, then it can be done stably in $O(n^{\omega + \eta})$ operations for any $\eta > 0$. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably (in a normwise sense) in $O(n^{\omega + \eta})$ operations.
