Source author record

Aravind Sankaran

Aravind Sankaran appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Performance; Mathematical Software; Distributed, Parallel, and Cluster Computing; Machine Learning

Catalog footprint

What is connected

3 works

4 topics

4 close collaborators

Preprint · 2022 · arXiv

Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch

Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error-prone and time-consuming process of using these kernels directly; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.
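
To make the matrix-chain point concrete, the sketch below counts scalar multiplications for a chain evaluated strictly left to right and compares that with the optimum found by the textbook dynamic program. It is not the paper's benchmark code; the function names `chain_cost` and `left_to_right_cost` are hypothetical, and the cost model (m·k·n multiplications for an m×k by k×n product) is the usual simplifying assumption.

```python
def chain_cost(dims):
    """Minimum scalar multiplications to evaluate a matrix chain.

    dims has length n + 1; matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    if n <= 0:
        return 0
    # cost[i][j] = cheapest way to multiply matrices i..j (inclusive)
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]


def left_to_right_cost(dims):
    """Scalar multiplications when the chain is evaluated as ((A0 A1) A2) ..."""
    return sum(dims[0] * dims[k] * dims[k + 1] for k in range(1, len(dims) - 1))


# A (1000x1000) times B (1000x1000) times v (1000x1)
dims = [1000, 1000, 1000, 1]
naive = left_to_right_cost(dims)   # (A B) v
best = chain_cost(dims)            # A (B v)
```

For this chain, evaluating `(A B) v` forms a full matrix-matrix product first, whereas `A (B v)` needs only two matrix-vector products, which is the kind of gap a framework unaware of parenthesization would leave on the table.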

Preprint · 2021 · arXiv

Performance Comparison for Scientific Computations on the Edge via Relative Performance

In a typical Internet-of-Things setting that involves scientific applications, a target computation can be evaluated in many different ways depending on the split of computations among various devices. On the one hand, different implementations (or algorithms), equivalent from a mathematical perspective, might exhibit significant differences in performance. On the other hand, some of the implementations are likely to show similar performance characteristics. In this paper, we focus on analyzing the performance of a given set of algorithms by clustering them into performance classes. To this end, we use a measurement-based approach to evaluate and score algorithms based on pair-wise comparisons; we refer to this approach as "Relative performance analysis". Each comparison yields one of three outcomes: one algorithm can be "better" than, "worse" than, or "equivalent" to another; algorithms evaluated as having equivalent performance are merged into the same performance class. We show that our clustering methodology facilitates algorithm selection with respect to more than one metric; for instance, from the subset of equivalently fast algorithms, one could then select an algorithm that consumes the least energy on a certain device.
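
The three-way comparison and the merging into performance classes can be sketched as follows. The abstract does not state the comparison rule or the clustering procedure, so both are assumptions here: medians within a relative tolerance `tol` count as "equivalent", and classes are built greedily from a ranking by median. All names and all numbers are hypothetical.

```python
from statistics import median


def compare(times_a, times_b, tol=0.05):
    """Three-way comparison of two sets of timings (lower is better).

    Assumption: medians within a relative tolerance `tol` are equivalent.
    """
    ma, mb = median(times_a), median(times_b)
    if abs(ma - mb) <= tol * min(ma, mb):
        return "equivalent"
    return "better" if ma < mb else "worse"


def performance_classes(measurements, tol=0.05):
    """Group algorithms into ordered performance classes, fastest first."""
    ranked = sorted(measurements, key=lambda name: median(measurements[name]))
    classes = []
    for name in ranked:
        # Join the current class if equivalent to its fastest member.
        if classes and compare(
            measurements[classes[-1][0]], measurements[name], tol
        ) == "equivalent":
            classes[-1].append(name)
        else:
            classes.append([name])
    return classes


# Made-up timings in seconds, purely for illustration.
timings = {
    "alg_a": [1.00, 1.02, 0.99],
    "alg_b": [1.03, 1.01, 1.02],
    "alg_c": [1.50, 1.52, 1.49],
}
classes = performance_classes(timings)

# Made-up energy readings in joules for the fastest class.
energy = {"alg_a": 3.1, "alg_b": 2.4}
pick = min(classes[0], key=energy.get)
```

Selecting within `classes[0]` by a second metric mirrors the use case in the abstract: among equivalently fast algorithms, choose the one that consumes the least energy.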

Preprint · 2016 · arXiv

TTC: A Tensor Transposition Compiler for Multiple Architectures

We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel's Knights Corner, as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general-purpose counterpart significantly. For instance, compared with Intel's latest C++ compiler on the Haswell and Knights Corner architectures, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.
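
For readers unfamiliar with the operation, the sketch below spells out what a tensor transposition computes: a permutation of indices applied to a flat array. It is a plain reference definition under an assumed column-major layout, not TTC's generated code or its interface, and it makes no attempt at the blocking, vectorization, or parallelism that an optimized kernel relies on. The names `strides_col_major` and `transpose` are hypothetical.

```python
from itertools import product


def strides_col_major(shape):
    """Strides (in elements) for a column-major layout: first index fastest."""
    strides, step = [], 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return strides


def transpose(a, shape, perm):
    """Permute the indices of a flat column-major tensor.

    Output dimension d corresponds to input dimension perm[d].
    Returns the permuted flat list and its shape.
    """
    out_shape = [shape[p] for p in perm]
    in_strides = strides_col_major(shape)
    out_strides = strides_col_major(out_shape)
    b = [None] * len(a)
    for idx in product(*(range(n) for n in shape)):
        src = sum(i * s for i, s in zip(idx, in_strides))
        dst = sum(idx[p] * s for p, s in zip(perm, out_strides))
        b[dst] = a[src]
    return b, out_shape


# 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored column-major.
a = [1, 4, 2, 5, 3, 6]
b, b_shape = transpose(a, [2, 3], [1, 0])
```

Every element is read once and written once, so the operation is bound by memory traffic rather than arithmetic, which is why the abstract measures results against peak memory bandwidth.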