Sivasankaran Rajamanickam

Sivasankaran Rajamanickam appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Hardware Architecture; math.NA; Numerical Analysis; Data Structures and Algorithms; Discrete Mathematics; Mathematical Software

Catalog footprint

What is connected

11 works

7 topics

4 close collaborators

Preprint, 2024, arXiv

Jet: Multilevel Graph Partitioning on Graphics Processing Units

The multilevel heuristic is the dominant strategy for high-quality sequential and parallel graph partitioning. Partition refinement is a key step of multilevel graph partitioning. In this work, we present Jet, a new parallel algorithm for partition refinement specifically designed for Graphics Processing Units (GPUs). We combine Jet with GPU-aware coarsening to develop a $k$-way graph partitioner, the Jet partitioner. The new partitioner achieves superior quality compared to state-of-the-art shared memory partitioners on a large collection of test graphs.

Preprint, 2022, arXiv

Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity

Recently, numerous sparse hardware accelerators for Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), and scientific computing applications have been proposed. A common characteristic among all of these accelerators is that they target tensor algebra (typically matrix multiplications); yet dozens of new accelerators are proposed for every new application. The motivation is that the size and sparsity of the workloads heavily influence which architecture is best for memory and computation efficiency. To satisfy the growing demand for efficient computations across a spectrum of workloads on large data centers, we propose deploying a flexible 'heterogeneous' accelerator, which contains many 'sub-accelerators' (smaller specialized accelerators) working together. To this end, we propose: (1) HARD TACO, a quick and productive C++ to RTL design flow to generate many types of sub-accelerators for sparse and dense computations for fair design-space exploration, (2) AESPA, a heterogeneous sparse accelerator design template constructed with the sub-accelerators generated from HARD TACO, and (3) a suite of scheduling strategies to map tensor kernels onto heterogeneous sparse accelerators with high efficiency and utilization. AESPA with optimized scheduling achieves 1.96X higher performance and 7.9X better energy-delay product (EDP) than a homogeneous EIE-like accelerator with our diverse workload suite.

Preprint, 2022, arXiv

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performance on modern accelerator hardware. This algorithm is implemented using the Kokkos programming model to enable performance portability. We demonstrate the portability of the algorithm and the performance on a variety of architectures (x86/ARM CPUs and NVIDIA/AMD GPUs). The resulting algorithm is also deterministic, producing an identical result for a given input across all of these platforms. The new MIS-2 implementation outperforms implementations in state-of-the-art libraries like CUSP and ViennaCL by 3-8x while producing similar quality results. We further demonstrate the benefits of this approach by developing parallel graph coarsening schemes for two different use cases. First, we develop an algebraic multigrid (AMG) aggregation scheme using parallel MIS-2 and demonstrate the benefits as opposed to previous approaches used in the MueLu multigrid package in Trilinos. We also describe an approach for implementing a parallel multicolor "cluster" Gauss-Seidel preconditioner using this MIS-2 coarsening, and demonstrate better performance with an efficient, parallel, multicolor Gauss-Seidel algorithm.
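
To make the MIS-2 definition concrete, the following is a minimal sequential sketch in Python. It is not the parallel Kokkos algorithm from the abstract; it is a plain greedy pass, and the function name `greedy_mis2` and the adjacency-dictionary input format are assumptions made for illustration. A vertex is selected only if no previously selected vertex lies within two hops of it, so the result is distance-2 independent, and the scan over all vertices makes it maximal.

```python
def greedy_mis2(adj):
    """Greedy distance-2 maximal independent set (sequential illustration only).

    adj: dict mapping each vertex to an iterable of its neighbors (undirected graph).
    Returns the selected vertices in the order they were chosen.
    """
    blocked = set()   # vertices within distance 2 of some selected vertex
    selected = []
    for v in sorted(adj):          # fixed scan order keeps the result deterministic
        if v in blocked:
            continue
        selected.append(v)
        blocked.add(v)
        for u in adj[v]:           # distance-1 neighbors
            blocked.add(u)
            blocked.update(adj[u])  # distance-2 neighbors
    return selected


# Path graph 0-1-2-3-4-5-6: selected vertices end up three hops apart.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
print(greedy_mis2(path))  # [0, 3, 6]
```

In a coarsening setting, each selected vertex could serve as the seed of one aggregate, with the blocked vertices around it assigned to that aggregate.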

Preprint, 2022, arXiv

PGAbB: A Block-Based Graph Processing Framework for Heterogeneous Platforms

Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based programming model. This allows a user to express a graph algorithm using kernels that operate on subgraphs. PGAbB supports graph computations that fit in host DRAM but not in GPU device memory, and provides simple but effective scheduling techniques to schedule computations to all available resources in a heterogeneous architecture. We have demonstrated that one can easily implement a diverse set of graph algorithms in our framework by developing five algorithms. Our experimental results show that PGAbB implementations achieve better or competitive performance compared to hand-optimized implementations. Based on our experiments on five graph algorithms and forty-four graphs, in the median, PGAbB achieves 1.6, 1.6, 5.7, 3.4, 4.5, and 2.4 times better performance than GAPBS, Galois, Ligra, LAGraph, Galois-GPU, and Gunrock graph processing systems, respectively.

Preprint, 2022, arXiv

Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators

Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by mapping optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design-space of dataflow choices for running GNNs on spatial accelerators in order for mappers or design-space exploration tools to optimize the dataflow based on the workload. Specifically, we propose a taxonomy to describe all possible choices for mapping the dense and sparse phases of GNN inference, spatially and temporally over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep-dives into the costs and benefits of several dataflows and perform case studies on the implications of hardware parameters for dataflows and the value of flexibility to support pipelined execution.

Preprint, 2020, arXiv

An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

We propose a new algorithm for the fast solution of large, sparse, symmetric positive-definite linear systems, spaND -- sparsified Nested Dissection. It is based on nested dissection, sparsification and low-rank compression. After eliminating all interiors at a given level of the elimination tree, the algorithm sparsifies all separators corresponding to the interiors. This operation reduces the size of the separators by eliminating some degrees of freedom but without introducing any fill-in. This is done at the expense of a small and controllable approximation error. The result is an approximate factorization that can be used as an efficient preconditioner. We then perform several numerical experiments to evaluate this algorithm. We demonstrate that a version using orthogonal factorization and block-diagonal scaling takes fewer CG iterations to converge than previous similar algorithms on various kinds of problems. Furthermore, this algorithm is provably guaranteed to never break down and the matrix stays symmetric positive-definite throughout the process. We evaluate the algorithm on some large problems and show it exhibits near-linear scaling. The factorization time is roughly O(N) and the number of iterations grows slowly with N.

Preprint, 2020, arXiv

Asynchronous One-Level and Two-Level Domain Decomposition Solvers

Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes on more heterogeneous systems make load balancing and network layout very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchronous communication based on one-sided MPI primitives in a multitude of domain decomposition solvers. In particular, a scalable asynchronous two-level method is presented. We discuss practical issues encountered in the development of a scalable solver and show experimental results obtained on state-of-the-art supercomputer systems that illustrate the benefits of asynchronous solvers in load balanced as well as load imbalanced scenarios. Using the novel method, we can observe speed-ups of up to 4x over its classical synchronous equivalent.

Preprint, 2017, arXiv

Distributed Graph Layout for Scalable Small-world Network Analysis

The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we present DGL, a fast, parallel, and memory-efficient distributed graph layout strategy that is specifically designed for small-world networks (low-diameter graphs with skewed vertex degree distributions). Label propagation-based partitioning and a scalable BFS-based ordering are the main steps in the layout strategy. We show that the DGL layout can significantly improve end-to-end performance of five challenging graph analytics workloads: PageRank, a parallel subgraph enumeration program, tuned implementations of breadth-first search and single-source shortest paths, and RDF3X-MPI, a distributed SPARQL query processing engine. Using these benchmarks, we additionally offer a comprehensive analysis on how graph layout affects the performance of graph analytics with variable computation and communication characteristics.
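
The BFS-based ordering step mentioned above can be illustrated with a small single-process sketch in Python. This is not the DGL implementation, which is distributed and parallel; the function names `bfs_order` and `relabel`, the adjacency-dictionary format, and the assumption of a connected graph are all made for illustration. The idea shown is only that vertices are renumbered in the order a breadth-first search visits them, so that vertices close together in the graph receive nearby identifiers.

```python
from collections import deque


def bfs_order(adj, root):
    """Return vertices in breadth-first visit order starting from root.

    Assumes a connected undirected graph given as {vertex: [neighbors]}.
    """
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v]):   # sorted neighbors give a reproducible order
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order


def relabel(adj, order):
    """Renumber vertices so that the i-th visited vertex gets identifier i."""
    new_id = {old: new for new, old in enumerate(order)}
    return {new_id[v]: sorted(new_id[u] for u in adj[v]) for v in order}


# A path 0-3-1-2 whose original identifiers do not follow the path.
graph = {0: [3], 1: [2, 3], 2: [1], 3: [0, 1]}
order = bfs_order(graph, 0)
print(order)                  # [0, 3, 1, 2]
print(relabel(graph, order))  # {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

After relabeling, every edge in this example connects consecutive identifiers, which is the kind of locality a reordering step aims for.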

Preprint, 2016, arXiv

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of 5.91x on CPU (16 cores) and 7.4x on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as 53x on CPU (16 cores) and 13.3x on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4x speedup on a challenging matrix sequence taken from an actual Xyce simulation.

Preprint, 2016, arXiv

Partitioning Trillion-edge Graphs in Minutes

We introduce XtraPuLP, a new distributed-memory graph partitioner designed to process trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated as a viable means to produce high quality partitions with minimal computation time. On a collection of large sparse graphs, we show that XtraPuLP partitioning quality is comparable to state-of-the-art partitioning methods. We also demonstrate that XtraPuLP can produce partitions of real-world graphs with billion+ vertices in minutes. Further, we show that using XtraPuLP partitions for distributed-memory graph analytics leads to significant end-to-end execution time reduction.
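
As a rough illustration of label propagation used for partitioning, here is a small sequential Python sketch. It is not XtraPuLP's algorithm, which is distributed and more elaborate; the function names, the `max_size` balance rule, and the sweep limit are assumptions chosen to keep the example short. Each vertex moves to the part that holds strictly more of its neighbors than its current part, provided the target part still has room.

```python
from collections import Counter


def edge_cut(adj, part):
    """Count undirected edges whose endpoints lie in different parts."""
    return sum(1 for v in adj for u in adj[v] if u > v and part[u] != part[v])


def label_propagation_partition(adj, part, k, max_size, max_sweeps=10):
    """Size-constrained label propagation (sequential illustration only).

    adj: {vertex: [neighbors]}; part: {vertex: initial part id in range(k)}.
    max_size: hypothetical balance limit on the number of vertices per part.
    """
    part = dict(part)                 # do not modify the caller's assignment
    sizes = Counter(part.values())
    for _ in range(max_sweeps):
        moved = False
        for v in sorted(adj):
            counts = Counter(part[u] for u in adj[v])  # neighbors per part
            current = part[v]
            best = current
            for p in range(k):
                # move only for a strict gain and only if the target has room
                if counts[p] > counts[best] and sizes[p] < max_size:
                    best = p
            if best != current:
                sizes[current] -= 1
                sizes[best] += 1
                part[v] = best
                moved = True
        if not moved:                 # no vertex changed part: converged
            break
    return part


# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {v: v % 2 for v in graph}     # deliberately poor initial partition
final = label_propagation_partition(graph, start, k=2, max_size=4)
print(edge_cut(graph, start), edge_cut(graph, final))  # 5 1
print(final)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

The strict-gain rule avoids oscillation on ties, and the size limit stands in for the balance constraints a real partitioner enforces.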

Preprint, 2016, arXiv

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are interrelated through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that, using 56 threads on the Intel Xeon Phi processor, our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky, which does not carry tasking overhead, for sparse matrices arising from various application problems.
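
For readers unfamiliar with incomplete Cholesky, the following is a minimal sequential sketch of the zero-fill variant, IC(0), in Python. It is not the task-parallel, block-based algorithm described in the abstract, and it stores the matrix densely for readability; the function name `ic0` and the choice of the zero-fill variant are assumptions made for illustration. The factor is restricted to the nonzero pattern of the lower triangle of the input, so any fill-in that an exact Cholesky factorization would create is simply dropped.

```python
import math


def ic0(a):
    """Zero-fill incomplete Cholesky of a symmetric positive-definite matrix.

    a: square matrix as a list of lists. Returns a lower-triangular factor
    whose nonzeros are confined to the nonzero pattern of a's lower triangle.
    """
    n = len(a)
    # pattern[i][j] records which lower-triangular entries are allowed to be nonzero
    pattern = [[j <= i and a[i][j] != 0 for j in range(n)] for i in range(n)]
    low = [[float(a[i][j]) if pattern[i][j] else 0.0 for j in range(n)]
           for i in range(n)]
    for k in range(n):
        low[k][k] = math.sqrt(low[k][k])
        for i in range(k + 1, n):
            if pattern[i][k]:
                low[i][k] /= low[k][k]          # scale column k
        for j in range(k + 1, n):
            if not pattern[j][k]:
                continue
            for i in range(j, n):
                # update only entries inside the pattern; fill-in is dropped
                if pattern[i][j] and pattern[i][k]:
                    low[i][j] -= low[i][k] * low[j][k]
    return low


# Tridiagonal input: exact Cholesky creates no fill, so IC(0) is exact here.
tri = [[4, 2, 0], [2, 5, 2], [0, 2, 5]]
print(ic0(tri))  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 1.0, 2.0]]
```

For a matrix whose exact factor would have fill-in, such as an arrow-shaped matrix with a dense first row and column, the same routine leaves the corresponding entries at zero, which is what makes the result an approximate factorization suitable as a preconditioner.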
