Source author record

Keshav Pingali

Keshav Pingali appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Machine Learning; Programming Languages; Computational Engineering, Finance, and Science; Robotics; Social and Information Networks

Catalog footprint

What is connected

10 works

7 topics

4 close collaborators

Preprint, 2026, arXiv

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

Preprint, 2022, arXiv

Scalable Hypergraph Embedding System

Many problems in network data, such as node classification and link prediction, can be solved using graph embeddings. However, it is difficult to use graphs to capture non-binary relations such as communities of nodes. These kinds of complex relations are expressed more naturally as hypergraphs. While hypergraphs are a generalization of graphs, state-of-the-art graph embedding techniques are not adequate for solving prediction and classification tasks on large hypergraphs accurately in reasonable time. In this paper, we introduce HyperNetVec, a novel hierarchical framework for scalable unsupervised hypergraph embedding. HyperNetVec exploits shared-memory parallelism and is capable of generating high-quality embeddings for real-world hypergraphs with millions of nodes and hyperedges in only a couple of minutes, while existing hypergraph systems either fail for such large hypergraphs or may take days to produce the embeddings.

Preprint, 2021, arXiv

Supermodeling of tumor dynamics with parallel isogeometric analysis solver

Supermodeling is a modern model-ensembling paradigm that integrates several self-synchronized imperfect sub-models by controlling a few meta-parameters to generate more accurate predictions of complex systems' dynamics. Continual synchronization between sub-models allows for trajectory predictions with superior accuracy compared to a single model or a classical ensemble of independent models whose decision fusion is based on majority voting or averaging of the outcomes. However, numerous observations indicate that the convergence of the supermodeling procedure depends on a few principal factors, such as (1) the number of sub-models, (2) their proper selection, and (3) the choice of a convergent optimization procedure that assimilates the supermodel meta-parameters to data. Herein, we focus on modeling the evolution of a system described by a set of PDEs. We prove that supermodeling is conditionally convergent to a fixed-point attractor with respect to only the supermodel meta-parameters. We investigate the formal conditions for the convergence of the supermodeling scheme theoretically. We apply the Banach fixed-point theorem to the supermodeling correction operator, which iteratively updates the values of the synchronization constants. The "nudging" of the supermodel to the ground truth should be well balanced, because both too weak and too strong attraction to data cause the supermodel to desynchronize. The time-step size can control the convergence of the training procedure by balancing the Lipschitz continuity constant of the PDE operator. All the sub-models have to be close to the ground truth along the training trajectory but still sufficiently diverse to explore the phase space better. As an example, we discuss a three-dimensional supermodel of tumor evolution to demonstrate the supermodel's perfect fit to artificial data generated from real medical images.
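The fixed-point argument in this abstract can be illustrated with a toy iteration. The sketch below is not the supermodeling correction operator: `fixed_point` and the scalar map `T` are hypothetical names chosen for illustration, and `T` is simply assumed to be a contraction with Lipschitz constant 0.5, which is the condition under which the Banach fixed-point theorem guarantees convergence.

```python
def fixed_point(T, x0, tol=1e-10, max_iter=1000):
    """Iterate x <- T(x) until successive values differ by less than tol."""
    x = x0
    for step in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next, step + 1
        x = x_next
    raise RuntimeError("iteration did not converge within max_iter steps")


# Hypothetical contraction: |T(a) - T(b)| = 0.5 * |a - b|, fixed point at 2.0.
def T(c):
    return 0.5 * c + 1.0


value, steps = fixed_point(T, x0=0.0)
print(f"fixed point ~ {value:.6f} after {steps} iterations")
```

If the map were not a contraction (for example, a slope of 2 instead of 0.5), the same loop would diverge, which is the toy analogue of the balance condition the abstract describes.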

Preprint, 2020, arXiv

A Fine-Grained Hybrid CPU-GPU Algorithm for Betweenness Centrality Computations

Betweenness centrality (BC) is an important graph analytical application for large-scale graphs. While there have been many efforts to parallelize betweenness centrality algorithms on multi-core CPUs and many-core GPUs, in this work we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into CPU and GPU partitions and performs BC computations for the graph on both the CPU and GPU resources simultaneously with a very small number of CPU-GPU communications. The forward phase in our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We also perform a novel hybrid and asynchronous backward phase that performs minimal CPU-GPU synchronizations. Evaluations using a large number of graphs with different characteristics show that our hybrid approach gives an 80% improvement in performance and 80-90% fewer CPU-GPU communications than an existing hybrid algorithm based on the popular Bulk Synchronous Paradigm (BSP) approach.
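For readers unfamiliar with the forward and backward phases mentioned here, the following is a minimal sequential sketch of the standard Brandes formulation for unweighted graphs. It is not the hybrid CPU-GPU algorithm from the paper; the function name `betweenness` and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def betweenness(adj):
    """Brandes-style BC for an unweighted graph given as {node: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Undirected path a - b - c, stored with both edge directions.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))
```

Because the outer loop runs one independent traversal per source, many sources can be processed together, which is the multi-source property the abstract refers to. In this sketch each undirected pair is counted in both directions, so scores are twice the usual undirected values.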

Preprint, 2020, arXiv

An Adaptive Load Balancer For Graph Analytical Applications on GPUs

Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel load balancing scheme to address this problem. Our scheme is implemented in the IrGL compiler to allow users to generate efficient load balanced code for a GPU from high-level sequential programs. We evaluated several graph analytics applications on up to 16 distributed GPUs using IrGL to compile the code and the Gluon substrate for inter-GPU communication. Our experiments show that this scheme can achieve an average speed-up of 2.2x on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.

Preprint, 2020, arXiv

Pangolin: An Efficient and Flexible Graph Pattern Mining System on CPU and GPU

There is growing interest in graph pattern mining (GPM) problems such as motif counting. GPM systems have been developed to provide unified interfaces for programming algorithms for these problems and for running them on parallel systems. However, existing systems may take hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability. We present Pangolin, a high-performance and flexible in-memory GPM framework targeting shared-memory CPUs and GPUs. Pangolin is the first GPM system that provides high-level abstractions for GPU processing. It provides a simple programming interface based on the extend-reduce-filter model, which enables users to specify application-specific knowledge for search space pruning and isomorphism test elimination. We describe novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization. Evaluation on a 28-core CPU demonstrates that Pangolin outperforms existing GPM frameworks Arabesque, RStream, and Fractal by 49x, 88x, and 80x on average, respectively. Acceleration on a V100 GPU further improves performance of Pangolin by 15x on average. Compared to state-of-the-art hand-optimized GPM applications, Pangolin provides competitive performance with less programming effort.
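The extend-reduce-filter model named in this abstract can be sketched on triangle counting. The code below does not use Pangolin's interface: `count_triangles` and the dictionary-of-sets graph format are hypothetical, and the three stages are written as plain list comprehensions to show the idea only.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph given as {vertex: set(neighbors)}."""
    # Start from single-edge embeddings (u, v), keeping one orientation per edge.
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # Extend: grow each embedding by one neighbor of its last vertex.
    extended = [(u, v, w) for (u, v) in edges for w in adj[v]]
    # Filter: keep only the canonical order u < v < w (prunes duplicates)
    # and require the closing edge w-u so the embedding is a triangle.
    triangles = [(u, v, w) for (u, v, w) in extended if v < w and u in adj[w]]
    # Reduce: aggregate the surviving embeddings into a count.
    return len(triangles)


# One triangle {0, 1, 2} plus a pendant edge 2 - 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(count_triangles(graph))
```

The ordering test in the filter stage is a small example of application-specific pruning: each triangle is produced once, so no separate isomorphism check is needed.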

Preprint, 2020, arXiv

Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory

Intel Optane DC Persistent Memory (Optane PMM) is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this paper, we present key runtime and algorithmic principles to consider when performing graph analytics on extreme-scale graphs on large-memory platforms of this sort. To demonstrate the importance of these principles, we evaluate four existing shared-memory graph frameworks on large real-world web-crawls, using a machine with 6TB of Optane PMM. Our results show that frameworks based on the runtime and algorithmic principles advocated in this paper (i) perform significantly better than the others, and (ii) are competitive with graph analytics frameworks running on large production clusters.

Preprint, 2016, arXiv

Adaptive Work-Efficient Connected Components on the GPU

This report presents an adaptive work-efficient approach for implementing the Connected Components algorithm on GPUs. The results show a considerable increase in performance (up to 6.8x) over current state-of-the-art solutions.

Preprint, 2016, arXiv

Lowering IrGL to CUDA

The IrGL intermediate representation is an explicitly parallel representation for irregular programs that targets GPUs. In this report, we describe IrGL constructs, examples of their use and how IrGL is compiled to CUDA by the Galois GPU compiler.

Preprint, 2012, arXiv

Processor Allocation for Optimistic Parallelization of Irregular Programs

Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this problem. Our approach is based on a construct called the conflict graph, which (i) provides insight into the amount of parallelism that can be extracted from an irregular algorithm, and (ii) can be used to address the processor allocation problem for irregular algorithms. We show that this problem is related to a generalization of the unfriendly seating problem and, by extending Turán's theorem, we obtain a worst-case class of problems for optimistic parallelization, which we use to derive a lower bound on the exploitable parallelism. Finally, using some theoretically derived properties and some experimental facts, we design a quick and stable control strategy for solving the processor allocation problem heuristically.
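The role of the conflict graph can be illustrated with a small sketch. It assumes tasks are integer ids and that `conflicts` maps each task to the set of tasks it interferes with; both function names are hypothetical. The bound uses the classical form of Turán's theorem (a graph with n vertices and average degree d has an independent set of size at least n/(d+1)), not the extension derived in the paper, and the greedy selection is a generic heuristic rather than the paper's control strategy.

```python
def greedy_independent_set(conflicts):
    """Pick tasks with the fewest remaining conflicts, discarding their neighbors."""
    remaining = {t: set(ns) for t, ns in conflicts.items()}
    chosen = []
    while remaining:
        # Ties are broken by task id so the result is deterministic.
        t = min(remaining, key=lambda x: (len(remaining[x]), x))
        chosen.append(t)
        removed = remaining[t] | {t}
        for r in removed:
            remaining.pop(r, None)
        for ns in remaining.values():
            ns -= removed
    return chosen


def turan_lower_bound(conflicts):
    """Classical Turan bound n / (average degree + 1) on the independence number."""
    n = len(conflicts)
    avg_degree = sum(len(ns) for ns in conflicts.values()) / n
    return n / (avg_degree + 1)


# Four tasks whose conflicts form a path 0 - 1 - 2 - 3.
conflicts = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
runnable = greedy_independent_set(conflicts)
print(runnable, turan_lower_bound(conflicts))
```

Tasks in `runnable` have no conflicts with one another, so they could run concurrently without aborts. The size of such a set is one rough indication of how many processors are worth allocating at that point in the execution.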