Researcher profile

Abhinav Jangda

Abhinav Jangda contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2022arXiv

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

Recent trend towards increasing large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities across such barrier. Breaking this abstraction with a holistic consideration can provide many optimizations to provide performance improvements in distributed workloads. Manually applying these optimizations needs modifications in underlying computation and communication libraries for each scenario, which is time consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine learning aware transformations to optimize a program and a compiler to generate high performance kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNeT enables us to optimize data-, model-and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.

preprint2020arXiv

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

Domain-specific languages that execute image processing pipelineson GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra thread block synchronization, which has a non-trivial cost, 2) they must choose between small tiles that require more overlapped computations or large tiles that increase shared memory access (and lowers occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores overlapped regions in a combination of thread-local registers and shared memory. Thus hybrid tiling either increases occupancy by decreasing shared memory usage or decreases overlapping computations using larger tiles. 3) We present an automatic loop fusion algorithm that considers several factors that affect the performance of GPU kernels. We implement these techniques in PolyMage-GPU, which is a new GPU backend for PolyMage. Our approach produces code that is faster than Halide's manual schedules: 1.65x faster on an NVIDIA GTX 1080Ti and 1.33 faster on an NVIDIA Tesla V100.

preprint2019arXiv

Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code

All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm, a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08x (Firefox) and 2.5x (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.