Source author record

Abhinav Jangda

Abhinav Jangda appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Programming Languages Distributed, Parallel, and Cluster Computing Machine Learning

Catalog footprint

What is connected

4works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

Recent trend towards increasing large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities across such barrier. Breaking this abstraction with a holistic consideration can provide many optimizations to provide performance improvements in distributed workloads. Manually applying these optimizations needs modifications in underlying computation and communication libraries for each scenario, which is time consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine learning aware transformations to optimize a program and a compiler to generate high performance kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNeT enables us to optimize data-, model-and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.

preprint2020arXiv

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

Domain-specific languages that execute image processing pipelineson GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra thread block synchronization, which has a non-trivial cost, 2) they must choose between small tiles that require more overlapped computations or large tiles that increase shared memory access (and lowers occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores overlapped regions in a combination of thread-local registers and shared memory. Thus hybrid tiling either increases occupancy by decreasing shared memory usage or decreases overlapping computations using larger tiles. 3) We present an automatic loop fusion algorithm that considers several factors that affect the performance of GPU kernels. We implement these techniques in PolyMage-GPU, which is a new GPU backend for PolyMage. Our approach produces code that is faster than Halide's manual schedules: 1.65x faster on an NVIDIA GTX 1080Ti and 1.33 faster on an NVIDIA Tesla V100.

preprint2019arXiv

Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code

All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm, a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08x (Firefox) and 2.5x (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.

preprint2015arXiv

Block-Level Parallelism in Parsing Block Structured Languages

Softwares source code is becoming large and complex. Compilation of large base code is a time consuming process. Parallel compilation of code will help in reducing the time complexity. Parsing is one of the phases in compiler in which significant amount of time of compilation is spent. Techniques have already been developed to extract the parallelism available in parser. Current LR(k) parallel parsing techniques either face difficulty in creating Abstract Syntax Tree or requires modification in the grammar or are specific to less expressive grammars. Most of the programming languages like C, ALGOL are block-structured, and in most languages grammars the grammar of different blocks is independent, allowing different blocks to be parsed in parallel. We are proposing a block level parallel parser derived from Incremental Jump Shift Reduce Parser by [13]. Block Parallelized Parser (BPP) can even work as a block parallel incremental parser. We define a set of Incremental Categories and create the partitions of a grammar based on a rule. When parser reaches the start of the block symbol it will check whether the current block is related to any incremental category. If block parallel parser find the incremental category for it, parser will parse the block in parallel. Block parallel parser is developed for LR(1) grammar. Without making major changes in Shift Reduce (SR) LR(1) parsing algorithm, block parallel parser can create an Abstract Syntax tree easily. We believe this parser can be easily extended to LR (k) grammars and also be converted to an LALR (1) parser. We implemented BPP and SR LR(1) parsing algorithm for C Programming Language. We evaluated performance of both techniques by parsing 10 random files from Linux Kernel source. BPP showed 28% and 52% improvement in the case of including header files and excluding header files respectively.