Source author record

Nandakishore Santhi

Nandakishore Santhi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Discrete Mathematics; Information Theory; math.IT; Applications; Computational Complexity; Distributed, Parallel, and Cluster Computing; Hardware Architecture; math.PR; Performance; quant-ph; Social and Information Networks

Catalog footprint

What is connected

6 works

11 topics

4 close collaborators


preprint · 2022 · arXiv

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results found in this work should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.

preprint · 2022 · arXiv

Quantum Netlist Compiler (QNC)

Over the last decade, quantum computing hardware has developed rapidly, and quantum computing has become an intriguing, promising, and active research field among scientists worldwide. To achieve the desired quantum functionalities, quantum algorithms require translation from a high-level description to a machine-specific physical operation sequence. In contrast to classical compilers, state-of-the-art quantum compilers are in their infancy. We believe there is a research need for a quantum compiler that can deal with generic unitary operators and generate basic unitary operations according to quantum machines' diverse underlying technologies and characteristics. In this work, we introduce the Quantum Netlist Compiler (QNC), which converts arbitrary unitary operators or desired initial states of quantum algorithms to OpenQASM-2.0 circuits, enabling them to run on actual quantum hardware. Extensive simulations were run on the IBM quantum systems. The results show that QNC is well suited for quantum circuit optimization and produces circuits with competitive success rates in practice.
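To make the idea of turning a desired initial state into an OpenQASM 2.0 circuit concrete, here is a minimal Python sketch for the single-qubit case only. It is not QNC's algorithm, which the abstract does not describe. The function names `state_to_u3_angles` and `single_qubit_state_to_qasm` are hypothetical. The sketch relies on the standard `u3` gate from `qelib1.inc`, for which u3(theta, phi, 0) applied to |0> gives cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>.

```python
import cmath
import math


def state_to_u3_angles(a, b):
    """Return (theta, phi) such that u3(theta, phi, 0)|0> equals
    a|0> + b|1> after normalization, up to a global phase."""
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    if norm == 0:
        raise ValueError("state vector must be non-zero")
    a, b = a / norm, b / norm
    theta = 2.0 * math.atan2(abs(b), abs(a))
    # The relative phase only matters when both amplitudes are non-zero.
    if abs(a) > 0 and abs(b) > 0:
        phi = cmath.phase(b) - cmath.phase(a)
    else:
        phi = 0.0
    return theta, phi


def single_qubit_state_to_qasm(a, b):
    """Emit an OpenQASM 2.0 program preparing the state from |0>.
    Hypothetical helper for illustration; single qubit only."""
    theta, phi = state_to_u3_angles(a, b)
    lines = [
        "OPENQASM 2.0;",
        'include "qelib1.inc";',
        "qreg q[1];",
        f"u3({theta!r},{phi!r},0) q[0];",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prepare (|0> + i|1>) / sqrt(2); amplitudes are normalized internally.
    print(single_qubit_state_to_qasm(1, 1j))
```

A multi-qubit state or a general unitary needs a genuine decomposition into one- and two-qubit gates, which is the problem a compiler such as QNC addresses; this sketch only shows the shape of the input and output.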

preprint · 2020 · arXiv

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

GPUs are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual power/energy cost of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various PTX instructions found in modern NVIDIA GPUs. We provide an exhaustive comparison of more than 40 instructions for four high-end NVIDIA GPUs from four different generations (Maxwell, Pascal, Volta, and Turing). Furthermore, we show the effect of the CUDA compiler optimizations on the energy consumption of each instruction. We use three different software techniques, all based on NVIDIA's NVML API, to read the GPU on-chip power sensors, and we provide an in-depth comparison between these techniques. Additionally, we verify the software measurement techniques against a custom-designed hardware power measurement. The results show that Volta GPUs have the best energy efficiency among the generations studied across the different categories of instructions. This work should aid in understanding NVIDIA GPUs' microarchitecture. It should also make energy measurements of any GPU kernel both efficient and accurate.

preprint · 2010 · arXiv

A Quadratic Time-Space Tradeoff for Unrestricted Deterministic Decision Branching Programs

For a decision problem from coding theory, we prove a quadratic expected time-space tradeoff of the form $\eT\eS=\Omega(\tfrac{n^2}{q})$ for $q$-way deterministic decision branching programs, where $q\geq 2$. Here $\eT$ is the expected computation time and $\eS$ is the expected space, when all inputs are equally likely. This bound is, to our knowledge, the first to show an exponential size requirement whenever $\eT = O(n^2)$. Previous exponential size tradeoffs for Boolean decision branching programs were valid for time-restricted models with $T=o(n\log_2{n})$. Proving quadratic time-space tradeoffs for unrestricted time decision branching programs has been a major goal of recent research; this goal was achieved for multiple-output branching programs two decades ago. We also show the first quadratic time-space tradeoffs for Boolean decision branching programs verifying circular convolution, matrix-vector multiplication and discrete Fourier transform. Furthermore, we demonstrate a constructive Boolean decision function which has a quadratic expected time-space tradeoff in the Boolean deterministic decision branching program model. When $q$ is a constant, the tradeoff results derived here for decision functions verifying various functions are order-comparable to previously known tradeoff bounds for calculating the corresponding multiple-output functions.

preprint · 2010 · arXiv

Understanding Cascading Failures in Power Grids

In the past, we have observed several large blackouts, i.e., loss of power to large areas. Several researchers have noted that these large blackouts result from a cascade of failures of various components. As a power grid is made up of several thousand or even millions of components (relays, breakers, transformers, etc.), it is quite plausible that a few of these components do not perform their function as desired. Their failure or misbehavior puts an additional burden on the working components, causing them to misbehave in turn and thus leading to a cascade of failures. The complexity of the entire power grid makes it difficult to model every individual component and study the stability of the entire system. For this reason, abstract models of the workings of the power grid are often constructed and then analyzed. These models need to be computationally tractable while serving as a reasonable model for the entire system. In this work, we construct one such model for the power grid and analyze it.
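The cascade mechanism described above, where failed components shift their burden onto the survivors, can be illustrated with a toy load-redistribution model. This is a generic sketch and not the model constructed in the paper, which the abstract does not specify. The function name `simulate_cascade` and the rule that shed load is split equally among all surviving components are assumptions made for illustration.

```python
def simulate_cascade(loads, capacities, initial_failures):
    """Toy cascade: each failed component sheds its load equally onto all
    survivors; any survivor pushed above its capacity fails next round.
    Returns the set of failed component indices once the cascade settles."""
    loads = list(loads)
    failed = set()
    newly_failed = set(initial_failures)
    while newly_failed:
        # Total load released by the components failing in this round.
        shed = sum(loads[i] for i in newly_failed)
        for i in newly_failed:
            loads[i] = 0.0
        failed |= newly_failed
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break  # total blackout: nothing left to absorb the load
        share = shed / len(survivors)
        for i in survivors:
            loads[i] += share
        # Survivors now above capacity fail in the next round.
        newly_failed = {i for i in survivors if loads[i] > capacities[i]}
    return failed


if __name__ == "__main__":
    loads = [1.0, 1.0, 1.0, 1.0]
    # Generous margins: the single failure is absorbed.
    print(simulate_cascade(loads, [2.0, 2.0, 2.0, 2.0], {0}))
    # Tight margins: the same single failure takes down every component.
    print(simulate_cascade(loads, [1.2, 1.3, 1.6, 3.5], {0}))
```

Even this crude rule shows the qualitative behavior the abstract refers to: whether one failure stays local or spreads depends on how much spare capacity the remaining components have.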

preprint · 2007 · arXiv

On Algebraic Decoding of $q$-ary Reed-Muller and Product-Reed-Solomon Codes

We consider a list decoding algorithm recently proposed by Pellikaan-Wu \cite{PW2005} for $q$-ary Reed-Muller codes $\mathcal{RM}_q(\ell, m, n)$ of length $n \leq q^m$ when $\ell \leq q$. A simple and easily accessible correctness proof is given which shows that this algorithm achieves a relative error-correction radius of $\tau \leq (1 - \sqrt{\ell q^{m-1}/n})$. This is an improvement over the proof using one-point Algebraic-Geometric codes given in \cite{PW2005}. The described algorithm can be adapted to decode Product-Reed-Solomon codes. We then propose a new low complexity recursive algebraic decoding algorithm for Reed-Muller and Product-Reed-Solomon codes. Our algorithm achieves a relative error correction radius of $\tau \leq \prod_{i=1}^m (1 - \sqrt{k_i/q})$. This technique is then proved to outperform the Pellikaan-Wu method in both complexity and error correction radius over a wide range of code rates.
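The two radius expressions quoted in the abstract are easy to evaluate numerically. The Python sketch below simply computes each bound for given parameters. The function names and the sample parameter values are hypothetical, and the sketch implements no decoder. It also assumes that $k_i$ denotes the dimension of the $i$-th component Reed-Solomon code, which the abstract does not define.

```python
import math


def rm_radius_bound(q, m, n, ell):
    """Evaluate 1 - sqrt(ell * q^(m-1) / n), the relative radius quoted for
    the Pellikaan-Wu list decoder on RM_q(ell, m, n) with n <= q^m, ell <= q."""
    if not (0 < n <= q ** m and 0 < ell <= q):
        raise ValueError("need 0 < n <= q^m and 0 < ell <= q")
    return 1.0 - math.sqrt(ell * q ** (m - 1) / n)


def product_rs_radius_bound(ks, q):
    """Evaluate prod_i (1 - sqrt(k_i / q)), the relative radius quoted for
    the recursive decoder. ks holds one k_i per coordinate (assumed to be
    the component code dimensions)."""
    radius = 1.0
    for k in ks:
        if not 0 < k <= q:
            raise ValueError("need 0 < k_i <= q")
        radius *= 1.0 - math.sqrt(k / q)
    return radius


if __name__ == "__main__":
    # Hypothetical parameters chosen only to make the arithmetic easy to check.
    print(rm_radius_bound(q=16, m=2, n=256, ell=4))   # 1 - sqrt(64/256)
    print(product_rs_radius_bound(ks=[4, 4], q=16))   # (1 - sqrt(4/16))^2
```

The two calls use different code families and parameterizations, so their outputs should not be read as a comparison between the two decoders; the abstract's comparison is made over code rates.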