Researcher profile

Nandakishore Santhi

Nandakishore Santhi contributes to research discovery and scholarly infrastructure.

Researcher, Affiliation not imported, Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified, Verification L1, Unclaimed author
5 works
0 followers
11 topics
4 close collaborators

Published work

5 published items

Preprint, 2022, arXiv

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction with various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA instruction counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results found in this work should guide software developers and hardware architects. Furthermore, the clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.

Preprint, 2022, arXiv

Quantum Netlist Compiler (QNC)

Over the last decade, Quantum Computing hardware has rapidly developed and become a very intriguing, promising, and active research field among scientists worldwide. To achieve the desired quantum functionalities, quantum algorithms require translation from a high-level description to a machine-specific physical operation sequence. In contrast to classical compilers, state-of-the-art quantum compilers are in their infancy. We believe there is a research need for a quantum compiler that can deal with generic unitary operators and generate basic unitary operations according to quantum machines' diverse underlying technologies and characteristics. In this work, we introduce the Quantum Netlist Compiler (QNC) that converts arbitrary unitary operators or desired initial states of quantum algorithms to OpenQASM-2.0 circuits, enabling them to run on actual quantum hardware. Extensive simulations were run on the IBM quantum systems. The results show that QNC is well suited for quantum circuit optimization and produces circuits with competitive success rates in practice.

Preprint, 2020, arXiv
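
As a rough illustration of what translating a desired initial state into an OpenQASM 2.0 circuit can look like, the sketch below covers only the simplest case: one qubit with real, non-negative amplitudes, prepared from |0> with a single `ry` rotation. It is not QNC's algorithm, and the function name `single_qubit_state_to_qasm` is hypothetical.

```python
import math


def single_qubit_state_to_qasm(a0: float, a1: float) -> str:
    """Emit OpenQASM 2.0 that prepares a0|0> + a1|1> starting from |0>.

    Assumption: amplitudes are real and non-negative; they are normalized here.
    """
    if a0 < 0 or a1 < 0:
        raise ValueError("amplitudes must be real and non-negative")
    norm = math.hypot(a0, a1)
    if norm == 0:
        raise ValueError("state vector must be non-zero")
    # ry(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>
    theta = 2.0 * math.atan2(a1 / norm, a0 / norm)
    return "\n".join([
        "OPENQASM 2.0;",
        'include "qelib1.inc";',
        "qreg q[1];",
        f"ry({theta:.12f}) q[0];",
    ])


# Equal superposition of |0> and |1>: the rotation angle is pi/2.
print(single_qubit_state_to_qasm(1.0, 1.0))
```

General multi-qubit states and arbitrary unitaries need a real decomposition procedure, which this sketch does not attempt.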

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

GPUs are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual cost of the power/energy overhead of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various PTX instructions found in modern NVIDIA GPUs. We provide an exhaustive comparison of more than 40 instructions for four high-end NVIDIA GPUs from four different generations (Maxwell, Pascal, Volta, and Turing). Furthermore, we show the effect of the CUDA compiler optimizations on the energy consumption of each instruction. We use three different software techniques to read the GPU on-chip power sensors, which use NVIDIA's NVML API, and provide an in-depth comparison between these techniques. Additionally, we verify the software measurement techniques against a custom-designed hardware power measurement. The results show that Volta GPUs have the best energy efficiency among all the generations for the different categories of the instructions. This work should aid in understanding NVIDIA GPUs' microarchitecture. It should also make energy measurements of any GPU kernel both efficient and accurate.

Preprint, 2010, arXiv
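
To make the step from power readings to per-instruction energy concrete, here is a minimal sketch that integrates sampled power over time and subtracts an idle baseline. It is an assumption about how such a calculation could be arranged, not the paper's procedure; the function names and the sample values are hypothetical, and the samples are taken to be already in watts.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_instruction(kernel_j, idle_j, instruction_count):
    """Subtract an idle baseline over the same duration, then divide by count."""
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return (kernel_j - idle_j) / instruction_count


# Hypothetical samples: a kernel running for 2 s and an idle GPU for 2 s.
kernel = energy_joules([0.0, 1.0, 2.0], [50.0, 60.0, 50.0])
idle = energy_joules([0.0, 2.0], [50.0, 50.0])
per_instr = energy_per_instruction(kernel, idle, 1_000_000)
```

The accuracy of such an estimate depends on the sensor's sampling rate relative to the kernel's duration, which is one reason a comparison against a hardware measurement is useful.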

A Quadratic Time-Space Tradeoff for Unrestricted Deterministic Decision Branching Programs

For a decision problem from coding theory, we prove a quadratic expected time-space tradeoff of the form $\eT\eS=\Omega(\tfrac{n^2}{q})$ for $q$-way deterministic decision branching programs, where $q\geq 2$. Here $\eT$ is the expected computation time and $\eS$ is the expected space, when all inputs are equally likely. This bound is, to our knowledge, the first to show an exponential size requirement whenever $\eT = O(n^2)$. Previous exponential size tradeoffs for Boolean decision branching programs were valid for time-restricted models with $T=o(n\log_2{n})$. Proving quadratic time-space tradeoffs for unrestricted time decision branching programs has been a major goal of recent research -- this goal was already achieved for multiple-output branching programs two decades ago. We also show the first quadratic time-space tradeoffs for Boolean decision branching programs verifying circular convolution, matrix-vector multiplication and the discrete Fourier transform. Furthermore, we demonstrate a constructive Boolean decision function which has a quadratic expected time-space tradeoff in the Boolean deterministic decision branching program model. When $q$ is a constant, the tradeoff results derived here for decision functions verifying various functions are order-comparable to previously known tradeoff bounds for calculating the corresponding multiple-output functions.

Preprint, 2010, arXiv

Understanding Cascading Failures in Power Grids

In the past, we have observed several large blackouts, i.e., loss of power to large areas. It has been noted by several researchers that these large blackouts are a result of a cascade of failures of various components. As a power grid is made up of several thousand or even millions of components (relays, breakers, transformers, etc.), it is quite plausible that a few of these components do not perform their function as desired. Their failure/misbehavior puts an additional burden on the working components, causing them to misbehave and thus leading to a cascade of failures. The complexity of the entire power grid makes it difficult to model each and every individual component and study the stability of the entire system. For this reason, it is often the case that abstract models of the working of the power grid are constructed and then analyzed. These models need to be computationally tractable while serving as a reasonable model for the entire system. In this work, we construct one such model for the power grid, and analyze it.
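
The cascade mechanism described above (failed components shifting their burden onto working ones) can be illustrated with a toy load-redistribution model. The abstract does not specify the paper's model, so everything here is an assumption for illustration only: the equal-share redistribution rule, the function name `simulate_cascade`, and the example numbers.

```python
def simulate_cascade(loads, capacities, initial_failures):
    """Return the set of failed component indices once the cascade settles.

    Toy rule (an assumption): load shed by failed components is split
    equally among all survivors; a survivor fails if load > capacity.
    """
    loads = list(loads)
    failed = set()
    newly_failed = set(initial_failures)
    while newly_failed:
        failed |= newly_failed
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break
        # Spread the load shed in this round equally over the survivors.
        shed = sum(loads[i] for i in newly_failed)
        for i in newly_failed:
            loads[i] = 0.0
        share = shed / len(survivors)
        for i in survivors:
            loads[i] += share
        newly_failed = {i for i in survivors if loads[i] > capacities[i]}
    return failed


# Tight capacity margins: one initial failure takes down every component.
fragile = simulate_cascade([1, 1, 1, 1], [2, 1.2, 1.5, 3], {0})
# Generous margins: the same initial failure stays contained.
robust = simulate_cascade([1, 1, 1, 1], [10, 10, 10, 10], {0})
```

Even this small example shows the qualitative point of the abstract: whether a single failure stays local or spreads depends on how much spare capacity the remaining components have.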