Source author record

Sandeep Gupta

Sandeep Gupta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Databases; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Computational Complexity; Computational Geometry; Computer Science and Game Theory; Computer Vision; Discrete Mathematics; eess.SY; Hardware Architecture; Machine Learning; Networking and Internet Architecture; physics.ins-det; Systems and Control

Catalog footprint

What is connected

9 works

14 topics

4 close collaborators

preprint, 2026, arXiv

Hardware Acceleration for Neural Networks: A Comprehensive Survey

Neural networks have become dominant computational workloads across cloud and edge platforms, but their rapid growth in model size and deployment diversity has exposed hardware bottlenecks increasingly dominated by memory movement, communication, and irregular operators rather than peak arithmetic throughput. This survey reviews the current technology landscape for hardware acceleration of deep learning, spanning GPUs and tensor-core architectures, domain-specific accelerators (TPUs, NPUs), FPGA-based designs, ASIC inference engines, and emerging LLM-serving accelerators such as LPUs, alongside in-/near-memory computing and neuromorphic/analog approaches. We organize the survey using a unified taxonomy across (i) workloads (CNNs, RNNs, GNNs, Transformers/LLMs), (ii) execution settings (training vs. inference; datacenter vs. edge), and (iii) optimization levers (reduced precision, sparsity and pruning, operator fusion, compilation and scheduling, memory-system/interconnect design). We synthesize key architectural ideas such as systolic arrays, vector and SIMD engines, specialized attention and softmax kernels, quantization-aware datapaths, and high-bandwidth memory, and discuss how software stacks and compilers bridge model semantics to hardware. Finally, we highlight open challenges -- including efficient long-context LLM inference (KV-cache management), robust support for dynamic and sparse workloads, energy- and security-aware deployment, and fair benchmarking -- pointing to promising directions for the next generation of neural acceleration.

preprint, 2026, arXiv

XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging

Explainability, domain generalization, and rare-class reliability are critical challenges in medical AI, where deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions. This paper introduces XAI-MeD, an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro-symbolic architecture. XAI-MeD is designed to improve robustness under distribution shift, enhance rare-class sensitivity, and deliver transparent, clinically aligned interpretations. The framework encodes clinical expertise as logical connectives over atomic medical propositions, transforming them into machine-checkable, class-specific rules. Their diagnostic utility is quantified through weighted feature satisfaction scores, enabling a symbolic reasoning branch that complements neural predictions. A confidence-weighted fusion integrates symbolic and deep outputs, while a Hunt-inspired adaptive routing mechanism guided by Entropy Imbalance Gain (EIG) and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty. We evaluate XAI-MeD across diverse modalities on four challenging tasks, including (i) Seizure Onset Zone (SOZ) localization from rs-fMRI and (ii) Diabetic Retinopathy grading across 6 multicenter datasets. The evaluation demonstrates substantial performance improvements, including 6 percent gains in cross-domain generalization and a 10 percent improved rare-class F1 score, far outperforming state-of-the-art deep learning baselines. Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers, ensuring robustness to distribution shifts. XAI-MeD thus provides a principled, clinically faithful, and interpretable approach to multimodal medical AI.

preprint, 2025, arXiv

Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics

Physical AI at the edge -- enabling autonomous systems to understand and predict real-world dynamics in real time -- requires hardware-efficient learning and inference. Model recovery (MR), which identifies governing equations from sensor data, is a key primitive for safe and explainable monitoring in mission-critical autonomous systems operating under strict latency, compute, and power constraints. However, state-of-the-art MR methods (e.g., EMILY and PINN+SR) rely on Neural ODE formulations that require iterative solvers and are difficult to accelerate efficiently on edge hardware. We present MERINDA (Model Recovery in Reconfigurable Dynamic Architecture), an FPGA-accelerated MR framework designed to make physical AI practical on resource-constrained devices. MERINDA replaces expensive Neural ODE components with a hardware-friendly formulation that combines (i) GRU-based discretized dynamics, (ii) dense inverse-ODE layers, (iii) sparsity-driven dropout, and (iv) lightweight ODE solvers. The resulting computation is structured for streaming parallelism, enabling critical kernels to be fully parallelized on the FPGA. Across four benchmark nonlinear dynamical systems, MERINDA delivers substantial gains over GPU implementations: 114× lower energy (434 J vs. 49,375 J), 28× smaller memory footprint (214 MB vs. 6,118 MB), and 1.68× faster training, while matching state-of-the-art model-recovery accuracy. These results demonstrate that MERINDA can bring accurate, explainable MR to the edge for real-time monitoring of autonomous systems.

preprint, 2021, arXiv

Development of Diagnostics for High-Temperature High-Pressure Liquid Pb-16Li Applications

Liquid lead-lithium (Pb-16Li) is of primary interest as one of the candidate materials for tritium breeder, neutron multiplier and coolant fluid in liquid metal blanket concepts relevant to fusion power plants. For effective and reliable operation of such high-temperature liquid metal systems, monitoring and control of critical process parameters is essential. However, limited operational experience, coupled with high-temperature operating conditions and the corrosive nature of Pb-16Li, severely limits the application of commercially available diagnostic tools. This paper illustrates indigenous calibration test facility designs and experimental methods used to develop non-contact configuration level diagnostics using a pulse radar level sensor, wetted configuration pressure diagnostics using a diaphragm seal type pressure sensor, and bulk temperature diagnostics with temperature profiling for high-temperature, high-pressure liquid Pb and Pb-16Li applications. A calibration check of these sensors was performed using analytical methods, at temperatures between 380 °C and 400 °C and pressures up to 1 MPa (g). Reliability and performance validation were achieved through long-duration testing of the sensors in liquid Pb and liquid Pb-16Li environments for over 1000 hours. The estimated deviation for the radar level sensor lies within [-3.36 mm, +13.64 mm] and the estimated error for the pressure sensor lies within 1.1% of calibrated span over the entire test duration. Results obtained and critical observations from these tests are presented in this paper.

preprint, 2014, arXiv

Citations, Sequence Alignments, Contagion, and Semantics: On Acyclic Structures and their Randomness

Datasets from several domains, such as life-sciences, semantic web, machine learning, natural language processing, etc. are naturally structured as acyclic graphs. These datasets, particularly those in bio-informatics and computational epidemiology, have grown tremendously over the last decade or so. Increasingly, as a consequence, there is a need to build and evaluate various strategies for processing acyclic structured graphs. Most of the proposed research models the real world acyclic structures as random graphs, i.e., they are generated by randomly selecting a subset of edges from all possible edges. Unfortunately the graphs thus generated have predictable and degenerate structures, i.e., the resulting graphs will always have almost the same degree distribution and very short paths. Specifically, we show that if $O(n \log n \log n)$ edges are added to a binary tree of $n$ nodes then with probability more than $O(1/(\log n)^{1/n})$ the depth of all but $O({\log \log n} ^{\log \log n})$ vertices of the dag collapses to 1. Experiments show that irregularity, as measured by distribution of length of random walks from root to leaves, is also predictable and small. The degree distribution and random walk length properties of real world graphs from these domains are significantly different from random graphs of similar vertex and edge size.

preprint, 2012, arXiv

External Memory based Distributed Generation of Massive Scale Social Networks on Small Clusters

Small distributed systems are limited by their main memory when generating massively large graphs. A trivial extension of current graph generators to utilize external memory leads to a large amount of random I/O and hence does not scale with size. In this work we offer a technique to generate massive scale graphs on a small cluster of compute nodes with limited main memory. We develop several distributed and external memory algorithms, primarily shuffle, relabel, redistribute, and compressed-sparse-row (CSR) convert. The algorithms are implemented in an MPI/pthread model to help parallelize the operations across multicores within each core. Using our scheme it is feasible to generate a graph of size $2^{38}$ nodes (scale 38) using only 64 compute nodes. By comparison, the current scheme would require at least 8192 compute nodes, assuming 64GB of main memory. Our work has broader implications for external memory graph libraries such as STXXL and graph processing on SSD-based supercomputers such as Dash and Gordon [1][2].

preprint, 2012, arXiv

Lower bounds for Arrangement-based Range-Free Localization in Sensor Networks

Colanders are location-aware entities that collaborate to determine the approximate location of mobile or static objects when beacons from an object are received by all colanders that are within its distance $R$. This model, referred to as arrangement-based localization, does not require distance estimation between entities, which has been shown to be highly erroneous in practice. Colanders are applicable to localization in sensor networks and tracking of mobile objects. A set $S \subset {\mathbb R}^2$ is an $(R,ε)$-colander if by placing receivers at the points of $S$, a wireless device with transmission radius $R$ can be localized to within a circle of radius $ε$. We present tight upper and lower bounds on the size of $(R,ε)$-colanders. We measure the expected size of colanders that will form $(R, ε)$-colanders if they are distributed uniformly over the plane.
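The arrangement-based model described in this abstract can be illustrated with a small sketch. It is not the paper's construction: the function names (`signature`, `feasible_cell`, `cell_radius`), the three receiver positions, and the coarse grid of candidate positions are all assumptions made for illustration. The idea shown is only that the set of receivers hearing a beacon selects one cell in the arrangement of radius-$R$ disks, and the extent of that cell bounds how precisely the device can be localized.

```python
import math
from typing import FrozenSet, List, Tuple

Point = Tuple[float, float]

def signature(receivers: List[Point], p: Point, R: float) -> FrozenSet[int]:
    """Indices of the receivers that hear a beacon sent from p with radius R."""
    return frozenset(i for i, r in enumerate(receivers) if math.dist(r, p) <= R)

def feasible_cell(receivers: List[Point], heard: FrozenSet[int], R: float,
                  candidates: List[Point]) -> List[Point]:
    """Candidate positions consistent with exactly the observed set of receivers."""
    return [p for p in candidates if signature(receivers, p, R) == heard]

def cell_radius(cell: List[Point]) -> float:
    """Largest distance from the centroid of the cell to any of its points."""
    cx = sum(x for x, _ in cell) / len(cell)
    cy = sum(y for _, y in cell) / len(cell)
    return max(math.dist((cx, cy), p) for p in cell)

# Hypothetical layout: three receivers and a device they all hear.
receivers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
R = 1.0
heard = signature(receivers, (0.6, 0.6), R)

# Coarse grid of candidate positions (an approximation of the plane).
grid = [(i / 10, j / 10) for i in range(-10, 21) for j in range(-10, 21)]
cell = feasible_cell(receivers, heard, R, grid)
print(sorted(heard), len(cell), cell_radius(cell))
```

No distance estimate is used anywhere: only the binary fact of which receivers heard the beacon. Adding receivers refines the arrangement into smaller cells, which is the quantity the bounds on colander size concern.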

preprint, 2012, arXiv

Pipelined Workflow in Hybrid MPI/Pthread runtime for External Memory Graph Construction

Graph construction from a given set of edges is a data-intensive operator that appears in social network analysis, ontology-enabled databases, and other analytics processing. The operator converts an edge list to a compressed sparse row (CSR) representation (or sometimes to an adjacency list or clustered B-Tree storage). In this work, we show how to scale CSR construction to massive scale on SSD-enabled supercomputers such as Gordon using pipelined processing. We develop several abstractions and operations for external memory and parallel edge list and integer array processing that are utilized towards building a scalable algorithm for creating the CSR representation. Our experiments demonstrate that this scheme is four to six times faster than currently available implementations. Moreover, our scheme can handle up to 8 billion edges (128GB) by using external memory, as compared to prior schemes where performance degrades considerably for edge lists of size 26 million and beyond.
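For readers unfamiliar with the target format, the following is a minimal in-memory sketch of converting an edge list to CSR. It is not the pipelined, external-memory algorithm of the paper; the function name `edges_to_csr` and the sample edges are hypothetical, and everything is assumed to fit in main memory.

```python
from typing import List, Tuple

def edges_to_csr(num_nodes: int,
                 edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (row_ptr, col_idx) from a directed edge list."""
    # Pass 1: count the out-degree of each source node.
    row_ptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    # Prefix sum turns the counts into starting offsets per node.
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]
    # Pass 2: place each destination into the next free slot of its row.
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr is not modified
    for src, dst in edges:
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
row_ptr, col_idx = edges_to_csr(4, edges)
# The neighbours of node v are col_idx[row_ptr[v]:row_ptr[v + 1]].
print(row_ptr, col_idx)
```

The second pass writes to scattered positions in `col_idx`. That access pattern is cheap in memory but becomes random I/O once the arrays live on external storage, which is the kind of cost the abstract's pipelined processing is meant to address.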

preprint, 2002, arXiv

The STRESS Method for Boundary-point Performance Analysis of End-to-end Multicast Timer-Suppression Mechanisms

Evaluation of Internet protocols usually uses random scenarios or scenarios based on designers' intuition. Such an approach may be useful for average-case analysis but does not cover boundary-point (worst or best-case) scenarios. To synthesize boundary-point scenarios a more systematic approach is needed. In this paper, we present a method for automatic synthesis of worst and best case scenarios for protocol boundary-point evaluation. Our method uses a fault-oriented test generation (FOTG) algorithm for searching the protocol and system state space to synthesize these scenarios. The algorithm is based on a global finite state machine (FSM) model. We extend the algorithm with timing semantics to handle end-to-end delays and address performance criteria. We introduce the notion of a virtual LAN to represent delays of the underlying multicast distribution tree. The algorithms used in our method utilize implicit backward search using branch and bound techniques and start from given target events. This aims to reduce the search complexity drastically. As a case study, we use our method to evaluate variants of the timer suppression mechanism, used in various multicast protocols, with respect to two performance criteria: overhead of response messages and response time. Simulation results for reliable multicast protocols show that our method provides a scalable way for synthesizing worst-case scenarios automatically. Results obtained using stress scenarios differ dramatically from those obtained through average-case analyses. We hope for our method to serve as a model for applying systematic scenario generation to other multicast protocols.
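To make the gap between boundary-point and average-case behaviour concrete, here is a simplified sketch of timer suppression. It is not the FOTG algorithm or the virtual LAN model from the paper. It assumes a single uniform one-way delay between all members, and the function name `responders` and the timer values are hypothetical.

```python
from typing import List

def responders(timers: List[float], delay: float) -> List[int]:
    """Members that send a response under a uniform-delay suppression model.

    Each member sends when its timer expires unless a response from another
    member has already reached it. With one uniform delay, the earliest
    response arrives everywhere at min(timers) + delay, so only members whose
    timers expire before that instant (plus the earliest member) respond.
    """
    first = min(timers)
    return [i for i, t in enumerate(timers) if t == first or t < first + delay]

# Timers spread far apart relative to the delay: one response (best case).
print(responders([1.0, 5.0, 9.0], delay=2.0))
# Timers clustered within one delay: nobody is suppressed (worst case).
print(responders([1.0, 1.5, 2.5], delay=2.0))
```

Under this toy model the message overhead depends entirely on how the timers fall relative to the delay, which is why randomly drawn scenarios can miss the worst case that a systematic search targets.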
