Source author record

Kaushik Roy

Kaushik Roy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

79 works

22 topics

4 close collaborators

preprint · 2026 · arXiv

ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.

preprint · 2026 · arXiv

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

Text-to-image diffusion models can generate visually stunning images, yet controlling what appears and how it appears remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast and controllable image generation.
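
The contrast between linear and spherical moves described in this abstract can be illustrated with plain spherical linear interpolation (slerp). This is a minimal sketch and not HEART itself: it ignores the Kent-distribution modeling entirely, works on toy 2-D vectors rather than text-encoder embeddings, and the function names `slerp`, `lerp`, and `norm` are illustrative assumptions.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    n = norm(v)
    return [x / n for x in v]


def lerp(a, b, t):
    """Straight-line (Euclidean) interpolation; the result leaves the sphere."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t):
    """Interpolate along the great-circle arc between a and b.

    Antipodal inputs (angle of pi) are not handled in this sketch.
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    if omega < 1e-8:
        return list(a)  # vectors are (nearly) identical
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]
    print("lerp midpoint norm: ", norm(lerp(a, b, 0.5)))
    print("slerp midpoint norm:", norm(slerp(a, b, 0.5)))
```

The linear midpoint falls inside the sphere, while the geodesic midpoint keeps unit length, which is the geometric point the abstract makes about treating the embedding space as Euclidean.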

preprint · 2026 · arXiv

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

In Online Continual Learning (OCL), a neural network sequentially learns from a non-stationary data stream in a single pass with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training runs for multiple epochs over large datasets. The main challenge faced by OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient gating and meta-learned regularization. Gradient gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients, evaluating the effect of parameter updates on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluated our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain-incremental learning on CLEAR-10 and class-incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves the highest accuracy among all baselines and achieves positive Backward Transfer, overcoming forgetting on CLEAR-10.

preprint · 2024 · arXiv

EV-Planner: Energy-Efficient Robot Navigation via Event-Based Physics-Guided Neuromorphic Planner

Vision-based object tracking is an essential precursor to performing autonomous aerial navigation in order to avoid obstacles. Biologically inspired neuromorphic event cameras are emerging as a powerful alternative to frame-based cameras, due to their ability to asynchronously detect varying intensities (even in poor lighting conditions), high dynamic range, and robustness to motion blur. Spiking neural networks (SNNs) have gained traction for processing events asynchronously in an energy-efficient manner. On the other hand, physics-based artificial intelligence (AI) has gained prominence recently, as it enables embedding system knowledge via physical modeling inside traditional analog neural networks (ANNs). In this letter, we present an event-based physics-guided neuromorphic planner (EV-Planner) to perform obstacle avoidance using neuromorphic event cameras and physics-based AI. We consider the task of autonomous drone navigation where the mission is to detect moving gates and fly through them while avoiding a collision. We use event cameras to perform object detection using a shallow spiking neural network in an unsupervised fashion. Utilizing the physical equations of the brushless DC motors present in the drone rotors, we train a lightweight energy-aware physics-guided neural network (PgNN) with depth inputs. This predicts the optimal flight time responsible for generating near-minimum energy paths. We spawn the drone in the Gazebo simulator and implement a sensor-fused vision-to-planning neuro-symbolic framework using Robot Operating System (ROS). Simulation results for safe collision-free flight trajectories are presented with performance analysis, ablation study and potential future research directions.

preprint · 2022 · arXiv

A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks

Deep Learning neural networks are pervasive, but traditional computer architectures are reaching the limits of being able to efficiently execute them for the large workloads of today. They are limited by the von Neumann bottleneck: the high cost in energy and latency incurred in moving data between memory and the compute engine. Today, special CMOS designs address this bottleneck. The next generation of computing hardware will need to eliminate or dramatically mitigate this bottleneck. We discuss how compute-in-memory can play an important part in this development. Here, a non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation, repeatedly used in all neural network workloads. The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines. In the first part of this review we take a co-design view of the design constraints and the demands it places on the new materials and memory devices that anchor the cross-bar architecture. In the second part, we review what is known about the different new non-volatile memory materials and devices suited for compute in-memory, and discuss the outlook and challenges.

preprint · 2022 · arXiv

Can Language Models Capture Graph Semantics? From Graphs to Language Model and Vice-Versa

Knowledge Graphs are a great resource to capture semantic knowledge in terms of entities and relationships between the entities. However, current deep learning models take as input distributed representations or vectors. Thus, the graph is compressed in a vectorized representation. We conduct a study to examine if the deep learning model can compress a graph and then output the same graph with most of the semantics intact. Our experiments show that Transformer models are not able to express the full semantics of the input knowledge graph. We find that this is due to the disparity between the directed, relationship- and type-based information contained in a Knowledge Graph and the fully connected token-token undirected graphical interpretation of the Transformer Attention matrix.

preprint · 2022 · arXiv

Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts

Conversational Agents (CAs) powered with deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, the CAs have been used to provide informational or therapeutic services to patients. However, the utility of CAs to assist in mental health triaging has not been explored in the existing work as it requires a controlled generation of follow-up questions (FQs), which are often initiated and guided by the mental health professionals (MHPs) in clinical settings. In the context of depression, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support. Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user's initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user's post and which ones would require more information from the user. Using performance analysis based on MCC scores, we show that PRIMATE is appropriate for identifying questions in PHQ-9 that could guide generative DLMs towards controlled FQ generation suitable for aiding triaging. Dataset created as a part of this research: https://github.com/primate-mh/Primate2022
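
The abstract reports its performance analysis in terms of MCC scores. As a reminder of what that metric measures, the sketch below computes the standard Matthews correlation coefficient for a binary decision from confusion-matrix counts. How the authors aggregated it across PHQ-9 questions is not stated here, so the per-question usage in the example and the counts themselves are purely illustrative assumptions.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 0.0 is returned when the denominator
    vanishes (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one question: "is it already answered in the post?"
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Unlike plain accuracy, MCC stays informative when the two classes are imbalanced, which is a plausible reason to prefer it for answered-versus-unanswered labels.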

preprint · 2022 · arXiv

Low Precision Decentralized Distributed Training over IID and non-IID Data

Decentralized distributed learning is the key to enabling large-scale machine learning (training) on edge devices utilizing private user-generated local data, without relying on the cloud. However, the practical realization of such on-device training is limited by the communication and compute bottleneck. In this paper, we propose and show the convergence of low precision decentralized training that aims to reduce the computational complexity and communication cost of decentralized training. Many feedback-based compression techniques have been proposed in the literature to reduce communication costs. To the best of our knowledge, there is no work that applies and shows compute efficient training techniques such as quantization, pruning, etc., for peer-to-peer decentralized learning setups. Since real-world applications have a significant skew in the data distribution, we design "Range-EvoNorm" as the normalization activation layer which is better suited for low precision training over non-IID data. Moreover, we show that the proposed low precision training can be used in synergy with other communication compression methods decreasing the communication cost further. Our experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full precision counterpart even with non-IID data. However, when low precision training is accompanied by communication compression through sparsification we observe a 1-2% drop in accuracy. The proposed low precision decentralized training decreases computational complexity, memory usage, and communication cost by 4x and compute energy by a factor of ~20x, while trading off less than 1% accuracy for both IID and non-IID data. In particular, with higher skew values, we observe an increase in accuracy (by ~0.5%) with low precision training, indicating the regularization effect of the quantization.
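
To make the idea of 8-bit values concrete, the sketch below applies generic uniform (affine) quantization to a list of floats and reconstructs them. It is an assumption-laden illustration, not the paper's training scheme: the function names `quantize_uint8` and `dequantize` are hypothetical, and nothing here models Range-EvoNorm, gradients, or peer-to-peer communication.

```python
def quantize_uint8(values):
    """Map floats onto integer codes 0..255 with a shared scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255
    if scale == 0:
        scale = 1.0  # all values identical; any scale reconstructs them
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [c * scale + lo for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    codes, scale, lo = quantize_uint8(weights)
    restored = dequantize(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(codes, "max reconstruction error:", worst)
```

Storing one byte per value instead of a 32-bit float is a 4x reduction in size, which is in line with the memory and communication figure quoted in the abstract; the rounding error per value is bounded by half the quantization step.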

preprint · 2022 · arXiv

Non-Volume Preserving-based Fusion to Group-Level Emotion Recognition on Crowd Videos

Group-level emotion recognition (ER) is a growing research area as demand for assessing crowds of all sizes grows in both the security arena and social media. This work extends the earlier ER investigations, which focused on either group-level ER on single images or within a video, by fully investigating group-level expression recognition on crowd videos. In this paper, we propose an effective deep feature level fusion mechanism to model the spatial-temporal information in the crowd videos. In our approach, the fusing process is performed on the deep feature domain by a generative probabilistic model, Non-Volume Preserving Fusion (NVPF), that models spatial information relationships. Furthermore, we extend our proposed spatial NVPF approach to the spatial-temporal NVPF approach to learn the temporal information between frames. To demonstrate the robustness and effectiveness of each component in the proposed approach, three experiments were conducted: (i) evaluation on the AffectNet database to benchmark the proposed EmoNet for recognizing facial expression; (ii) evaluation on EmotiW2018 to benchmark the proposed deep feature level fusion mechanism NVPF; and (iii) evaluation of the proposed TNVPF on an innovative Group-level Emotion on Crowd Videos (GECV) dataset composed of 627 videos collected from publicly available sources. The GECV dataset is a collection of videos containing crowds of people. Each video is labeled with emotion categories at three levels: individual faces, group of people, and the entire video frame.

preprint · 2022 · arXiv

Norm-Scaling for Out-of-Distribution Detection

Out-of-Distribution (OoD) inputs are examples that do not belong to the true underlying distribution of the dataset. Research has shown that deep neural nets make confident mispredictions on OoD inputs. Therefore, it is critical to identify OoD inputs for safe and reliable deployment of deep neural nets. Often a threshold is applied on a similarity score to detect OoD inputs. One such similarity is angular similarity, which is the dot product of the latent representation with the mean class representation. Angular similarity encodes uncertainty: for example, the lower the angular similarity, the less certain it is that the input belongs to that class. However, we observe that different classes have different distributions of angular similarity. Therefore, applying a single threshold for all classes is not ideal, since the same similarity score represents different uncertainties for different classes. In this paper, we propose norm-scaling, which normalizes the logits separately for each class. This ensures that a single value consistently represents similar uncertainty for various classes. We show that norm-scaling, when used with the maximum softmax probability detector, achieves a 9.78% improvement in AUROC, a 5.99% improvement in AUPR and a 33.19% reduction in FPR95 over previous state-of-the-art methods.

preprint · 2022 · arXiv

Process Knowledge-Infused AI: Towards User-level Explainability, Interpretability, and Safety

AI systems have been widely adopted across various domains in the real world. However, in high-value, sensitive, or safety-critical applications such as self-management for personalized health or food recommendation with a specific purpose (e.g., allergy-aware recipe recommendations), their adoption is unlikely. First, the AI system needs to follow guidelines or well-defined processes set by experts; the data alone will not be adequate. For example, to diagnose the severity of depression, mental healthcare providers use the Patient Health Questionnaire (PHQ-9). So if an AI system were to be used for diagnosis, the medical guideline implied by the PHQ-9 needs to be used. Likewise, a nutritionist's knowledge and steps would need to be used for an AI system that guides a diabetic patient in developing a food plan. Second, the BlackBox nature typical of many current AI systems will not work; the user of an AI system will need to be able to give user-understandable explanations, explanations constructed using concepts that humans can understand and are familiar with. This is the key to eliciting confidence and trust in the AI system. For such applications, in addition to data and domain knowledge, the AI systems need to have access to and use the Process Knowledge, an ordered set of steps that the AI system needs to use or adhere to.

preprint · 2022 · arXiv

Process Knowledge-infused Learning for Suicidality Assessment on Social Media

Improving the performance and natural language explanations of deep learning algorithms is a priority for adoption by humans in the real world. In several domains, such as healthcare, such technology has significant potential to reduce the burden on humans by providing quality assistance at scale. However, current methods rely on the traditional pipeline of predicting labels from data, thus completely ignoring the process and guidelines used to obtain the labels. Furthermore, post hoc explanations on the data to label prediction using explainable AI (XAI) models, while satisfactory to computer scientists, leave much to be desired to the end-users due to lacking explanations of the process in terms of human-understandable concepts. We introduce, formalize, and develop a novel Artificial Intelligence (AI) paradigm: Process Knowledge-infused Learning (PK-iL). PK-iL utilizes a structured process knowledge that explicitly explains the underlying prediction process in a way that makes sense to end-users. The qualitative human evaluation confirms, through an annotator agreement of 0.72, that humans understand the explanations for the predictions. PK-iL also performs competitively with the state-of-the-art (SOTA) baselines.

preprint · 2021 · arXiv

"Is depression related to cannabis?": A knowledge-infused model for Entity and Relation Extraction with Limited Supervision

With strong marketing advocacy of the benefits of cannabis use for improved mental health, cannabis legalization is a priority among legislators. However, preliminary scientific research does not conclusively associate cannabis with improved mental health. In this study, we explore the relationship between depression and consumption of cannabis in a targeted social media corpus involving personal use of cannabis with the intent to derive its potential mental health benefit. We use tweets that contain an association among three categories annotated by domain experts - Reason, Effect, and Addiction. The state-of-the-art Natural Language Processing techniques fall short in extracting these relationships between cannabis phrases and the depression indicators. We seek to address the limitation by using domain knowledge; specifically, the Drug Abuse Ontology for addiction augmented with Diagnostic and Statistical Manual of Mental Disorders lexicons for mental health. Because of the lack of annotations due to the limited availability of the domain experts' time, we use supervised contrastive learning in conjunction with GPT-3 trained on a vast corpus to achieve improved performance even with limited supervision. Experimental results show that our method can extract cannabis-depression relationships significantly better than the state-of-the-art relation extractor. High-quality annotations can be provided using a nearest neighbor approach using the learned representations that can be used by the scientific community to understand the association between cannabis and depression better.

preprint · 2021 · arXiv

Knowledge Infused Policy Gradients for Adaptive Pandemic Control

COVID-19 has impacted nations differently based on their policy implementations. Effective policy requires taking into account public information and adaptability to new knowledge. Epidemiological models built to understand COVID-19 seldom provide the policymaker with the capability for adaptive pandemic control (APC). The core challenges to be overcome include (a) inability to handle a high degree of non-homogeneity in different contributing features across the pandemic timeline, (b) lack of an approach that enables adaptive incorporation of public health expert knowledge, and (c) lack of transparent models that enable understanding of the decision-making process in suggesting policy. In this work, we take the early steps to address these challenges using Knowledge Infused Policy Gradient (KIPG) methods. Prior work on knowledge infusion does not handle soft and hard imposition of varying forms of knowledge in disease information and guidelines to necessarily comply with. Furthermore, the models do not attend to non-homogeneity in feature counts, manifesting as partial observability in informing the policy. Additionally, interpretable structures are extracted post-learning instead of learning an interpretable model required for APC. To this end, we introduce a mathematical framework for KIPG methods that can (a) induce relevant feature counts over multi-relational features of the world, (b) handle latent non-homogeneous counts as hidden variables that are linear combinations of kernelized aggregates over the features, and (c) infuse knowledge as functional constraints in a principled manner. The study establishes a theory for imposing hard and soft constraints and simulates it through experiments. In comparison with knowledge-intensive baselines, we show quick sample efficient adaptation to new knowledge and interpretability in the learned policy, especially in a pandemic context.

preprint · 2021 · arXiv

SPACE: Structured Compression and Sharing of Representational Space for Continual Learning

Humans learn adaptively and efficiently throughout their lives. However, incrementally learning tasks causes artificial neural networks to overwrite relevant information learned about older tasks, resulting in 'Catastrophic Forgetting'. Efforts to overcome this phenomenon often utilize resources poorly, for instance, by growing the network architecture or needing to save parametric importance scores, or violate data privacy between tasks. To tackle this, we propose SPACE, an algorithm that enables a network to learn continually and efficiently by partitioning the learnt space into a Core space, that serves as the condensed knowledge base over previously learned tasks, and a Residual space, which is akin to a scratch space for learning the current task. After learning each task, the Residual is analyzed for redundancy, both within itself and with the learnt Core space. A minimal number of extra dimensions required to explain the current task are added to the Core space and the remaining Residual is freed up for learning the next task. We evaluate our algorithm on P-MNIST, CIFAR and a sequence of 8 different datasets, and achieve comparable accuracy to the state-of-the-art methods while overcoming catastrophic forgetting. Additionally, our algorithm is well suited for practical use. The partitioning algorithm analyzes all layers in one shot, ensuring scalability to deeper networks. Moreover, the analysis of dimensions translates to filter-level sparsity, and the structured nature of the resulting architecture gives us up to 5x improvement in energy efficiency during task inference over the current state-of-the-art.

preprint · 2020 · arXiv

A Low Effort Approach to Structured CNN Design Using PCA

Deep learning models hold state of the art performance in many fields, yet their design is still based on heuristics or grid search methods that often result in overparametrized networks. This work proposes a method to analyze a trained network and deduce an optimized, compressed architecture that preserves accuracy while keeping computational costs tractable. Model compression is an active field of research that targets the problem of realizing deep learning models in hardware. However, most pruning methodologies tend to be experimental, requiring large compute and time intensive iterations of retraining the entire network. We introduce structure into model design by proposing a single shot analysis of a trained network that serves as a first order, low effort approach to dimensionality reduction, by using PCA (Principal Component Analysis). The proposed method simultaneously analyzes the activations of each layer and considers the dimensionality of the space described by the filters generating these activations. It optimizes the architecture in terms of number of layers, and number of filters per layer without any iterative retraining procedures, making it a viable, low effort technique to design efficient networks. We demonstrate the proposed methodology on AlexNet and VGG style networks on the CIFAR-10, CIFAR-100 and ImageNet datasets, and successfully achieve an optimized architecture with a reduction of up to 3.8X and 9X in the number of operations and parameters respectively, while trading off less than 1% accuracy. We also apply the method to MobileNet, and achieve 1.7X and 3.9X reduction in the number of operations and parameters respectively, while improving accuracy by almost one percentage point.
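
One way to read the PCA step in this abstract is as a count of how many principal components are needed to explain most of the variance in a layer's activations. The sketch below does only that counting, starting from a list of eigenvalues. The eigenvalues, the variance thresholds, and the function name `significant_dimensions` are assumptions made for illustration; the abstract does not state the threshold the authors use or how the counts map to filters and layers.

```python
def significant_dimensions(eigenvalues, threshold=0.999):
    """Smallest number of principal components whose cumulative
    explained variance reaches `threshold` (a fraction of the total)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


if __name__ == "__main__":
    # Hypothetical variance spectrum of one layer's activations.
    spectrum = [5.0, 3.0, 1.5, 0.4, 0.1]
    for fraction in (0.9, 0.999):
        print(fraction, "->", significant_dimensions(spectrum, fraction), "dimensions")
```

Under this reading, a layer whose count is well below its number of filters is a candidate for shrinking, and the analysis needs only a single pass over a trained network rather than iterative retraining.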

preprint · 2020 · arXiv

Conditionally Deep Hybrid Neural Networks Across Edge and Cloud

The pervasiveness of "Internet-of-Things" in our daily life has led to a recent surge in fog computing, encompassing a collaboration of cloud computing and edge intelligence. To that effect, deep learning has been a major driving force towards enabling such intelligent systems. However, growing model sizes in deep learning pose a significant challenge towards deployment in resource-constrained edge devices. Moreover, in a distributed intelligence environment, efficient workload distribution is necessary between edge and cloud systems. To address these challenges, we propose a conditionally deep hybrid neural network for enabling AI-based fog computing. The proposed network can be deployed in a distributed manner, consisting of quantized layers and early exits at the edge and full-precision layers on the cloud. During inference, if an early exit has high confidence in the classification results, it would allow samples to exit at the edge, and the deeper layers on the cloud are activated conditionally, which can lead to improved energy efficiency and inference latency. We perform an extensive design space exploration with the goal of minimizing energy consumption at the edge while achieving state-of-the-art classification accuracies on image classification tasks. We show that with binarized layers at the edge, the proposed conditional hybrid network can process 65% of inferences at the edge, leading to 5.5x computational energy reduction with minimal accuracy degradation on CIFAR-10 dataset. For the more complex dataset CIFAR-100, we observe that the proposed network with 4-bit quantization at the edge achieves 52% early classification at the edge with 4.8x energy reduction. The analysis gives us insights on designing efficient hybrid networks which achieve significantly higher energy efficiency than full-precision networks for edge-cloud based distributed intelligence systems.
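
The conditional early-exit inference described in this abstract can be sketched as a confidence check at the edge. The code below is a minimal illustration under stated assumptions: the maximum softmax probability stands in for "confidence", the threshold of 0.9 is arbitrary, and `edge_model` and `cloud_model` are hypothetical callables returning logits rather than the quantized and full-precision networks from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sample, edge_model, cloud_model, threshold=0.9):
    """Return (predicted_class, where_it_exited).

    The cloud model is invoked only when the edge exit is not confident.
    """
    probs = softmax(edge_model(sample))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "edge"
    probs = softmax(cloud_model(sample))
    return probs.index(max(probs)), "cloud"


if __name__ == "__main__":
    # Stand-in models that read precomputed logits from the sample.
    edge = lambda s: s["edge_logits"]
    cloud = lambda s: s["cloud_logits"]
    easy = {"edge_logits": [4.0, 0.0, 0.0], "cloud_logits": [5.0, 0.0, 0.0]}
    hard = {"edge_logits": [1.0, 0.9, 0.0], "cloud_logits": [0.0, 3.0, 0.0]}
    print(classify(easy, edge, cloud))
    print(classify(hard, edge, cloud))
```

Raising the threshold sends more samples to the cloud, trading edge energy savings for accuracy; the fractions of early exits reported in the abstract depend on the authors' own networks and exit criteria.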

preprint · 2020 · arXiv

Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation

Spiking Neural Networks (SNNs) operate with asynchronous discrete events (or spikes) which can potentially lead to higher energy-efficiency in neuromorphic hardware implementations. Many works have shown that an SNN for inference can be formed by copying the weights from a trained Artificial Neural Network (ANN) and setting the firing threshold for each layer as the maximum input received in that layer. These converted SNNs require a large number of time steps to achieve competitive accuracy, which diminishes the energy savings. The number of time steps can be reduced by training SNNs with spike-based backpropagation from scratch, but that is computationally expensive and slow. To address these challenges, we present a computationally-efficient training technique for deep SNNs. We propose a hybrid training methodology: 1) take a converted SNN and use its weights and thresholds as an initialization step for spike-based backpropagation, and 2) perform incremental spike-timing dependent backpropagation (STDB) on this carefully initialized network to obtain an SNN that converges within a few epochs and requires fewer time steps for input processing. STDB is performed with a novel surrogate gradient function defined using the neuron's spike time. The proposed training methodology converges in less than 20 epochs of spike-based backpropagation for most standard image classification datasets, thereby greatly reducing the training complexity compared to training SNNs from scratch. We perform experiments on CIFAR-10, CIFAR-100, and ImageNet datasets for both VGG and ResNet architectures. We achieve top-1 accuracy of 65.19% for ImageNet dataset on SNN with 250 time steps, which is 10X faster compared to converted SNNs with similar accuracy.
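
The conversion recipe mentioned in this abstract (threshold set to the maximum input a layer receives) can be illustrated with a single integrate-and-fire neuron. This is a toy sketch under stated assumptions: a constant input current, a reset-by-subtraction rule, and hypothetical helper names `layer_threshold` and `firing_rate`. It does not implement the hybrid training or the STDB surrogate gradient.

```python
def layer_threshold(observed_inputs):
    """Conversion heuristic from the abstract: the firing threshold is the
    maximum input the layer received (here, over a list of recorded values)."""
    return max(observed_inputs)


def firing_rate(input_current, threshold, steps):
    """Simulate an integrate-and-fire neuron driven by a constant current and
    return the fraction of time steps on which it spiked."""
    membrane = 0.0
    spikes = 0
    for _ in range(steps):
        membrane += input_current
        if membrane >= threshold:
            spikes += 1
            membrane -= threshold  # reset by subtraction (an assumption)
    return spikes / steps


if __name__ == "__main__":
    threshold = layer_threshold([0.2, 0.7, 1.0, 0.4])
    for steps in (10, 100, 1000):
        print(steps, "steps ->", firing_rate(0.3, threshold, steps))
```

The spike rate can only take values that are multiples of 1/steps, so it approximates the ratio of input to threshold more finely as the number of time steps grows, which gives some intuition for why converted SNNs need long simulation windows.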

preprint · 2020 · arXiv

Enabling Spike-based Backpropagation for Training Deep Neural Network Architectures

Spiking Neural Networks (SNNs) have recently emerged as a prominent neural computing paradigm. However, the typical shallow SNN architectures have limited capacity for expressing complex representations, while training deep SNNs using input spikes has not been successful so far. Diverse methods have been proposed to get around this issue, such as converting off-the-shelf trained deep Artificial Neural Networks (ANNs) to SNNs. However, the ANN-SNN conversion scheme fails to capture the temporal dynamics of a spiking system. On the other hand, it is still a difficult problem to directly train deep SNNs using input spike events due to the discontinuous, non-differentiable nature of the spike generation function. To overcome this problem, we propose an approximate derivative method that accounts for the leaky behavior of LIF neurons. This method enables training deep convolutional SNNs directly (with input spike events) using spike-based backpropagation. Our experiments show the effectiveness of the proposed spike-based learning on deep networks (VGG and Residual architectures) by achieving the best classification accuracies in MNIST, SVHN and CIFAR-10 datasets compared to other SNNs trained with spike-based learning. Moreover, we analyze sparse event-based computations to demonstrate the efficacy of the proposed SNN training method for inference operation in the spiking domain.

preprint2020arXiv

Explicitly Trained Spiking Sparsity in Spiking Neural Networks with Backpropagation

Spiking Neural Networks (SNNs) are being explored for their potential energy efficiency resulting from sparse, event-driven computations. Many recent works have demonstrated effective backpropagation for deep SNNs by approximating gradients over discontinuous neuron spikes or firing events. A beneficial side-effect of these surrogate gradient spiking backpropagation algorithms is that the spikes, which trigger additional computations, may now themselves be directly considered in the gradient calculations. We propose an explicit inclusion of spike counts in the loss function, along with a traditional error loss, causing the backpropagation learning algorithms to optimize weight parameters for both accuracy and spiking sparsity. As supported by existing theory of over-parameterized neural networks, there are many solution states with effectively equivalent accuracy. As such, appropriate weighting of the two loss goals during training in this multi-objective optimization process can yield an improvement in spiking sparsity without a significant loss of accuracy. We additionally explore a simulated annealing-inspired loss weighting technique to increase the weighting for sparsity as training time increases. Our preliminary results on the CIFAR-10 dataset show up to 70.1% reduction in spiking activity with iso-accuracy compared to an equivalent SNN trained only for accuracy, and up to 73.3% reduction in spiking activity if allowed a trade-off of 1% reduction in classification accuracy.
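
The two-term objective described in this abstract can be sketched as an error loss plus a weighted spike count, with the sparsity weight growing over training. The abstract does not give the schedule, so the linear ramp below, the names `sparsity_weight` and `combined_loss`, and the value of `max_weight` are hypothetical choices for illustration only.

```python
def sparsity_weight(epoch, total_epochs, max_weight=1e-3):
    """Hypothetical schedule: ramp the sparsity weight linearly from 0 up to max_weight."""
    return max_weight * epoch / max(total_epochs - 1, 1)

def combined_loss(error_loss, spike_count, epoch, total_epochs):
    """Error loss plus a spike-count penalty whose weight increases as training proceeds."""
    return error_loss + sparsity_weight(epoch, total_epochs) * spike_count

# Early in training the penalty is off; later it dominates more of the objective.
print(combined_loss(0.5, 1000, epoch=0, total_epochs=10))
print(combined_loss(0.5, 1000, epoch=9, total_epochs=10))
```

In a real training loop the spike count would have to be a differentiable quantity (for example, via a surrogate gradient) for the penalty to influence the weights.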

preprint2020arXiv

Fitted Q-Learning for Relational Domains

We consider the problem of Approximate Dynamic Programming in relational domains. Inspired by the success of fitted Q-learning methods in propositional settings, we develop the first relational fitted Q-learning algorithms by representing the value function and Bellman residuals. When we fit the Q-functions, we show how the two steps of the Bellman operator, application and projection, can be performed using a gradient-boosting technique. Our proposed framework performs reasonably well on standard domains without using domain models and while using fewer training trajectories.

preprint2020arXiv

GENIEx: A Generalized Approach to Emulating Non-Ideality in Memristive Xbars using Neural Networks

The analog nature of computing in memristive crossbars poses significant issues due to various non-idealities such as parasitic resistances and non-linear I-V characteristics of the device. These non-idealities can have a detrimental impact on the functionality, i.e. the computational accuracy, of crossbars. Past works have explored modeling the non-idealities using analytical techniques. However, several non-idealities have data-dependent behavior. This cannot be captured using analytical (non data-dependent) models, which limits their suitability for predicting application accuracy. To address this, we propose a Generalized Approach to Emulating Non-Ideality in Memristive Crossbars using Neural Networks (GENIEx), which accurately captures the data-dependent nature of non-idealities. We perform extensive HSPICE simulations of crossbars with different voltage and conductance combinations. Following that, we train a neural network to learn the transfer characteristics of the non-ideal crossbar. Next, we build a functional simulator which includes key architectural facets such as tiling and bit-slicing to analyze the impact of non-idealities on the classification accuracy of large-scale neural networks. We show that GENIEx achieves low root mean square errors (RMSE) of $0.25$ and $0.7$ for low and high voltages, respectively, compared to HSPICE. Additionally, the GENIEx errors are $7\times$ and $12.8\times$ better than an analytical model which can only capture the linear non-idealities. Further, using the functional simulator and GENIEx, we demonstrate that an analytical model can overestimate the degradation in classification accuracy by $\ge 10\%$ on CIFAR-100 and $3.7\%$ on ImageNet datasets compared to GENIEx.

preprint2020arXiv

Hyperparameter Optimization in Binary Communication Networks for Neuromorphic Deployment

Training neural networks for neuromorphic deployment is non-trivial. There have been a variety of approaches proposed to adapt back-propagation or back-propagation-like algorithms appropriate for training. Considering that these networks often have very different performance characteristics than traditional neural networks, it is often unclear how to set either the network topology or the hyperparameters to achieve optimal performance. In this work, we introduce a Bayesian approach for optimizing the hyperparameters of an algorithm for training binary communication networks that can be deployed to neuromorphic hardware. We show that by optimizing the hyperparameters on this algorithm for each dataset, we can achieve improvements in accuracy over the previous state-of-the-art for this algorithm on each dataset (by up to 15 percent). This jump in performance continues to emphasize the potential when converting traditional neural networks to binary communication applicable to neuromorphic hardware.

preprint2020arXiv

IMAC: In-memory multi-bit Multiplication and ACcumulation in 6T SRAM Array

'In-memory computing' is being widely explored as a novel computing paradigm to mitigate the well known memory bottleneck. This emerging paradigm aims at embedding some aspects of computations inside the memory array, thereby avoiding frequent and expensive movement of data between the compute unit and the storage memory. In-memory computing with respect to silicon memories has been widely explored on various memory bit-cells. Embedding computation inside the 6 transistor (6T) SRAM array is of special interest since it is the most widely used on-chip memory. In this paper, we present a novel in-memory multiplication followed by accumulation operation capable of performing parallel dot products within 6T SRAM without any changes to the standard bitcell. We further study the effect of circuit non-idealities and process variations on the accuracy of the LeNet-5 and VGG neural network architectures against the MNIST and CIFAR-10 datasets, respectively. The proposed in-memory dot-product mechanism achieves 88.8% and 99% accuracy on CIFAR-10 and MNIST, respectively. Compared to the standard von Neumann system, the proposed system is 6.24x better in energy consumption and 9.42x better in delay.

preprint2020arXiv

Inclusive prompt photon-jet correlations as a probe of gluon saturation in electron-nucleus scattering at small $x$

We compute the differential cross-section for inclusive prompt photon$+$quark production in deeply inelastic scattering of electrons off nuclei at small $x$ ($e+A$ DIS) in the framework of the Color Glass Condensate effective field theory. The result is expressed as a convolution of the leading order (in the strong coupling $α_{\mathrm{s}}$) impact factor for the process and universal dipole matrix elements, in the limit of hard photon transverse momentum relative to the nuclear saturation scale $Q_{s,A}(x)$. We perform a numerical study of this process for the kinematics of the Electron-Ion Collider (EIC), exploring in particular the azimuthal angle correlations between the final state photon and quark. We observe a systematic suppression and broadening pattern of the back-to-back peak in the relative azimuthal angle distribution, as the saturation scale is increased by replacing proton targets with gold nuclei. Our results suggest that photon+jet final states in inclusive $e+A$ DIS at high energies are in general a promising channel for exploring gluon saturation that is complementary to inclusive and diffractive dijet production. They also provide a sensitive empirical test of the universality of dipole matrix elements when compared to identical measurements in proton-nucleus collisions. However because photon+jet correlations at small $x$ in EIC kinematics require jet reconstruction at small $k_\perp$, it will be important to study their feasibility relative to photon-hadron correlations.

preprint2020arXiv

Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations

In the recent quest for trustworthy neural networks, we present Spiking Neural Network (SNN) as a potential candidate for inherent robustness against adversarial attacks. In this work, we demonstrate that adversarial accuracy of SNNs under gradient-based attacks is higher than their non-spiking counterparts for CIFAR datasets on deep VGG and ResNet architectures, particularly in blackbox attack scenario. We attribute this robustness to two fundamental characteristics of SNNs and analyze their effects. First, we exhibit that input discretization introduced by the Poisson encoder improves adversarial robustness with reduced number of timesteps. Second, we quantify the amount of adversarial accuracy with increased leak rate in Leaky-Integrate-Fire (LIF) neurons. Our results suggest that SNNs trained with LIF neurons and smaller number of timesteps are more robust than the ones with IF (Integrate-Fire) neurons and larger number of timesteps. Also we overcome the bottleneck of creating gradient-based adversarial inputs in temporal domain by proposing a technique for crafting attacks from SNN

preprint2020arXiv

Relevant-features based Auxiliary Cells for Energy Efficient Detection of Natural Errors

Deep neural networks have demonstrated state-of-the-art performance on many classification tasks. However, they have no inherent capability to recognize when their predictions are wrong. There have been several efforts in the recent past to detect natural errors but the suggested mechanisms pose additional energy requirements. To address this issue, we propose an ensemble of classifiers at hidden layers to enable energy efficient detection of natural errors. In particular, we append Relevant-features based Auxiliary Cells (RACs) which are class specific binary linear classifiers trained on relevant features. The consensus of RACs is used to detect natural errors. Based on combined confidence of RACs, classification can be terminated early, thereby resulting in energy efficient detection. We demonstrate the effectiveness of our technique on various image classification datasets such as CIFAR-10, CIFAR-100 and Tiny-ImageNet.

preprint2020arXiv

RMP-SNN: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Network

Spiking Neural Networks (SNNs) have recently attracted significant research interest as the third generation of artificial neural networks that can enable low-power event-driven data analytics. The best performing SNNs for image recognition tasks are obtained by converting a trained Analog Neural Network (ANN), consisting of Rectified Linear Units (ReLU), to SNN composed of integrate-and-fire neurons with "proper" firing thresholds. The converted SNNs typically incur loss in accuracy compared to that provided by the original ANN and require sizable number of inference time-steps to achieve the best accuracy. We find that performance degradation in the converted SNN stems from using "hard reset" spiking neuron that is driven to fixed reset potential once its membrane potential exceeds the firing threshold, leading to information loss during SNN inference. We propose ANN-SNN conversion using "soft reset" spiking neuron model, referred to as Residual Membrane Potential (RMP) spiking neuron, which retains the "residual" membrane potential above threshold at the firing instants. We demonstrate near loss-less ANN-SNN conversion using RMP neurons for VGG-16, ResNet-20, and ResNet-34 SNNs on challenging datasets including CIFAR-10 (93.63% top-1), CIFAR-100 (70.93% top-1), and ImageNet (73.09% top-1 accuracy). Our results also show that RMP-SNN surpasses the best inference accuracy provided by the converted SNN with "hard reset" spiking neurons using 2-8 times fewer inference time-steps across network architectures and datasets.
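
The difference between the "hard reset" and "soft reset" neurons described in this abstract can be seen in a few lines of simulation. The sketch below is an illustration under stated assumptions, not the paper's implementation: a single integrate-and-fire neuron receives a constant input, fires when the membrane potential reaches the threshold (the `>=` comparison is an assumption), and the function name `count_spikes` and the input values are hypothetical.

```python
def count_spikes(input_current, threshold, steps, soft_reset):
    """Simulate one integrate-and-fire neuron driven by a constant input; return its spike count."""
    membrane = 0.0
    spikes = 0
    for _ in range(steps):
        membrane += input_current
        if membrane >= threshold:
            spikes += 1
            if soft_reset:
                membrane -= threshold   # keep the residual potential above threshold
            else:
                membrane = 0.0          # hard reset discards the residual
    return spikes

# With input 0.75 and threshold 1.0, the ideal firing rate is 0.75 spikes per step.
print(count_spikes(0.75, 1.0, 100, soft_reset=True))   # 75
print(count_spikes(0.75, 1.0, 100, soft_reset=False))  # 50
```

In this toy case the soft-reset spike rate matches the input value, while the hard-reset neuron underestimates it because the potential above threshold is thrown away at every spike.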

preprint2020arXiv

RxNN: A Framework for Evaluating Deep Neural Networks on Resistive Crossbars

Resistive crossbars designed with non-volatile memory devices have emerged as promising building blocks for Deep Neural Network (DNN) hardware, due to their ability to compactly and efficiently realize vector-matrix multiplication (VMM), the dominant computational kernel in DNNs. However, a key challenge with resistive crossbars is that they suffer from a range of device and circuit level non-idealities such as interconnect parasitics, peripheral circuits, sneak paths, and process variations. These non-idealities can lead to errors in VMMs, eventually degrading the DNN's accuracy. It is therefore critical to study the impact of crossbar non-idealities on the accuracy of large-scale DNNs. However, this is challenging because existing device and circuit models are too slow to use in application-level evaluations. We present RxNN, a fast and accurate simulation framework to evaluate large-scale DNNs on resistive crossbar systems. RxNN splits and maps the computations involved in each DNN layer into crossbar operations, and evaluates them using a Fast Crossbar Model (FCM) that accurately captures the errors arising due to crossbar non-idealities while being four-to-five orders of magnitude faster than circuit simulation. FCM models a crossbar-based VMM operation using three stages - non-linear models for the input and output peripheral circuits (DACs and ADCs), and an equivalent non-ideal conductance matrix for the core crossbar array. We implement RxNN by extending the Caffe machine learning framework and use it to evaluate a suite of six large-scale DNNs developed for the ImageNet Challenge. Our experiments reveal that resistive crossbar non-idealities can lead to significant accuracy degradations (9.6%-32%) for these large-scale DNNs. To the best of our knowledge, this work is the first quantitative evaluation of the accuracy of large-scale DNNs on resistive crossbar based hardware.

preprint2020arXiv

sBSNN: Stochastic-Bits Enabled Binary Spiking Neural Network with On-Chip Learning for Energy Efficient Neuromorphic Computing at the Edge

In this work, we propose stochastic Binary Spiking Neural Network (sBSNN) composed of stochastic spiking neurons and binary synapses (stochastic only during training) that computes probabilistically with one-bit precision for power-efficient and memory-compressed neuromorphic computing. We present an energy-efficient implementation of the proposed sBSNN using 'stochastic bit' as the core computational primitive to realize the stochastic neurons and synapses, which are fabricated in 90nm CMOS process, to achieve efficient on-chip training and inference for image recognition tasks. The measured data shows that the 'stochastic bit' can be programmed to mimic spiking neurons, and stochastic Spike Timing Dependent Plasticity (or sSTDP) rule for training the binary synaptic weights without expensive random number generators. Our results indicate that the proposed sBSNN realization offers possibility of up to 32x neuronal and synaptic memory compression compared to full precision (32-bit) SNN and energy efficiency of 89.49 TOPS/Watt for two-layer fully-connected SNN.

preprint2020arXiv

Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks

Event-based cameras display great potential for a variety of tasks such as high-speed motion detection and navigation in low-light environments where conventional frame-based cameras suffer critically. This is attributed to their high temporal resolution, high dynamic range, and low-power consumption. However, conventional computer vision methods as well as deep Analog Neural Networks (ANNs) are not suited to work well with the asynchronous and discrete nature of event camera outputs. Spiking Neural Networks (SNNs) serve as ideal paradigms to handle event camera outputs, but deep SNNs suffer in terms of performance due to the spike vanishing phenomenon. To overcome these issues, we present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs without sacrificing the performance. The network is end-to-end trained with self-supervised learning on Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Spike-FlowNet outperforms its corresponding ANN-based method in terms of the optical flow prediction capability while providing significant computational efficiency.

preprint2020arXiv

Towards Understanding the Effect of Leak in Spiking Neural Networks

Spiking Neural Networks (SNNs) are being explored to emulate the astounding capabilities of human brain that can learn and compute functions robustly and efficiently with noisy spiking activities. A variety of spiking neuron models have been proposed to resemble biological neuronal functionalities. With varying levels of bio-fidelity, these models often contain a leak path in their internal states, called membrane potentials. While the leaky models have been argued as more bioplausible, a comparative analysis between models with and without leak from a purely computational point of view demands attention. In this paper, we investigate the questions regarding the justification of leak and the pros and cons of using leaky behavior. Our experimental results reveal that leaky neuron model provides improved robustness and better generalization compared to models with no leak. However, leak decreases the sparsity of computation contrary to the common notion. Through a frequency domain analysis, we demonstrate the effect of leak in eliminating the high-frequency components from the input, thus enabling SNNs to be more robust against noisy spike-inputs.

preprint2020arXiv

Vec2Face: Unveil Human Faces from their Blackbox Features in Face Recognition

Unveiling face images of a subject given his/her high-level representations extracted from a blackbox Face Recognition engine is extremely challenging. This is because of the limited information accessible from that engine, including its structure and its uninterpretable extracted features. This paper presents a novel generative structure with Bijective Metric Learning, namely Bijective Generative Adversarial Networks in a Distillation framework (DiBiGAN), for synthesizing faces of an identity given that person's features. To address this problem effectively, this work first introduces a bijective metric so that the distance measurement and metric learning process can be directly adopted in the image domain for an image reconstruction task. Second, a distillation process is introduced to maximize the information exploited from the blackbox face recognition engine. Then a Feature-Conditional Generator Structure with Exponential Weighting Strategy is presented for a more robust generator that can synthesize realistic faces with ID preservation. Results on several benchmarking datasets including CelebA, LFW, AgeDB, and CFP-FP against matching engines have demonstrated the effectiveness of DiBiGAN on both image realism and ID preservation properties.

preprint2019arXiv

Constructing Energy-efficient Mixed-precision Neural Networks through Principal Component Analysis for Edge Intelligence

The `Internet of Things' has brought increased demand for AI-based edge computing in applications ranging from healthcare monitoring systems to autonomous vehicles. Quantization is a powerful tool to address the growing computational cost of such applications, and yields significant compression over full-precision networks. However, quantization can result in substantial loss of performance for complex image classification tasks. To address this, we propose a Principal Component Analysis (PCA) driven methodology to identify the important layers of a binary network, and design mixed-precision networks. The proposed Hybrid-Net achieves a more than 10% improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on CIFAR-100 and ImageNet datasets while still achieving up to 94% of the energy-efficiency of XNOR-Nets. This work furthers the feasibility of using highly compressed neural networks for energy-efficient neural computing in edge devices.

preprint2019arXiv

Controlled Forgetting: Targeted Stimulation and Dopaminergic Plasticity Modulation for Unsupervised Lifelong Learning in Spiking Neural Networks

Stochastic gradient descent requires that training samples be drawn from a uniformly random distribution of the data. For a deployed system that must learn online from an uncontrolled and unknown environment, the ordering of input samples often fails to meet this criterion, making lifelong learning a difficult challenge. We exploit the locality of the unsupervised Spike Timing Dependent Plasticity (STDP) learning rule to target local representations in a Spiking Neural Network (SNN) to adapt to novel information while protecting essential information in the remainder of the SNN from catastrophic forgetting. In our Controlled Forgetting Networks (CFNs), novel information triggers stimulated firing and heterogeneously modulated plasticity, inspired by biological dopamine signals, to cause rapid and isolated adaptation in the synapses of neurons associated with outlier information. This targeting controls the forgetting process in a way that reduces the degradation of accuracy for older tasks while learning new tasks. Our experimental results on the MNIST dataset validate the capability of CFNs to learn successfully over time from an unknown, changing environment, achieving 95.36% accuracy, which we believe is the best unsupervised accuracy ever achieved by a fixed-size, single-layer SNN on a completely disjoint MNIST dataset.

preprint2019arXiv

Extracting many-body correlators of saturated gluons with precision from inclusive photon+dijet final states in deeply inelastic scattering

We highlight the principal results of a computation in the Color Glass Condensate effective field theory (CGC EFT) of the next-to-leading order (NLO) impact factor for inclusive photon+dijet production at Bjorken $x_{\rm Bj} \ll 1$ in deeply inelastic electron-nucleus (e+A DIS) collisions. When combined with extant results for next-to-leading log $x_{\rm Bj}$ JIMWLK renormalization group (RG) evolution of gauge invariant two-point ("dipole") and four-point ("quadrupole") correlators of light-like Wilson lines, the inclusive photon+dijet e+A DIS cross-section can be determined to $\sim 10\%$ accuracy. Our computation simultaneously provides the ingredients to compute fully inclusive DIS, inclusive photon, inclusive dijet and inclusive photon+jet channels to the same accuracy. This makes feasible quantitative extraction of many-body correlators of saturated gluons and precise determination of the saturation scale $Q_{S,A}(x_{\rm Bj})$ at a future Electron-Ion Collider. An interesting feature of our NLO result is the structure of the violation of the soft gluon theorem in the Regge limit. Another is the appearance in gluon emission of time-like non-global logs which also satisfy JIMWLK RG evolution.

preprint2019arXiv

Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing

Deep convolutional neural network (DCNN) based supervised learning is a widely practiced approach for large-scale image classification. However, retraining these large networks to accommodate new, previously unseen data demands high computational time and energy requirements. Also, previously seen training samples may not be available at the time of retraining. We propose an efficient training methodology and an incrementally growing DCNN to learn new tasks while sharing part of the base network. Our proposed methodology is inspired by transfer learning techniques, although it does not forget previously learned tasks. An updated network for learning a new set of classes is formed using previously learned convolutional layers (shared from the initial part of the base network) with the addition of a few new convolutional kernels in the later layers of the network. We employed a 'clone-and-branch' technique which allows the network to learn new tasks one after another without any performance loss in old tasks. We evaluated the proposed scheme on several recognition applications. The classification accuracy achieved by our approach is comparable to the regular incremental learning approach (where networks are updated with new training samples only, without any network sharing), while achieving energy efficiency and reductions in storage requirements, memory access and training time.

preprint2019arXiv

NLO impact factor for inclusive photon$+$dijet production in $e+A$ DIS at small $x$

We compute the next-to-leading order (NLO) impact factor for inclusive photon $+$dijet production in electron-nucleus (e+A) deeply inelastic scattering (DIS) at small $x$. An important ingredient in our computation is the simple structure of ``shock wave" fermion and gluon propagators. This allows one to employ standard momentum space Feynman diagram techniques for higher order computations in the Regge limit of fixed $Q^2\gg Λ_{\rm QCD}^2$ and $x\rightarrow 0$. Our computations in the Color Glass Condensate (CGC) effective field theory include the resummation of all-twist power corrections $Q_s^2/Q^2$, where $Q_s$ is the saturation scale in the nucleus. We discuss the structure of ultraviolet, collinear and soft divergences in the CGC, and extract the leading logs in $x$; the structure of the corresponding rapidity divergences gives a nontrivial first principles derivation of the JIMWLK renormalization group evolution equation for multiparton lightlike Wilson line correlators. Explicit expressions are given for the $x$-independent $O(α_s)$ contributions that constitute the NLO impact factor. These results, combined with extant results on NLO JIMWLK evolution, provide the ingredients to compute the inclusive photon $+$ dijet cross-section at small $x$ to $O(α_s^3 \ln(x))$. First results for the NLO impact factor in inclusive dijet production are recovered in the soft photon limit. A byproduct of our computation is the LO photon+ 3 jet (quark-antiquark-gluon) cross-section.

preprint2016arXiv

Attention Tree: Learning Hierarchies of Visual Features for Large-Scale Image Recognition

One of the key challenges in machine learning is to design a computationally efficient multi-class classifier while maintaining the output accuracy and performance. In this paper, we present a tree-based classifier, Attention Tree (ATree), for large-scale image classification that uses recursive Adaboost training to construct a visual attention hierarchy. The proposed attention model is inspired by the biological 'selective tuning mechanism for cortical visual processing'. We exploit the inherent feature similarity across images in datasets to identify the input variability and use a recursive optimization procedure to determine data partitioning at each node, thereby learning the attention hierarchy. A set of binary classifiers is organized on top of the learnt hierarchy to minimize the overall test-time complexity. The attention model maximizes the margins for the binary classifiers for optimal decision boundary modelling, leading to better performance at minimal complexity. The proposed framework has been evaluated on both Caltech-256 and SUN datasets and achieves accuracy improvement over state-of-the-art tree-based methods at significantly lower computational cost.

preprint2016arXiv

Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition

Deep learning neural networks have emerged as one of the most powerful classification tools for vision related applications. However, the computational and energy requirements associated with such deep nets can be quite high, and hence their energy-efficient implementation is of great interest. Although traditionally the entire network is utilized for the recognition of all inputs, we observe that the classification difficulty varies widely across inputs in real-world datasets; only a small fraction of inputs require the full computational effort of a network, while a large majority can be classified correctly with very low effort. In this paper, we propose Conditional Deep Learning (CDL) where the convolutional layer features are used to identify the variability in the difficulty of input instances and conditionally activate the deeper layers of the network. We achieve this by cascading a linear network of output neurons for each convolutional layer and monitoring the output of the linear network to decide whether classification can be terminated at the current stage or not. The proposed methodology thus enables the network to dynamically adjust the computational effort depending upon the difficulty of the input data while maintaining competitive classification accuracy. We evaluate our approach on the MNIST dataset. Our experiments demonstrate that our proposed CDL yields 1.91x reduction in average number of operations per input, which translates to 1.84x improvement in energy. In addition, our results show an improvement in classification accuracy from 97.5% to 98.9% as compared to the original network.
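
The early-termination idea in this abstract (attach a linear classifier to each convolutional stage and stop as soon as its output is confident enough) can be sketched as follows. This is a toy illustration rather than the CDL implementation: the function `conditional_classify`, the confidence rule (top score at or above a threshold), and the two stand-in stages are all hypothetical.

```python
def conditional_classify(stages, x, confidence_threshold):
    """Run stages in order; return (label, depth) at the first confident stage or at the last one."""
    for depth, (extract, classify) in enumerate(stages, start=1):
        x = extract(x)                       # features of the current stage
        scores = classify(x)                 # output of the classifier attached to this stage
        label = max(range(len(scores)), key=scores.__getitem__)
        if scores[label] >= confidence_threshold or depth == len(stages):
            return label, depth

# Hypothetical stages: a cheap first stage and a decisive final stage.
stages = [
    (lambda f: f, lambda f: [f[0], 1.0 - f[0]]),
    (lambda f: f, lambda f: [1.0, 0.0] if f[0] >= 0.5 else [0.0, 1.0]),
]
print(conditional_classify(stages, [0.95], 0.9))  # (0, 1): easy input exits early
print(conditional_classify(stages, [0.6], 0.9))   # (0, 2): hard input uses the full network
```

The returned depth is a simple proxy for computational effort: inputs that exit at depth 1 never pay for the later stages.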

preprint2016arXiv

Energy-Efficient Object Detection using Semantic Decomposition

Machine-learning algorithms offer immense possibilities in the development of several cognitive applications. In fact, large scale machine-learning classifiers now represent the state-of-the-art in a wide range of object detection/classification problems. However, the network complexities of large-scale classifiers present them as one of the most challenging and energy intensive workloads across the computing spectrum. In this paper, we present a new approach to optimize energy efficiency of object detection tasks using semantic decomposition to build a hierarchical classification framework. We observe that certain semantic information like color/texture are common across various images in real-world datasets for object detection applications. We exploit these common semantic features to distinguish the objects of interest from the remaining inputs (non-objects of interest) in a dataset at a lower computational effort. We propose a 2-stage hierarchical classification framework, with increasing levels of complexity, wherein the first stage is trained to recognize the broad representative semantic features relevant to the object of interest. The first stage rejects the input instances that do not have the representative features and passes only the relevant instances to the second stage. Our methodology thus allows us to reject certain information at lower complexity and utilize the full computational effort of a network only on a smaller fraction of inputs to perform detection. We use color and texture as distinctive traits to carry out several experiments for object detection. Our experiments on the Caltech101/CIFAR10 dataset show that the proposed method yields 1.93x/1.46x improvement in average energy, respectively, over the traditional single classifier model.

preprint2016arXiv

Hybrid Spintronic-CMOS Spiking Neural Network With On-Chip Learning: Devices, Circuits and Systems

Over the past decade Spiking Neural Networks (SNN) have emerged as one of the popular architectures to emulate the brain. In SNN, information is temporally encoded and communication between neurons is accomplished by means of spikes. In such networks, spike-timing dependent plasticity mechanisms require the online programming of synapses based on the temporal information of spikes transmitted by spiking neurons. In this work, we propose a spintronic synapse with decoupled spike transmission and programming current paths. The spintronic synapse consists of a ferromagnet-heavy metal heterostructure where programming current through the heavy metal generates spin-orbit torque to modulate the device conductance. Low programming energy and fast programming times demonstrate the efficacy of the proposed device as a nanoelectronic synapse. We perform a simulation study based on an experimentally benchmarked device-simulation framework to demonstrate the interfacing of such spintronic synapses with CMOS neurons and learning circuits operating in transistor sub-threshold region to form a network of spiking neurons that can be utilized for pattern recognition problems.

preprint2016arXiv

Ising spin model using Spin-Hall Effect (SHE) induced magnetization reversal in Magnetic-Tunnel-Junction

The Ising spin model is considered an efficient computing method for solving combinatorial optimization problems because of its natural tendency to converge towards a low-energy state. The underlying basic functions facilitating the Ising model can be categorized into two parts: annealing and majority vote. In this paper, we propose an Ising cell based on Spin Hall Effect (SHE) induced magnetization switching in a Magnetic Tunnel Junction (MTJ). The stochasticity of our proposed Ising cell, based on SHE-induced MTJ switching, can implement the natural annealing process by preventing the system from being stuck in solutions with local minima. Further, by controlling the current through the Heavy Metal (HM) underlying the MTJ, we can mimic the majority vote function which determines the next state of the individual spins. By solving coupled Landau-Lifshitz-Gilbert (LLG) equations, we demonstrate that our Ising cell can be replicated to map certain combinatorial problems. We present results for two representative problems, Maximum-cut and Graph coloring, to illustrate the feasibility of the proposed device-circuit configuration in solving combinatorial problems. Our proposed solution using a Heavy Metal (HM) based MTJ device can be exploited to implement a compact, fast, and energy-efficient Ising spin model.
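The interplay of the two functions named above, annealing and majority vote, can be illustrated in software on the Maximum-cut problem. The sketch below is a plain software analogy and not the device model from the paper: it does not solve LLG equations, and the function names, the linear noise schedule, and the default parameters are assumptions introduced for illustration. Random flips stand in for device stochasticity, and each spin otherwise opposes the majority of its neighbours.

```python
import random


def cut_value(edges, spins):
    """Number of edges whose two endpoints carry opposite spins."""
    return sum(1 for u, v in edges if spins[u] != spins[v])


def anneal_max_cut(num_nodes, edges, sweeps=200, start_noise=0.5, seed=0):
    """Heuristic Max-cut search: noisy majority-vote updates with decaying noise."""
    rng = random.Random(seed)
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    spins = [rng.choice((-1, 1)) for _ in range(num_nodes)]
    best = list(spins)
    for sweep in range(sweeps):
        # Assumed schedule: flip probability decays linearly towards zero.
        noise = start_noise * (1.0 - sweep / sweeps)
        for i in range(num_nodes):
            if rng.random() < noise:
                spins[i] = -spins[i]  # stochastic flip, helps escape local minima
                continue
            field = sum(spins[j] for j in neighbors[i])
            if field > 0:
                spins[i] = -1  # oppose the majority of neighbours
            elif field < 0:
                spins[i] = 1
            # field == 0 is a tie: keep the current spin
        if cut_value(edges, spins) > cut_value(edges, best):
            best = list(spins)
    return best, cut_value(edges, best)
```

The routine is a heuristic: it returns the best partition seen at the end of any sweep and gives no guarantee of reaching the optimum.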

preprint2016arXiv

MESL: Proposal for a Non-volatile Cascadable Magneto-Electric Spin Logic

In the quest for novel, scalable and energy-efficient computing technologies, many non-charge based logic devices are being explored. Recent advances in multi-ferroic materials have paved the way for electric field induced low energy and fast switching of nano-magnets using the magneto-electric (ME) effect. In this paper, we propose a voltage driven logic-device based on the ME induced switching of nano-magnets. We further demonstrate that the proposed logic-device, which exhibits decoupled read and write paths, can be used to construct a complete logic family including XNOR, NAND and NOR gates. The proposed logic family shows good scalability with a quadratic dependence of switching energy with respect to the switching voltage. Further, the proposed logic-device has better robustness against the effect of thermal noise as compared to the conventional current driven switching of nano-magnets. A device-to-circuit level coupled simulation framework, including magnetization dynamics and electron transport model, has been developed for analyzing the present proposal. Using our simulation framework, we present energy and delay results for the proposed Magneto-Electric Spin Logic (MESL) gates.

preprint2016arXiv

Probabilistic Deep Spiking Neural Systems Enabled by Magnetic Tunnel Junction

Deep Spiking Neural Networks are becoming increasingly powerful tools for cognitive computing platforms. However, most of the existing literature on such computing models is developed with limited insight into the underlying hardware implementation, resulting in area- and power-expensive designs. Although several neuromimetic devices emulating neural operations have been proposed recently, their functionality has been limited to very simple neural models that may prove to be inefficient at complex recognition tasks. In this work, we venture into the relatively unexplored area of utilizing the inherent device stochasticity of such neuromimetic devices to model complex neural functionalities in a probabilistic framework in the time domain. We consider the implementation of a Deep Spiking Neural Network capable of performing high-accuracy and low-latency classification tasks where the neural computing unit is enabled by the stochastic switching behavior of a Magnetic Tunnel Junction. Simulation studies indicate an energy improvement of $20\times$ over a baseline CMOS design in 45nm technology.

preprint2016arXiv

Proposal for an All-Spin Artificial Neural Network: Emulating Neural and Synaptic Functionalities Through Domain Wall Motion in Ferromagnets

Non-Boolean computing based on emerging post-CMOS technologies can potentially pave the way for low-power neural computing platforms. However, existing work on such emerging neuromorphic architectures has focused on mimicking either the neuron or the synapse functionality alone. While memristive devices have been proposed to emulate biological synapses, spintronic devices have proved to be efficient at performing the thresholding operation of the neuron at ultra-low currents. In this work, we propose an All-Spin Artificial Neural Network where a single spintronic device acts as the basic building block of the system. The device offers a direct mapping to synapse and neuron functionalities in the brain while inter-layer network communication is accomplished via CMOS transistors. To the best of our knowledge, this is the first demonstration of a neural architecture where a single nanoelectronic device is able to mimic both neurons and synapses. The ultra-low voltage operation of low resistance magneto-metallic neurons enables the low-voltage operation of the array of spintronic synapses, thereby leading to ultra-low power neural architectures. Device-level simulations, calibrated to experimental results, were used to drive the circuit and system level simulations of the neural network for a standard pattern recognition problem. Simulation studies indicate energy savings of ~100x in comparison to a corresponding digital/analog CMOS neuron implementation.

preprint2016arXiv

Unsupervised Regenerative Learning of Hierarchical Features in Spiking Deep Networks for Object Recognition

We present a spike-based unsupervised regenerative learning scheme to train Spiking Deep Networks (SpikeCNN) for object recognition problems using biologically realistic leaky integrate-and-fire neurons. The training methodology is based on the Auto-Encoder learning model wherein the hierarchical network is trained layer wise using the encoder-decoder principle. Regenerative learning uses spike-timing information and inherent latencies to update the weights and learn representative levels for each convolutional layer in an unsupervised manner. The features learnt from the final layer in the hierarchy are then fed to an output layer. The output layer is trained with supervision by showing a fraction of the labeled training dataset and performs the overall classification of the input. Our proposed methodology yields 0.92%/29.84% classification error on MNIST/CIFAR10 datasets which is comparable with state-of-the-art results. The proposed methodology also introduces sparsity in the hierarchical feature representations on account of event-based coding resulting in computationally efficient learning.

preprint2016arXiv

Yield, Area and Energy Optimization in Stt-MRAMs using failure aware ECC

Spin Transfer Torque MRAMs are attractive due to their non-volatility, high density and zero leakage. However, STT-MRAMs suffer from poor reliability due to shared read and write paths. Additionally, conflicting requirements for data retention and write-ability (both related to the energy barrier height of the magnet) make design more challenging. Furthermore, the energy barrier height depends on the physical dimensions of the free layer. Any variations in the dimensions of the free layer lead to variations in the energy barrier height. In order to address the poor reliability of STT-MRAMs, the use of Error Correcting Codes (ECC) has been proposed. Unlike in traditional CMOS memory technologies, ECC is expected to correct both soft and hard errors in STT-MRAMs. To achieve acceptable yield with low write power, stronger ECC is required, resulting in an increased number of encoded bits and degraded memory efficiency. In this paper, we propose Failure aware ECC (FaECC), which masks permanent faults while maintaining the same correction capability for soft errors without increased encoded bits. Furthermore, we investigate the impact of process variations on run-time reliability of STT-MRAMs. We provide an analysis of the impact of process variations on the life-time of the free layer and retention failures. In order to analyze the effectiveness of our methodology, we developed a cross-layer simulation framework that consists of device, circuit and array level analysis of STT-MRAM memory arrays. Our results show that using FaECC relaxes the requirements on the energy barrier height, which reduces the write energy and results in smaller access transistor size and memory array area. Keywords: STT-MRAM, reliability, Error Correcting Codes, ECC, magnetic memory

preprint2015arXiv

Energy Efficient and High Performance Current-Mode Neural Network Circuit using Memristors and Digitally Assisted Analog CMOS Neurons

Emerging nano-scale programmable Resistive-RAM (RRAM) has been identified as a promising technology for implementing brain-inspired computing hardware. Several neural network architectures that essentially involve computation of scalar products between input data vectors and stored network weights can be efficiently implemented using high density cross-bar arrays of RRAM integrated with CMOS. In such a design, the CMOS interface may be responsible for providing input excitations and for processing the RRAM output. The design of the RRAM-CMOS interface can therefore play a major role in achieving high energy efficiency along with high integration density in RRAM based neuromorphic hardware. In this work we propose the design of a high performance, current mode CMOS interface for RRAM based neural network design. The use of current mode excitation for the input interface and the design of a digitally assisted current-mode CMOS neuron circuit for the output interface are presented. The proposed technique achieves a 10x improvement in energy as well as performance over conventional approaches employed in the literature. Network level simulations show that the proposed scheme can achieve 2 orders of magnitude lower energy dissipation as compared to a digital ASIC implementation of a feed-forward neural network.

preprint2015arXiv

Exploring Spin-Transfer-Torque Devices for Logic Applications

As CMOS nears the end of the projected scaling roadmap, significant effort has been devoted to the search for new materials and devices that can realize memory and logic. Spintronics is one of the promising directions for the Post-CMOS era. While the potential of spintronic memories is relatively well known, realizing logic remains an open and critical challenge. All Spin Logic (ASL) is a recently proposed logic style that realizes Boolean logic using spin-transfer-torque (STT) devices based on the principle of non-local spin torque. ASL has advantages such as density, non-volatility, and low operating voltage. However, it also suffers from drawbacks such as low speed and static power dissipation. Recent work has shown that, in the context of simple arithmetic circuits (adders, multipliers), the efficiency of ASL can be greatly improved using techniques that utilize its unique characteristics. An evaluation of ASL across a broad range of circuits, considering the known optimization techniques, is an important next step in determining its viability. In this work, we propose a systematic methodology for the synthesis of ASL circuits. Our methodology performs various optimizations that benefit ASL, such as intra-cycle power gating, stacking of ASL nanomagnets, and fine-grained logic pipelining. We utilize the proposed methodology to evaluate the suitability of ASL implementations for a wide range of benchmarks, viz. random combinational and sequential logic, digital signal processing circuits, and the Leon SPARC3 general-purpose processor. Based on our evaluation, we identify (i) the large current requirement of nanomagnets at fast switching speeds, (ii) the static power dissipation in the all-metallic devices, and (iii) the short spin flip length in interconnects as key bottlenecks that limit the competitiveness of ASL.

preprint2015arXiv

High Sensitivity Biosensor using Injection Locked Spin Torque Nano-Oscillators

With ever-increasing research, magnetic nano systems have been shown to have great potential in areas such as magnetic storage, biosensing, and magnetoresistive insulation. In the field of biosensing specifically, Spin Valve sensors coupled with Magnetic Nanolabels are showing great promise due to noise immunity and energy efficiency [1]. In this paper we present the application of an injection-locked Spin Torque Nano Oscillator (STNO) suitable for high-resolution, energy-efficient labeled DNA detection. The proposed STNO microarray consists of 20 such devices oscillating at different frequencies, making it possible to multiplex all the signals using capacitive coupling. Frequency Division Multiplexing can be aided with Time Division Multiplexing to increase the device integration and decrease the readout time while maintaining the same efficiency in the presence of constant input-referred noise.

preprint2015arXiv

Modeling and Simulation of Spin Transfer Torque Generated at Topological Insulator/Ferromagnetic Heterostructure

The Topological Insulator (TI) has recently emerged as an attractive candidate for possible application to spintronic circuits because of its strong spin-orbit coupling. TIs are unique materials that have an insulating bulk but conducting surface states due to band inversion, and these surface states are protected by time reversal symmetry. In this paper, we propose a physics-based spin dynamics simulation framework for TI/Ferromagnet (TI/FM) bilayer heterostructures that is able to capture the electronic band structure of a TI while calculating the electron and spin transport properties. Our model differs from TI/FM models proposed in the literature in that it is able to account for the 3D band structure of TIs and the effect of exchange coupling and external magnetic field on the band structure. Our proposed approach uses a 2D surface Hamiltonian for TIs that includes all necessary features for spin transport calculations so as to properly model the characteristics of a TI/FM heterostructure. Using this Hamiltonian and appropriate parameters, we show that the effects of quantum confinement and exchange coupling are successfully captured in the calculated surface band structure when compared with the quantum well band diagram of a 3D TI, and that the result matches well with experimental data reported in the literature. We then show how this calibrated Hamiltonian is used with the self-consistent non-equilibrium Green's functions (NEGF) formalism to determine the charge and spin transport in TI/FM bilayer heterostructures. Our calculations agree well with experimental data and capture the unique features of a TI/FM heterostructure such as high spin Hall angle and high spin conductivity. Finally, we show how the results obtained from NEGF calculations may be incorporated into the Landau-Lifshitz-Gilbert-Slonczewski (LLGS) formulation to simulate the magnetization dynamics of an FM layer sitting on top of a TI.

preprint2015arXiv

Spin-Torque Sensors for Energy Efficient High Speed Long Interconnects

In this paper, we propose a Spin-Torque (ST) based sensing scheme that can enable energy efficient multi-bit long distance interconnect architectures. Current-mode interconnects have recently been proposed to overcome the performance degradations associated with conventional voltage mode Copper (Cu) interconnects. However, the performance of current mode interconnects is limited by analog current sensing transceivers and equalization circuits. As a solution, we propose the use of ST based receivers that use Magnetic Tunnel Junctions (MTJ) and simple digital components for current-to-voltage conversion and do not require analog transceivers. We incorporate Spin-Hall Metal (SHM) in our design to achieve high speed sensing. We show both single and multi-bit operations that reveal major benefits at higher speeds. Our simulation results show that the proposed technique consumes only 3.93-4.72 fJ/bit/mm energy while operating at 1-2 Gbits/sec, which is considerably better than existing charge based interconnects. In addition, Voltage Controlled Magnetic Anisotropy (VCMA) can reduce the required current at the sensor. With the inclusion of VCMA, the energy consumption can be further reduced to 2.02-4.02 fJ/bit/mm.

preprint2014arXiv

Design and Synthesis of Ultra Low Energy Spin-Memristor Threshold Logic

A threshold logic gate (TLG) performs a weighted sum of multiple inputs and compares the sum with a threshold. We propose Spin-Memristor Threshold Logic (SMTL) gates, which employ a memristive cross-bar array (MCA) to perform current-mode summation of binary inputs, whereas the low-voltage, fast-switching spintronic threshold devices (STD) carry out the threshold operation in an energy efficient manner. Field programmable SMTL gate arrays can operate at a small terminal voltage of ~50mV, resulting in ultra-low power consumption in gates as well as programmable interconnect networks. We evaluate the performance of SMTL using threshold logic synthesis. Results for common benchmarks show that SMTL based programmable logic hardware can be more than 100x more energy efficient than state of the art CMOS FPGA.
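The threshold logic gate defined in the first sentence of this abstract can be written down directly. The following is a minimal functional sketch, not a model of the SMTL hardware: the function names are hypothetical, and the convention that the gate outputs 1 when the weighted sum is greater than or equal to the threshold is an assumption (the abstract only says the sum is compared with a threshold).

```python
def threshold_gate(inputs, weights, threshold):
    """Output 1 when the weighted sum of binary inputs reaches the threshold.

    Assumed convention: the comparison is 'greater than or equal'.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def majority3(a, b, c):
    """3-input majority: unit weights, threshold 2."""
    return threshold_gate((a, b, c), (1, 1, 1), 2)


def and3(a, b, c):
    """3-input AND: unit weights, threshold 3."""
    return threshold_gate((a, b, c), (1, 1, 1), 3)


def or3(a, b, c):
    """3-input OR: unit weights, threshold 1."""
    return threshold_gate((a, b, c), (1, 1, 1), 1)
```

Changing only the threshold turns the same weighted-sum structure into a different Boolean function, which is the property that makes threshold gates attractive for programmable logic.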

preprint2014arXiv

Hierarchical Temporal Memory Based on Spin-Neurons and Resistive Memory for Energy-Efficient Brain-Inspired Computing

Hierarchical temporal memory (HTM) tries to mimic the computing in the cerebral neocortex. It identifies spatial and temporal patterns in the input for making inferences. This may require a large number of computationally expensive tasks such as dot-product evaluations. Nano-devices that can provide direct mapping for such primitives are of great interest. In this work we show that the computing blocks for HTM can be mapped using low-voltage, fast-switching, magneto-metallic spin-neurons combined with an emerging resistive cross-bar network (RCN). Results show the possibility of more than 200x lower energy as compared to a 45nm CMOS ASIC design.

preprint2014arXiv

Laser Induced Magnetization Reversal for Detection in Optical Interconnects

Optical interconnect has emerged as the front-runner to replace electrical interconnect, especially for off-chip communication. However, a major drawback of optical interconnects is the need for photodetectors and amplifiers at the receiver, usually implemented with direct bandgap semiconductors and analog CMOS circuits, leading to large energy consumption and slow operation. In this article, we propose a new optical interconnect architecture that uses a magnetic tunnel junction (MTJ) at the receiver side that is switched by femtosecond laser pulses. The state of the MTJ can be sensed using simple digital CMOS latches, resulting in significant improvement in energy consumption. Moreover, magnetization in the MTJ can be switched on the picosecond time scale and our design can operate at a speed of 5 Gbits/sec for a single link.

preprint2014arXiv

Multiple alignment of structures using center of proteins

In this paper we report on an algorithm for aligning multiple protein structures. The algorithm has been tested on a variety of inputs and it performs well in comparison to well-known algorithms for this problem.

preprint2014arXiv

Spin Orbit Torque Based Electronic Neuron

A device based on current-induced spin-orbit torque (SOT) that functions as an electronic neuron is proposed in this work. The SOT device implements an artificial neuron's thresholding (transfer) function. In the first step of a two-step switching scheme, a charge current places the magnetization of a nano-magnet along the hard-axis i.e. an unstable point for the magnet. In the second step, the SOT device (neuron) receives a current (from the synapses) which moves the magnetization from the unstable point to one of the two stable states. The polarity of the synaptic current encodes the excitatory and inhibitory nature of the neuron input, and determines the final orientation of the magnetization. A resistive crossbar array, functioning as synapses, generates a bipolar current that is a weighted sum of the inputs. The simulation of a two layer feed-forward Artificial Neural Network (ANN) based on the SOT electronic neuron shows that it consumes ~3X lower power than a 45nm digital CMOS implementation, while reaching ~80% accuracy in the classification of one hundred images of handwritten digits from the MNIST dataset.

preprint2014arXiv

Spin-Orbit Torque Induced Spike-Timing Dependent Plasticity

Nanoelectronic devices that mimic the functionality of synapses are a crucial requirement for performing cortical simulations of the brain. In this work we propose a ferromagnet-heavy metal heterostructure that employs spin-orbit torque to implement Spike-Timing Dependent Plasticity. The proposed device offers the advantage of decoupled spike transmission and programming current paths, thereby leading to reliable operation during online learning. Possible arrangement of such devices in a crosspoint architecture can pave the way for ultra-dense neural networks. Simulation studies indicate that the device has the potential of achieving pico-Joule level energy consumption (maximum 2 pJ per synaptic event) which is comparable to the energy consumption for synaptic events in biological synapses.

preprint2014arXiv

STT-SNN: A Spin-Transfer-Torque Based Soft-Limiting Non-Linear Neuron for Low-Power Artificial Neural Networks

Recent years have witnessed growing interest in the use of Artificial Neural Networks (ANNs) for vision, classification, and inference problems. An artificial neuron sums N weighted inputs and passes the result through a non-linear transfer function. Large-scale ANNs impose very high computing requirements for training and classification, leading to great interest in the use of post-CMOS devices to realize them in an energy efficient manner. In this paper, we propose a spin-transfer-torque (STT) device based on Domain Wall Motion (DWM) magnetic strip that can efficiently implement a Soft-limiting Non-linear Neuron (SNN) operating at ultra-low supply voltage and current. In contrast to previous spin-based neurons that can only realize hard-limiting transfer functions, the proposed STT-SNN displays a continuous resistance change with varying input current, and can therefore be employed to implement a soft-limiting neuron transfer function. Soft-limiting neurons are greatly preferred to hard-limiting ones due to their much improved modeling capacity, which leads to higher network accuracy and lower network complexity. We also present an ANN hardware design employing the proposed STT-SNNs and Memristor Crossbar Arrays (MCA) as synapses. The ultra-low voltage operation of the magneto metallic STT-SNN enables the programmable MCA-synapses, computing analog-domain weighted summation of input voltages, to also operate at ultra-low voltage. We modeled the STT-SNN using micro-magnetic simulation and evaluated them using an ANN for character recognition. Comparisons with analog and digital CMOS neurons show that STT-SNNs can achieve more than two orders of magnitude lower energy consumption.
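The difference between hard-limiting and soft-limiting transfer functions described above can be made concrete with a small numerical sketch. The code is illustrative only and does not model the STT-SNN device: the function names are hypothetical, and the logistic sigmoid is an assumed stand-in for a soft-limiting characteristic, since the abstract does not specify the shape of the transfer function.

```python
import math


def weighted_sum(inputs, weights):
    """Sum of N weighted inputs, as computed by an artificial neuron."""
    return sum(w * x for w, x in zip(weights, inputs))


def hard_limiting_neuron(inputs, weights, threshold=0.0):
    """Step transfer function: output jumps from 0 to 1 at the threshold."""
    return 1.0 if weighted_sum(inputs, weights) >= threshold else 0.0


def soft_limiting_neuron(inputs, weights, threshold=0.0, gain=1.0):
    """Smooth transfer function (assumed logistic sigmoid).

    The output varies continuously between 0 and 1 around the threshold;
    `gain` sets how sharp the transition is.
    """
    s = weighted_sum(inputs, weights)
    return 1.0 / (1.0 + math.exp(-gain * (s - threshold)))
```

With the soft-limiting version, inputs slightly below and slightly above the threshold give slightly different outputs rather than collapsing to 0 or 1, so the neuron output retains graded information about its input.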

preprint2013arXiv

Boolean and Non-Boolean Computation With Spin Devices

Recently, several device and circuit design techniques have been explored for applying nano-magnets and spin torque devices like spin valves and domain wall magnets in computational hardware. However, most of them have focused on digital logic, and their benefits over robust and high performance CMOS remain debatable. Ultra-low voltage, current-switching operation of magneto-metallic spin torque devices can potentially be more suitable for non-Boolean computation schemes that can exploit current-mode analog processing. Device-circuit co-design for different classes of non-Boolean architectures using spin-torque based neuron models in spin-CMOS hybrid circuits shows that the spin-based non-Boolean designs can achieve 15X-100X lower computation energy for applications like image-processing, data-conversion, cognitive-computing, pattern matching and programmable-logic, as compared to state of the art CMOS designs.

preprint2013arXiv

DSTT-MRAM: Differential Spin Hall MRAM for On-chip Memories

A new device structure for spin transfer torque based magnetic random access memory is proposed for on-chip memory applications. Our device structure exploits spin Hall effect to create a differential memory cell that exhibits fast and energy-efficient write operation. Moreover, due to inherently differential device structure, fast and reliable read operation can be performed. Our simulation study shows 10X improvement in write energy over the standard 1T1R STT-MRAM memory cell, and 1.6X faster read operation compared to single-ended sensing (as in standard 1T1R STT-MRAMs). The bit-cell characteristics are promising for high performance on-chip memory applications.

preprint2013arXiv

Energy-Efficient and Robust Associative Computing with Electrically Coupled Dual Pillar Spin-Torque Oscillators

Dynamics of coupled spin-torque oscillators (STOs) can be exploited for non-Boolean information processing. However, the feasibility of coupling a large number of STOs with energy-efficiency and sufficient robustness towards parameter-variation and thermal-noise may be critical for such computing applications. In this work, the impacts of parameter-variation and thermal-noise on two different coupling mechanisms for STOs, namely magnetic-coupling and electrical-coupling, are analyzed. Magnetic coupling is simulated using dipolar-field interactions. For electrical-coupling we employed global RF-injection. In this method, multiple STOs are phase-locked to a common RF-signal that is injected into the STOs along with the DC bias. Results for variation and noise analysis indicate that electrical-coupling can be significantly more robust as compared to magnetic-coupling. For room-temperature simulations, appreciable phase-lock was retained among tens of electrically coupled STOs for up to 20% 3σ random variations in critical device parameters. The magnetic-coupling technique, however, failed to retain locking beyond ~3% 3σ parameter-variations, even for small-size STO clusters with near-neighborhood connectivity. We propose and analyze the Dual-Pillar STO (DP-STO) for low-power computing using the proposed electrical coupling method. We observed that the DP-STO can better exploit the electrical-coupling technique due to the separation between the biasing RF signal and its own RF output.

preprint2013arXiv

Exploring Boolean and Non-Boolean Computing Applications of Spin Torque Devices

In this paper we discuss the potential of emerging spin-torque devices for computing applications. Recent proposals for spin-based computing schemes may be differentiated as all-spin vs. hybrid, programmable vs. fixed, and Boolean vs. non-Boolean. All-spin logic-styles may offer high area-density due to the small form-factor of nano-magnetic devices. However, circuit and system-level design techniques need to be explored that leverage the specific spin-device characteristics to achieve energy-efficiency, performance and reliability comparable to those of CMOS. The non-volatility of nanomagnets can be exploited in the design of energy and area-efficient programmable logic. In such logic-styles, spin-devices may play the dual-role of computing as well as memory-elements that provide field-programmability. Spin-based threshold logic design is presented as an example (dynamic resistive threshold logic and magnetic threshold logic). Emerging spintronic phenomena may lead to ultra-low-voltage, current-mode, spin-torque switches that can offer attractive computing capabilities, beyond digital switches. Such devices may be suitable for non-Boolean data-processing applications which involve analog processing. Integration of such spin-torque devices with charge-based devices like CMOS and resistive memory can lead to highly energy-efficient information processing hardware for applications like pattern-matching, neuromorphic-computing, image-processing and data-conversion. Towards the end, we discuss the possibility of applying emerging spin-torque switches in the design of energy-efficient global interconnects for future chip multiprocessors.

preprint2013arXiv

Exploring Ultra Low-Power on-Chip Clocking Using Functionality Enhanced Spin-Torque Switches

Emerging spin-torque (ST) phenomena may lead to ultra-low-voltage, high-speed nano-magnetic switches. Such current-based-switches can be attractive for designing low swing global-interconnects like clocking-networks and data buses. In this work we present the basic idea of using such ST-switches for low-power on-chip clocking. For clocking-networks, Spin-Hall-Effect (SHE) can be used to produce an assist-field for fast ST-switching using a global-mesh-clock with less than 100mV swing. The ST-switch acts as a compact-latch, written by ultra-low-voltage input-pulses. The data is read using a high-resistance tunnel-junction. The clock-driven SHE write-assist can be shared among a large number of ST-latches, thereby reducing the load-capacitance for clock-distribution. The SHE assist can be activated by a low-swing clock (~150mV) and hence can facilitate ultra-low voltage clock-distribution. Owing to reduced clock-load and low-voltage operation, the proposed scheme can achieve 97% lower power for on-chip clocking as compared to the state of the art CMOS design. Rigorous device-circuit simulations and system-level modelling for the proposed scheme will be addressed in the future.

preprint2013arXiv

Spin Neurons: A Possible Path to Energy-Efficient Neuromorphic Computers

Recent years have witnessed growing interest in the field of brain-inspired computing based on neural-network architectures. In order to translate the related algorithmic models into powerful, yet energy-efficient cognitive-computing hardware, computing-devices beyond CMOS may need to be explored. The suitability of such devices to this field of computing would strongly depend upon how closely their physical characteristics match the essential computing primitives employed in such models. In this work we discuss the rationale of applying emerging spin-torque devices for bio-inspired computing. Recent spin-torque experiments have shown the path to low-current, low-voltage and high-speed magnetization switching in nano-scale magnetic devices. Such magneto-metallic, current-mode spin-torque switches can mimic the analog summing and thresholding operation of an artificial neuron with high energy-efficiency. Comparison with a CMOS-based analog circuit-model of a neuron shows that spin neurons can achieve more than two orders of magnitude lower energy and beyond three orders of magnitude reduction in energy-delay product. The application of spin neurons can therefore be an attractive option for neuromorphic computers of the future.

preprint2013arXiv

Spintronic Switches for Ultra Low Energy On-Chip and Inter-Chip Current-Mode Interconnects

Energy-efficiency and design-complexity of high-speed on-chip and inter-chip data-interconnects have emerged as the major bottleneck for high-performance computing-systems. As a solution, we propose an ultra-low energy interconnect design-scheme using nano-scale spin-torque switches. In the proposed method, data is transmitted in the form of current-pulses, with an amplitude of the order of a few micro-amperes, that flow across a small terminal-voltage of less than 50mV. Sub-nanosecond spin-torque switching of scaled nano-magnets can be used to receive and convert such a high-speed current-mode signal into binary voltage-levels using a magnetic-tunnel-junction (MTJ), with the help of a simple CMOS inverter. As a result of low-voltage, low-current signaling and minimal signal-conversion overhead, the proposed technique can facilitate highly compact and simplified designs for multi-gigahertz inter-chip and on-chip data-communication links. Such links can achieve more than 100x higher energy-efficiency, as compared to state of the art CMOS interconnects.

preprint2013arXiv

Ultra Low Energy Analog Image Processing Using Spin Neurons

In this work we present an ultra low energy, 'on-sensor' image processing architecture, based on a cellular array of spin-based neurons. The 'neuron' consists of a lateral spin valve (LSV) with multiple input magnets, connected to an output magnet using metal channels. The low-resistance, magneto-metallic neurons operate at a small terminal voltage of ~20mV, while performing analog computation upon photo sensor inputs. The static current flow across the device terminals is limited to small periods, corresponding to the magnet switching time, and is determined by a low duty-cycle system clock. Thus, the energy cost of analog-mode processing, inevitable in most image sensing applications, is reduced and made comparable to that of dynamic and leakage power consumption in peripheral CMOS units. Performance of the proposed architecture for some common image sensing and processing applications, like feature extraction, halftone compression and digitization, has been obtained through a physics-based device simulation framework coupled with SPICE. Results indicate that the proposed design scheme can achieve more than two orders of magnitude reduction in computation energy, as compared to state-of-the-art CMOS designs that are based on conventional mixed-signal image acquisition and processing schemes. To the best of the authors' knowledge, this is the first work where application of nano-magnets (in LSVs) in analog signal processing has been proposed.

preprint2013arXiv

Ultra Low Power Associative Computing with Spin Neurons and Resistive Crossbar Memory

Emerging resistive-crossbar memory (RCM) technology can be promising for computationally expensive analog pattern-matching tasks. However, the use of CMOS analog circuits with RCM would result in large power consumption and poor scalability, thereby forfeiting the benefits of RCM-based computation. We propose the use of low-voltage, fast-switching, magneto-metallic spin-neurons for ultra low-power non-Boolean computing with RCM. We present the design of an analog associative memory for face recognition using RCM, where substituting conventional analog circuits with spin-neurons can achieve ~100x lower power. This makes the proposed design ~1000x more energy-efficient than a 45nm-CMOS digital ASIC, thereby significantly enhancing the prospects of RCM-based computational hardware.

preprint2013arXiv

Ultra-High Density, High-Performance and Energy-Efficient All Spin Logic

All Spin Logic gates employ multiple nano-magnets interacting through spin-torque using non-magnetic channels. Compactness, non-volatility and ultra-low voltage operation are some of the attractive features of ASL, while, low switching-speed (of nano-magnets as compared to CMOS gates) and static-power dissipation can be identified as the major bottlenecks. In this work we explore design techniques that leverage the specific device characteristics of ASL to overcome the inefficiencies and to enhance the merits of this technology, for a given set of device parameters. We exploit the non-volatility of nano-magnets to model fully-pipelined ASL that can achieve higher performance. Clocking of power supply in pipelined ASL would require CMOS transistors that may consume significantly large voltage headroom and area, as compared to the nano-magnets. We show that the use of leaky transistors can significantly mitigate such bottlenecks, without sacrificing energy-efficiency and robustness. Exploiting the inherent isolation between the biasing charge current and spin-current paths in ASL, we propose to stack multiple ASL metal layers, leading to ultra-high-density and energy-efficient 3-D computation blocks. Results for the design of an FIR filter show that ASL can achieve performance and power consumption comparable to CMOS while the ultra-high-density of ASL can be projected as its main advantage over CMOS.

preprint2013arXiv

Ultra-low Energy, High Performance and Programmable Magnetic Threshold Logic

We propose a magnetic threshold-logic (MTL) design based on non-volatile spin-torque switches. A threshold logic gate (TLG) performs summation of multiple inputs multiplied by a fixed set of weights and compares the sum with a threshold. MTL employs resistive states of magnetic tunnel junctions as programmable input weights, while a low-voltage, domain-wall-shift based spin-torque switch is used for the thresholding operation. The resulting MTL gate acts as a low-power, configurable logic unit and can be used to build fully pipelined, high-performance programmable computing blocks. Multiple stages in such an MTL design can be connected using energy-efficient, ultralow-swing programmable interconnect networks based on resistive switches. Owing to memory-based compact logic and interconnect design and low-voltage, high-speed spin-torque based threshold operation, MTL can achieve more than two orders of magnitude improvement in energy-delay product as compared to look-up table based CMOS FPGA.
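The threshold-gate definition in this abstract (weighted sum compared against a threshold) can be illustrated with a minimal behavioural sketch. It models only the logical function, not the spin-torque or MTJ devices. The function names, the use of binary inputs, and the choice of a greater-than-or-equal comparison are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def threshold_gate(inputs: Sequence[int], weights: Sequence[float], threshold: float) -> int:
    """Behavioural model of a threshold logic gate (TLG).

    Returns 1 when the weighted sum of the inputs reaches the threshold,
    otherwise 0. The ">=" comparison is an assumption of this sketch.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


# Reprogramming the weights and threshold changes the logic function,
# which is the sense in which such a gate is "configurable".
def and2(a: int, b: int) -> int:
    return threshold_gate((a, b), (1, 1), 2)


def or2(a: int, b: int) -> int:
    return threshold_gate((a, b), (1, 1), 1)


def majority3(a: int, b: int, c: int) -> int:
    return threshold_gate((a, b, c), (1, 1, 1), 2)
```

Here the same gate realizes AND, OR and 3-input majority purely by changing the stored weights and threshold, which loosely mirrors the role the abstract assigns to programmable resistive states.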

preprint2013arXiv

Ultra-low Energy, High-Performance Dynamic Resistive Threshold Logic

We propose a dynamic resistive threshold-logic (DRTL) design based on non-volatile resistive memory. A threshold logic gate (TLG) performs summation of multiple inputs multiplied by a fixed set of weights and compares the sum with a threshold. DRTL employs resistive memory elements to implement the weights and the thresholds, while a compact dynamic CMOS latch is used for the comparison operation. The resulting DRTL gate acts as a low-power, configurable dynamic logic unit and can be used to build fully pipelined, high-performance programmable computing blocks. Multiple stages in such a DRTL design can be connected using energy-efficient, low-swing programmable interconnect networks based on resistive switches. Owing to memory-based compact logic and interconnect design and high-speed dynamic-pipelined operation, DRTL can achieve more than two orders of magnitude improvement in energy-delay product as compared to look-up table based CMOS FPGA.

preprint2012arXiv

Proposal For Neuromorphic Hardware Using Spin Devices

We present a design-scheme for ultra-low power neuromorphic hardware using emerging spin-devices. We propose device models for a 'neuron', based on lateral spin valves and domain wall magnets, that can operate at an ultra-low terminal voltage of ~20 mV, resulting in small computation energy. Magnetic tunnel junctions are employed for interfacing the spin-neurons with charge-based devices like CMOS, for large-scale networks. A device-circuit co-simulation framework is used for simulating such hybrid designs, in order to evaluate system-level performance. We present the design of different classes of neuromorphic architectures using the proposed scheme that can be suitable for different applications, like analog-data-sensing, data-conversion, cognitive-computing, associative memory, programmable-logic, and analog and digital signal processing. We show that the spin-based neuromorphic designs can achieve 15X-300X lower computation energy for these applications, as compared to state-of-the-art CMOS designs.

preprint2012arXiv

Spin-Based Neuron Model with Domain Wall Magnets as Synapse

We present an artificial neural network design using spin devices that achieves ultra low voltage operation, low power consumption, high speed, and high integration density. We employ spin torque switched nano-magnets for modelling neurons and domain wall magnets for compact, programmable synapses. The spin-based neuron-synapse units operate locally at an ultra low supply voltage of 30mV, resulting in low computation power. CMOS-based inter-neuron communication is employed to realize network-level functionality. We corroborate circuit operation with physics-based models developed for the spin devices. Simulation results for character recognition as a benchmark application show 95% lower power consumption as compared to a 45nm CMOS design.

preprint2011arXiv

Magnonic spin-transfer torque MRAM with low power, high speed, and error-free switching

A new class of spin-transfer torque magnetic random access memory (STT-MRAM) is discussed, in which writing is achieved using thermally initiated magnonic current pulses as an alternative to conventional electric current pulses. The magnonic pulses are used to destabilize the magnetic free layer from its initial direction, and are followed immediately by a bipolar electric current exerting conventional spin-transfer torque on the free layer. The combination of thermal and electric currents greatly reduces switching errors, and simultaneously reduces the electric switching current density by more than an order of magnitude as compared to conventional STT-MRAM. The energy efficiency of several possible electro-thermal circuit designs has been analyzed numerically. As compared to STT-MRAM with perpendicular magnetic anisotropy, magnonic STT-MRAM reduces the overall switching energy by almost 80%. Furthermore, the lower electric current density allows the use of thicker tunnel barriers, which should result in higher tunneling magneto-resistance and improved tunnel barrier reliability. The combination of lower power, improved reliability, higher integration density, and larger read margin makes magnonic STT-MRAM a promising choice for future non-volatile storage.

preprint2011arXiv

Thermoelectric Spin-Transfer Torque MRAM with Sub-Nanosecond Bi-Directional Writing using Magnonic Current

A new genre of Spin-Transfer Torque (STT) MRAM is proposed, in which bi-directional writing is achieved using thermoelectrically controlled magnonic current as an alternative to conventional electric current. The device uses a magnetic tunnel junction (MTJ), which is adjacent to a non-magnetic metallic film and a ferrite film. This film stack is heated or cooled by a Peltier element which creates a bi-directional magnonic pulse in the ferrite film. Conversion of magnons to spin current occurs at the ferrite-metal interface, and the resulting spin-transfer torque is used to achieve sub-nanosecond precessional switching of the ferromagnetic free layer in the MTJ. Compared to electric current driven STT-MRAM with perpendicular magnetic anisotropy (PMA), thermoelectric STT-MRAM reduces the overall magnetization switching energy by more than 40% for nanosecond switching, combined with a write error rate (WER) of less than $10^{-9}$ and a lifetime of 10 years or higher. The combination of higher thermal activation energy, sub-nanosecond read/write speed, improved tunneling magneto-resistance (TMR) and tunnel barrier reliability makes thermoelectric STT-MRAM a promising choice for future non-volatile memory applications.

preprint2010arXiv

Improved current saturation and shifted switching threshold voltage in In2O3 nanowire based, fully transparent NMOS inverters via femtosecond laser annealing

Transistors based on various types of non-silicon nanowires have shown great potential for a variety of applications, especially for those that require transparency and low-temperature substrates. However, critical requirements for circuit functionality, such as saturated source-drain current and matched threshold voltages of individual nanowire transistors, have not been achieved in a way that is compatible with low-temperature substrates. Here we show that femtosecond laser pulses can anneal individual transistors based on In2O3 nanowires, improve the saturation of the source-drain current, and permanently shift the threshold voltage in the positive direction. We applied this technique and successfully shifted the switching threshold voltages of NMOS-based inverters and improved their noise margin, in both depletion and enhancement modes. Our demonstration provides a method to trim the parameters of individual nanowire transistors, and suggests potential for large-scale integration of nanowire-based circuit blocks and systems.

preprint2007arXiv

Modeling and Analysis of Loading Effect in Leakage of Nano-Scaled Bulk-CMOS Logic Circuits

In nanometer-scaled CMOS devices, a significant increase in the subthreshold, gate and reverse-biased junction band-to-band-tunneling (BTBT) leakage results in a large increase of total leakage power in a logic circuit. Leakage components interact with each other at the device level (through device geometry and doping profile) and also at the circuit level (through node voltages). Due to the circuit-level interaction of the different leakage components, the leakage of a logic gate strongly depends on the circuit topology, i.e. the number and nature of the other logic gates connected to its input and output. In this paper, for the first time, we have analyzed the loading effect on leakage and proposed a method to accurately estimate the total leakage in a logic circuit from its logic-level description, considering the impact of loading and transistor stacking.

preprint2007arXiv

Statistical Modeling of Pipeline Delay and Design of Pipeline under Process Variation to Enhance Yield in sub-100nm Technologies

Operating frequency of a pipelined circuit is determined by the delay of the slowest pipeline stage. However, under statistical delay variation in the sub-100nm technology regime, the slowest stage is not readily identifiable and the estimation of the pipeline yield with respect to a target delay is a challenging problem. We have proposed analytical models to estimate yield for a pipelined design based on delay distributions of individual pipe stages. Using the proposed models, we have shown that a change in logic depth and imbalance between the stage delays can improve the yield of a pipeline. A statistical methodology has been developed to optimally design a pipeline circuit for enhancing yield. Optimization results show that proper imbalance among the stage delays in a pipeline improves design yield by 9% for the same area and performance (and reduces area by about 8.4% under a yield constraint) over a balanced design.
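To make the notion of pipeline yield concrete, the following is a minimal sketch of one simple way to compute it. It assumes each stage delay is an independent Gaussian random variable, so the yield for a target delay is the product of the per-stage probabilities of meeting that target. This independence and normality assumption, and the function names, are illustrative choices and not the analytical model proposed in the paper, which may treat correlation between stages differently.

```python
import math
from typing import Sequence, Tuple


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def pipeline_yield(stages: Sequence[Tuple[float, float]], target_delay: float) -> float:
    """Probability that every stage delay is at most target_delay.

    stages holds (mean, std) pairs, one per pipeline stage.
    Assumes independent Gaussian stage delays (an assumption of this sketch).
    """
    result = 1.0
    for mean, std in stages:
        result *= normal_cdf(target_delay, mean, std)
    return result
```

Because the pipeline passes only when its slowest stage passes, adding stages with the same delay distribution lowers the yield at a fixed target delay: one stage centred exactly on the target yields 0.5, while two such independent stages yield 0.25.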

79 published item(s)

ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

EV-Planner: Energy-Efficient Robot Navigation via Event-Based Physics-Guided Neuromorphic Planner

A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks

Can Language Models Capture Graph Semantics? From Graphs to Language Model and Vice-Versa

Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts

Low Precision Decentralized Distributed Training over IID and non-IID Data

Non-Volume Preserving-based Fusion to Group-Level Emotion Recognition on Crowd Videos

Norm-Scaling for Out-of-Distribution Detection

Process Knowledge-Infused AI: Towards User-level Explainability, Interpretability, and Safety

Process Knowledge-infused Learning for Suicidality Assessment on Social Media

"Is depression related to cannabis?": A knowledge-infused model for Entity and Relation Extraction with Limited Supervision

Knowledge Infused Policy Gradients for Adaptive Pandemic Control

SPACE: Structured Compression and Sharing of Representational Space for Continual Learning

A Low Effort Approach to Structured CNN Design Using PCA

Conditionally Deep Hybrid Neural Networks Across Edge and Cloud

Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation

Enabling Spike-based Backpropagation for Training Deep Neural Network Architectures

Explicitly Trained Spiking Sparsity in Spiking Neural Networks with Backpropagation

Fitted Q-Learning for Relational Domains

GENIEx: A Generalized Approach to Emulating Non-Ideality in Memristive Xbars using Neural Networks

Hyperparameter Optimization in Binary Communication Networks for Neuromorphic Deployment

IMAC: In-memory multi-bit Multiplication and ACcumulation in 6T SRAM Array

Inclusive prompt photon-jet correlations as a probe of gluon saturation in electron-nucleus scattering at small $x$

Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations

Relevant-features based Auxiliary Cells for Energy Efficient Detection of Natural Errors

RMP-SNN: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Network

RxNN: A Framework for Evaluating Deep Neural Networks on Resistive Crossbars

sBSNN: Stochastic-Bits Enabled Binary Spiking Neural Network with On-Chip Learning for Energy Efficient Neuromorphic Computing at the Edge

Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks

Towards Understanding the Effect of Leak in Spiking Neural Networks

Vec2Face: Unveil Human Faces from their Blackbox Features in Face Recognition

Constructing Energy-efficient Mixed-precision Neural Networks through Principal Component Analysis for Edge Intelligence

Controlled Forgetting: Targeted Stimulation and Dopaminergic Plasticity Modulation for Unsupervised Lifelong Learning in Spiking Neural Networks

Extracting many-body correlators of saturated gluons with precision from inclusive photon+dijet final states in deeply inelastic scattering

Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing

NLO impact factor for inclusive photon$+$dijet production in $e+A$ DIS at small $x$

Attention Tree: Learning Hierarchies of Visual Features for Large-Scale Image Recognition

Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition

Energy-Efficient Object Detection using Semantic Decomposition

Hybrid Spintronic-CMOS Spiking Neural Network With On-Chip Learning: Devices, Circuits and Systems

Ising spin model using Spin-Hall Effect (SHE) induced magnetization reversal in Magnetic-Tunnel-Junction

MESL: Proposal for a Non-volatile Cascadable Magneto-Electric Spin Logic

Probabilistic Deep Spiking Neural Systems Enabled by Magnetic Tunnel Junction

Proposal for an All-Spin Artificial Neural Network: Emulating Neural and Synaptic Functionalities Through Domain Wall Motion in Ferromagnets

Unsupervised Regenerative Learning of Hierarchical Features in Spiking Deep Networks for Object Recognition

Yield, Area and Energy Optimization in STT-MRAMs using failure aware ECC

Energy Efficient and High Performance Current-Mode Neural Network Circuit using Memristors and Digitally Assisted Analog CMOS Neurons

Exploring Spin-Transfer-Torque Devices for Logic Applications

High Sensitivity Biosensor using Injection Locked Spin Torque Nano-Oscillators

Modeling and Simulation of Spin Transfer Torque Generated at Topological Insulator/Ferromagnetic Heterostructure

Spin-Torque Sensors for Energy Efficient High Speed Long Interconnects

Design and Synthesis of Ultra Low Energy Spin-Memristor Threshold Logic

Hierarchical Temporal Memory Based on Spin-Neurons and Resistive Memory for Energy-Efficient Brain-Inspired Computing

Laser Induced Magnetization Reversal for Detection in Optical Interconnects

Multiple alignment of structures using center of proteins

Spin Orbit Torque Based Electronic Neuron

Spin-Orbit Torque Induced Spike-Timing Dependent Plasticity

STT-SNN: A Spin-Transfer-Torque Based Soft-Limiting Non-Linear Neuron for Low-Power Artificial Neural Networks

Boolean and Non-Boolean Computation With Spin Devices

DSTT-MRAM: Differential Spin Hall MRAM for On-chip Memories

Energy-Efficient and Robust Associative Computing with Electrically Coupled Dual Pillar Spin-Torque Oscillators

Exploring Boolean and Non-Boolean Computing Applications of Spin Torque Devices

Exploring Ultra Low-Power on-Chip Clocking Using Functionality Enhanced Spin-Torque Switches

Spin Neurons: A Possible Path to Energy-Efficient Neuromorphic Computers

Spintronic Switches for Ultra Low Energy On-Chip and Inter-Chip Current-Mode Interconnects

Ultra Low Energy Analog Image Processing Using Spin Neurons

Ultra Low Power Associative Computing with Spin Neurons and Resistive Crossbar Memory

Ultra-High Density, High-Performance and Energy-Efficient All Spin Logic

Ultra-low Energy, High Performance and Programmable Magnetic Threshold Logic

Ultra-low Energy, High-Performance Dynamic Resistive Threshold Logic

Proposal For Neuromorphic Hardware Using Spin Devices

Spin-Based Neuron Model with Domain Wall Magnets as Synapse