Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
57 works
0 followers
20 topics
4 close collaborators

Published work

57 published item(s)

preprint, 2026, arXiv

ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to 9.6% and a 35x smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.

preprint, 2026, arXiv

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

Text-to-image diffusion models can generate visually stunning images, yet controlling what appears and how it appears remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast and controllable image generation.

preprint, 2026, arXiv

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

In Online Continual Learning (OCL), a neural network sequentially learns from a non-stationary data stream in a single pass with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training runs for multiple epochs over large datasets. The main challenge faced by OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient-gating and meta-learned regularization. Gradient-gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients by evaluating the effect of a parameter update on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluated our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain incremental learning on CLEAR-10 and class incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves the highest accuracy among all baselines and positive Backward Transfer, overcoming forgetting on CLEAR-10.

preprint, 2024, arXiv

EV-Planner: Energy-Efficient Robot Navigation via Event-Based Physics-Guided Neuromorphic Planner

Vision-based object tracking is an essential precursor to performing autonomous aerial navigation in order to avoid obstacles. Biologically inspired neuromorphic event cameras are emerging as a powerful alternative to frame-based cameras, due to their ability to asynchronously detect varying intensities (even in poor lighting conditions), high dynamic range, and robustness to motion blur. Spiking neural networks (SNNs) have gained traction for processing events asynchronously in an energy-efficient manner. On the other hand, physics-based artificial intelligence (AI) has gained prominence recently, as it enables embedding system knowledge via physical modeling inside traditional analog neural networks (ANNs). In this letter, we present an event-based physics-guided neuromorphic planner (EV-Planner) to perform obstacle avoidance using neuromorphic event cameras and physics-based AI. We consider the task of autonomous drone navigation where the mission is to detect moving gates and fly through them while avoiding a collision. We use event cameras to perform object detection using a shallow spiking neural network in an unsupervised fashion. Utilizing the physical equations of the brushless DC motors present in the drone rotors, we train a lightweight energy-aware physics-guided neural network (PgNN) with depth inputs. This predicts the optimal flight time responsible for generating near-minimum energy paths. We spawn the drone in the Gazebo simulator and implement a sensor-fused vision-to-planning neuro-symbolic framework using Robot Operating System (ROS). Simulation results for safe collision-free flight trajectories are presented with performance analysis, ablation study and potential future research directions.

preprint, 2022, arXiv

A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks

Deep Learning neural networks are pervasive, but traditional computer architectures are reaching the limits of being able to efficiently execute them for the large workloads of today. They are limited by the von Neumann bottleneck: the high cost in energy and latency incurred in moving data between memory and the compute engine. Today, special CMOS designs address this bottleneck. The next generation of computing hardware will need to eliminate or dramatically mitigate this bottleneck. We discuss how compute-in-memory can play an important part in this development. Here, a non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation, repeatedly used in all neural network workloads. The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines. In the first part of this review we take a co-design view of the design constraints and the demands it places on the new materials and memory devices that anchor the cross-bar architecture. In the second part, we review what is known about the different new non-volatile memory materials and devices suited for compute in-memory, and discuss the outlook and challenges.

preprint, 2022, arXiv

Can Language Models Capture Graph Semantics? From Graphs to Language Model and Vice-Versa

Knowledge Graphs are a great resource to capture semantic knowledge in terms of entities and relationships between the entities. However, current deep learning models take as input distributed representations or vectors. Thus, the graph is compressed into a vectorized representation. We conduct a study to examine whether a deep learning model can compress a graph and then output the same graph with most of the semantics intact. Our experiments show that Transformer models are not able to express the full semantics of the input knowledge graph. We find that this is due to the disparity between the directed, relationship- and type-based information contained in a Knowledge Graph and the fully connected, undirected token-token graphical interpretation of the Transformer Attention matrix.

preprint, 2022, arXiv

Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts

Conversational Agents (CAs) powered with deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, the CAs have been used to provide informational or therapeutic services to patients. However, the utility of CAs to assist in mental health triaging has not been explored in the existing work as it requires a controlled generation of follow-up questions (FQs), which are often initiated and guided by the mental health professionals (MHPs) in clinical settings. In the context of depression, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support. Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user's initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user's post and which ones would require more information from the user. Using performance analysis based on MCC scores, we show that PRIMATE is appropriate for identifying questions in PHQ-9 that could guide generative DLMs towards controlled FQ generation suitable for aiding triaging. Dataset created as a part of this research: https://github.com/primate-mh/Primate2022

preprint, 2022, arXiv

Low Precision Decentralized Distributed Training over IID and non-IID Data

Decentralized distributed learning is the key to enabling large-scale machine learning (training) on edge devices utilizing private user-generated local data, without relying on the cloud. However, the practical realization of such on-device training is limited by the communication and compute bottleneck. In this paper, we propose and show the convergence of low precision decentralized training that aims to reduce the computational complexity and communication cost of decentralized training. Many feedback-based compression techniques have been proposed in the literature to reduce communication costs. To the best of our knowledge, there is no work that applies and demonstrates compute-efficient training techniques such as quantization and pruning for peer-to-peer decentralized learning setups. Since real-world applications have a significant skew in the data distribution, we design "Range-EvoNorm" as the normalization activation layer, which is better suited for low precision training over non-IID data. Moreover, we show that the proposed low precision training can be used in synergy with other communication compression methods, decreasing the communication cost further. Our experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full precision counterpart even with non-IID data. However, when low precision training is accompanied by communication compression through sparsification, we observe a 1-2% drop in accuracy. The proposed low precision decentralized training decreases computational complexity, memory usage, and communication cost by 4x and compute energy by a factor of ~20x, while trading off less than 1% accuracy for both IID and non-IID data. In particular, with higher skew values, we observe an increase in accuracy (by ~0.5%) with low precision training, indicating the regularization effect of the quantization.

preprint, 2022, arXiv

Non-Volume Preserving-based Fusion to Group-Level Emotion Recognition on Crowd Videos

Group-level emotion recognition (ER) is a growing research area, as assessing crowds of all sizes is becoming of interest in both the security arena and social media. This work extends the earlier ER investigations, which focused on either group-level ER on single images or within a video, by fully investigating group-level expression recognition on crowd videos. In this paper, we propose an effective deep feature level fusion mechanism to model the spatial-temporal information in the crowd videos. In our approach, the fusing process is performed on the deep feature domain by a generative probabilistic model, Non-Volume Preserving Fusion (NVPF), that models spatial information relationships. Furthermore, we extend our proposed spatial NVPF approach to the spatial-temporal NVPF approach to learn the temporal information between frames. To demonstrate the robustness and effectiveness of each component in the proposed approach, three experiments were conducted: (i) evaluation on the AffectNet database to benchmark the proposed EmoNet for recognizing facial expression; (ii) evaluation on EmotiW2018 to benchmark the proposed deep feature level fusion mechanism NVPF; and (iii) evaluation of the proposed TNVPF on an innovative Group-level Emotion on Crowd Videos (GECV) dataset composed of 627 videos collected from publicly available sources. The GECV dataset is a collection of videos containing crowds of people. Each video is labeled with emotion categories at three levels: individual faces, group of people, and the entire video frame.

preprint, 2022, arXiv

Norm-Scaling for Out-of-Distribution Detection

Out-of-Distribution (OoD) inputs are examples that do not belong to the true underlying distribution of the dataset. Research has shown that deep neural nets make confident mispredictions on OoD inputs. Therefore, it is critical to identify OoD inputs for safe and reliable deployment of deep neural nets. Often a threshold is applied on a similarity score to detect OoD inputs. One such similarity is angular similarity, which is the dot product of the latent representation with the mean class representation. Angular similarity encodes uncertainty: for example, if the angular similarity is lower, it is less certain that the input belongs to that class. However, we observe that different classes have different distributions of angular similarity. Therefore, applying a single threshold for all classes is not ideal, since the same similarity score represents different uncertainties for different classes. In this paper, we propose norm-scaling, which normalizes the logits separately for each class. This ensures that a single value consistently represents similar uncertainty for various classes. We show that norm-scaling, when used with the maximum softmax probability detector, achieves a 9.78% improvement in AUROC, a 5.99% improvement in AUPR and a 33.19% reduction in FPR95 over previous state-of-the-art methods.

preprint, 2022, arXiv

Process Knowledge-Infused AI: Towards User-level Explainability, Interpretability, and Safety

AI systems have been widely adopted across various domains in the real world. However, in high-value, sensitive, or safety-critical applications such as self-management for personalized health or food recommendation with a specific purpose (e.g., allergy-aware recipe recommendations), their adoption is unlikely. First, the AI system needs to follow guidelines or well-defined processes set by experts; the data alone will not be adequate. For example, to diagnose the severity of depression, mental healthcare providers use the Patient Health Questionnaire (PHQ-9). So if an AI system were to be used for diagnosis, the medical guideline implied by the PHQ-9 needs to be used. Likewise, a nutritionist's knowledge and steps would need to be used for an AI system that guides a diabetic patient in developing a food plan. Second, the black-box nature typical of many current AI systems will not work; the user of an AI system will need to be given user-understandable explanations, that is, explanations constructed using concepts that humans can understand and are familiar with. This is the key to eliciting confidence and trust in the AI system. For such applications, in addition to data and domain knowledge, the AI systems need to have access to and use the Process Knowledge, an ordered set of steps that the AI system needs to use or adhere to.

preprint, 2022, arXiv

Process Knowledge-infused Learning for Suicidality Assessment on Social Media

Improving the performance and natural language explanations of deep learning algorithms is a priority for adoption by humans in the real world. In several domains, such as healthcare, such technology has significant potential to reduce the burden on humans by providing quality assistance at scale. However, current methods rely on the traditional pipeline of predicting labels from data, thus completely ignoring the process and guidelines used to obtain the labels. Furthermore, post hoc explanations of the data-to-label prediction using explainable AI (XAI) models, while satisfactory to computer scientists, leave much to be desired for end-users because they lack explanations of the process in terms of human-understandable concepts. We introduce, formalize, and develop a novel Artificial Intelligence (AI) paradigm, Process Knowledge-infused Learning (PK-iL). PK-iL utilizes structured process knowledge that explicitly explains the underlying prediction process in a way that makes sense to end-users. The qualitative human evaluation confirms, through an annotator agreement of 0.72, that humans understand the explanations for the predictions. PK-iL also performs competitively with the state-of-the-art (SOTA) baselines.

preprint, 2021, arXiv

"Is depression related to cannabis?": A knowledge-infused model for Entity and Relation Extraction with Limited Supervision

With strong marketing advocacy of the benefits of cannabis use for improved mental health, cannabis legalization is a priority among legislators. However, preliminary scientific research does not conclusively associate cannabis with improved mental health. In this study, we explore the relationship between depression and consumption of cannabis in a targeted social media corpus involving personal use of cannabis with the intent to derive its potential mental health benefit. We use tweets that contain an association among three categories annotated by domain experts: Reason, Effect, and Addiction. The state-of-the-art Natural Language Processing techniques fall short in extracting these relationships between cannabis phrases and the depression indicators. We seek to address the limitation by using domain knowledge; specifically, the Drug Abuse Ontology for addiction augmented with Diagnostic and Statistical Manual of Mental Disorders lexicons for mental health. Because of the lack of annotations due to the limited availability of the domain experts' time, we use supervised contrastive learning in conjunction with GPT-3 trained on a vast corpus to achieve improved performance even with limited supervision. Experimental results show that our method extracts cannabis-depression relationships significantly better than the state-of-the-art relation extractor. High-quality annotations can be provided using a nearest neighbor approach on the learned representations, which the scientific community can use to better understand the association between cannabis and depression.

preprint, 2021, arXiv

Knowledge Infused Policy Gradients for Adaptive Pandemic Control

COVID-19 has impacted nations differently based on their policy implementations. Effective policy requires taking into account public information and adaptability to new knowledge. Epidemiological models built to understand COVID-19 seldom provide the policymaker with the capability for adaptive pandemic control (APC). The core challenges to be overcome include (a) inability to handle a high degree of non-homogeneity in different contributing features across the pandemic timeline, (b) lack of an approach that enables adaptive incorporation of public health expert knowledge, and (c) lack of transparent models that enable understanding of the decision-making process in suggesting policy. In this work, we take the early steps to address these challenges using Knowledge Infused Policy Gradient (KIPG) methods. Prior work on knowledge infusion does not handle soft and hard imposition of varying forms of knowledge in disease information and guidelines that must be complied with. Furthermore, the models do not attend to non-homogeneity in feature counts, manifesting as partial observability in informing the policy. Additionally, interpretable structures are extracted post-learning instead of learning an interpretable model required for APC. To this end, we introduce a mathematical framework for KIPG methods that can (a) induce relevant feature counts over multi-relational features of the world, (b) handle latent non-homogeneous counts as hidden variables that are linear combinations of kernelized aggregates over the features, and (c) infuse knowledge as functional constraints in a principled manner. The study establishes a theory for imposing hard and soft constraints and simulates it through experiments. In comparison with knowledge-intensive baselines, we show quick, sample-efficient adaptation to new knowledge and interpretability in the learned policy, especially in a pandemic context.

preprint, 2021, arXiv

SPACE: Structured Compression and Sharing of Representational Space for Continual Learning

Humans learn adaptively and efficiently throughout their lives. However, incrementally learning tasks causes artificial neural networks to overwrite relevant information learned about older tasks, resulting in 'Catastrophic Forgetting'. Efforts to overcome this phenomenon often utilize resources poorly, for instance, by growing the network architecture or needing to save parametric importance scores, or violate data privacy between tasks. To tackle this, we propose SPACE, an algorithm that enables a network to learn continually and efficiently by partitioning the learnt space into a Core space, that serves as the condensed knowledge base over previously learned tasks, and a Residual space, which is akin to a scratch space for learning the current task. After learning each task, the Residual is analyzed for redundancy, both within itself and with the learnt Core space. A minimal number of extra dimensions required to explain the current task are added to the Core space and the remaining Residual is freed up for learning the next task. We evaluate our algorithm on P-MNIST, CIFAR and a sequence of 8 different datasets, and achieve comparable accuracy to the state-of-the-art methods while overcoming catastrophic forgetting. Additionally, our algorithm is well suited for practical use. The partitioning algorithm analyzes all layers in one shot, ensuring scalability to deeper networks. Moreover, the analysis of dimensions translates to filter-level sparsity, and the structured nature of the resulting architecture gives us up to 5x improvement in energy efficiency during task inference over the current state-of-the-art.

preprint, 2020, arXiv

A Low Effort Approach to Structured CNN Design Using PCA

Deep learning models hold state of the art performance in many fields, yet their design is still based on heuristics or grid search methods that often result in overparametrized networks. This work proposes a method to analyze a trained network and deduce an optimized, compressed architecture that preserves accuracy while keeping computational costs tractable. Model compression is an active field of research that targets the problem of realizing deep learning models in hardware. However, most pruning methodologies tend to be experimental, requiring large compute and time intensive iterations of retraining the entire network. We introduce structure into model design by proposing a single shot analysis of a trained network that serves as a first order, low effort approach to dimensionality reduction, by using PCA (Principal Component Analysis). The proposed method simultaneously analyzes the activations of each layer and considers the dimensionality of the space described by the filters generating these activations. It optimizes the architecture in terms of number of layers, and number of filters per layer without any iterative retraining procedures, making it a viable, low effort technique to design efficient networks. We demonstrate the proposed methodology on AlexNet and VGG style networks on the CIFAR-10, CIFAR-100 and ImageNet datasets, and successfully achieve an optimized architecture with a reduction of up to 3.8X and 9X in the number of operations and parameters respectively, while trading off less than 1% accuracy. We also apply the method to MobileNet, and achieve 1.7X and 3.9X reduction in the number of operations and parameters respectively, while improving accuracy by almost one percentage point.
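
The PCA-based sizing idea in this abstract can be illustrated with a minimal sketch: given the variance explained by each principal component of a layer's activations, pick the smallest number of components that reaches a target share of the total variance. The function name `significant_dimensions`, the default threshold of 0.999, and the example values are assumptions made for illustration; they are not taken from the paper.

```python
def significant_dimensions(explained_variances, threshold=0.999):
    """Return the smallest k such that the top-k components explain
    at least `threshold` of the total variance.

    `explained_variances` holds one non-negative value per principal
    component (hypothetical input; in practice it would come from a PCA
    of a layer's activations).
    """
    if not explained_variances:
        raise ValueError("need at least one component")
    ordered = sorted(explained_variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")

    cumulative = 0.0
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    # Floating-point rounding may keep the ratio just under the threshold.
    return len(ordered)


# Example: a layer with four filters whose activations are dominated by
# two directions needs only two filters at a 75% variance target.
print(significant_dimensions([5.0, 3.0, 1.0, 1.0], threshold=0.75))
```

Under this reading, the returned count would serve as a candidate filter count for the layer; how the paper maps components to layers and filters is not specified in the abstract.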

preprint, 2020, arXiv

Conditionally Deep Hybrid Neural Networks Across Edge and Cloud

The pervasiveness of "Internet-of-Things" in our daily life has led to a recent surge in fog computing, encompassing a collaboration of cloud computing and edge intelligence. To that effect, deep learning has been a major driving force towards enabling such intelligent systems. However, growing model sizes in deep learning pose a significant challenge towards deployment in resource-constrained edge devices. Moreover, in a distributed intelligence environment, efficient workload distribution is necessary between edge and cloud systems. To address these challenges, we propose a conditionally deep hybrid neural network for enabling AI-based fog computing. The proposed network can be deployed in a distributed manner, consisting of quantized layers and early exits at the edge and full-precision layers on the cloud. During inference, if an early exit has high confidence in the classification results, it allows samples to exit at the edge, and the deeper layers on the cloud are activated conditionally, which can lead to improved energy efficiency and inference latency. We perform an extensive design space exploration with the goal of minimizing energy consumption at the edge while achieving state-of-the-art classification accuracies on image classification tasks. We show that with binarized layers at the edge, the proposed conditional hybrid network can process 65% of inferences at the edge, leading to 5.5x computational energy reduction with minimal accuracy degradation on the CIFAR-10 dataset. For the more complex CIFAR-100 dataset, we observe that the proposed network with 4-bit quantization at the edge achieves 52% early classification at the edge with 4.8x energy reduction. The analysis gives us insights on designing efficient hybrid networks which achieve significantly higher energy efficiency than full-precision networks for edge-cloud based distributed intelligence systems.
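
The conditional edge/cloud inference described in this abstract can be sketched as a confidence check on the early exit. This is a minimal illustration only: the function names, the use of the maximum softmax probability as the confidence measure, the 0.9 threshold, and the `cloud_model` callable are all assumptions, not details from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_early_exit(edge_logits, cloud_model, confidence_threshold=0.9):
    """Return (predicted_class, location).

    `edge_logits` are the early-exit outputs computed on the edge device.
    `cloud_model` is a hypothetical zero-argument callable that runs the
    deeper layers and returns their logits; it is only invoked when the
    edge prediction is not confident enough.
    """
    probs = softmax(edge_logits)
    confidence = max(probs)
    if confidence >= confidence_threshold:
        return probs.index(confidence), "edge"

    cloud_probs = softmax(cloud_model())
    return cloud_probs.index(max(cloud_probs)), "cloud"


# A confident early exit stays on the edge; an uncertain one defers to the cloud.
print(classify_with_early_exit([10.0, 0.0, 0.0], lambda: [0.0, 1.0, 0.0]))
print(classify_with_early_exit([1.0, 1.0, 1.0], lambda: [0.0, 5.0, 0.0]))
```

Passing the cloud stage as a callable keeps the deeper layers from running at all when the sample exits early, which is where the energy and latency savings in such a scheme would come from.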

preprint, 2020, arXiv

Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation

Spiking Neural Networks (SNNs) operate with asynchronous discrete events (or spikes), which can potentially lead to higher energy efficiency in neuromorphic hardware implementations. Many works have shown that an SNN for inference can be formed by copying the weights from a trained Artificial Neural Network (ANN) and setting the firing threshold for each layer as the maximum input received in that layer. These types of converted SNNs require a large number of time steps to achieve competitive accuracy, which diminishes the energy savings. The number of time steps can be reduced by training SNNs with spike-based backpropagation from scratch, but that is computationally expensive and slow. To address these challenges, we present a computationally efficient training technique for deep SNNs. We propose a hybrid training methodology: 1) take a converted SNN and use its weights and thresholds as an initialization step for spike-based backpropagation, and 2) perform incremental spike-timing dependent backpropagation (STDB) on this carefully initialized network to obtain an SNN that converges within a few epochs and requires fewer time steps for input processing. STDB is performed with a novel surrogate gradient function defined using the neuron's spike time. The proposed training methodology converges in less than 20 epochs of spike-based backpropagation for most standard image classification datasets, thereby greatly reducing the training complexity compared to training SNNs from scratch. We perform experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets for both VGG and ResNet architectures. We achieve a top-1 accuracy of 65.19% on the ImageNet dataset with an SNN using 250 time steps, which is 10X faster compared to converted SNNs with similar accuracy.
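
The conversion step mentioned in this abstract, setting each layer's firing threshold to the maximum input received in that layer, can be sketched as follows. The function names, the data layout, and the non-leaky integrate-and-fire neuron with reset by subtraction are assumptions chosen for illustration; the abstract does not specify the neuron model used during conversion.

```python
def layer_thresholds(recorded_inputs):
    """Pick one firing threshold per layer: the largest input seen.

    `recorded_inputs` is a hypothetical list with one entry per layer,
    each a list of weighted-input values observed on calibration data.
    """
    thresholds = []
    for layer_inputs in recorded_inputs:
        if not layer_inputs:
            raise ValueError("each layer needs at least one recorded input")
        thresholds.append(max(layer_inputs))
    return thresholds


def integrate_and_fire(input_current, threshold, time_steps):
    """Count spikes of a non-leaky integrate-and-fire neuron driven by a
    constant input. The membrane potential is reduced by the threshold
    after each spike (reset by subtraction, an assumption of this sketch).
    """
    membrane = 0.0
    spikes = 0
    for _ in range(time_steps):
        membrane += input_current
        if membrane >= threshold:
            spikes += 1
            membrane -= threshold
    return spikes


thresholds = layer_thresholds([[0.2, 1.5, 0.7], [3.0, 2.0]])
print(thresholds)
# An input at half the threshold fires on every second time step.
print(integrate_and_fire(0.5, 1.0, 10))
```

In this toy model the spike rate approximates the ratio of input to threshold, and that approximation only becomes fine-grained over many time steps, which is one way to see why converted SNNs can need long simulation windows.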

preprint, 2020, arXiv

Enabling Spike-based Backpropagation for Training Deep Neural Network Architectures

Spiking Neural Networks (SNNs) have recently emerged as a prominent neural computing paradigm. However, the typical shallow SNN architectures have limited capacity for expressing complex representations, while training deep SNNs using input spikes has not been successful so far. Diverse methods have been proposed to get around this issue, such as converting off-the-shelf trained deep Artificial Neural Networks (ANNs) to SNNs. However, the ANN-SNN conversion scheme fails to capture the temporal dynamics of a spiking system. On the other hand, it is still a difficult problem to directly train deep SNNs using input spike events due to the discontinuous, non-differentiable nature of the spike generation function. To overcome this problem, we propose an approximate derivative method that accounts for the leaky behavior of LIF neurons. This method enables training deep convolutional SNNs directly (with input spike events) using spike-based backpropagation. Our experiments show the effectiveness of the proposed spike-based learning on deep networks (VGG and Residual architectures) by achieving the best classification accuracies on the MNIST, SVHN and CIFAR-10 datasets compared to other SNNs trained with spike-based learning. Moreover, we analyze sparse event-based computations to demonstrate the efficacy of the proposed SNN training method for inference operation in the spiking domain.

preprint2020arXiv

Explicitly Trained Spiking Sparsity in Spiking Neural Networks with Backpropagation

Spiking Neural Networks (SNNs) are being explored for their potential energy efficiency resulting from sparse, event-driven computations. Many recent works have demonstrated effective backpropagation for deep SNNs by approximating gradients over discontinuous neuron spikes or firing events. A beneficial side-effect of these surrogate gradient spiking backpropagation algorithms is that the spikes, which trigger additional computations, may now themselves be directly considered in the gradient calculations. We propose an explicit inclusion of spike counts in the loss function, along with a traditional error loss, causing the backpropagation learning algorithms to optimize weight parameters for both accuracy and spiking sparsity. As supported by existing theory of over-parameterized neural networks, there are many solution states with effectively equivalent accuracy. As such, appropriate weighting of the two loss goals during training in this multi-objective optimization process can yield an improvement in spiking sparsity without a significant loss of accuracy. We additionally explore a simulated annealing-inspired loss weighting technique to increase the weighting for sparsity as training time increases. Our preliminary results on the CIFAR-10 dataset show up to 70.1% reduction in spiking activity with iso-accuracy compared to an equivalent SNN trained only for accuracy and up to 73.3% reduction in spiking activity if allowed a trade-off of 1% reduction in classification accuracy.
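
As a rough illustration of the two-term objective described in the abstract above, the sketch below adds a weighted spike-count penalty to an error loss and increases the weight as training proceeds. The function names, the linear ramp, and the `max_weight` value are assumptions made for illustration only; the abstract says the weighting is simulated annealing-inspired but does not give the actual schedule.

```python
def sparsity_weight(epoch, total_epochs, max_weight=1e-3):
    """Hypothetical schedule: ramp the sparsity weight linearly from 0 to max_weight."""
    progress = epoch / max(1, total_epochs - 1)
    return max_weight * progress


def combined_loss(error_loss, spike_count, weight):
    """Error loss plus a penalty proportional to the total spike count."""
    return error_loss + weight * spike_count


# The same error loss and spike count are penalized more heavily late in training.
weights = [sparsity_weight(epoch, 10) for epoch in range(10)]
early = combined_loss(0.5, 2000, weights[0])
late = combined_loss(0.5, 2000, weights[-1])
```

With a schedule of this shape, early epochs optimize almost purely for accuracy, and the pressure toward fewer spikes grows only once the network has already found a reasonable solution.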

preprint2020arXiv

Fitted Q-Learning for Relational Domains

We consider the problem of Approximate Dynamic Programming in relational domains. Inspired by the success of fitted Q-learning methods in propositional settings, we develop the first relational fitted Q-learning algorithms by representing the value function and Bellman residuals. When we fit the Q-functions, we show how the two steps of the Bellman operator, application and projection, can be performed using a gradient-boosting technique. Our proposed framework performs reasonably well on standard domains without using domain models and using fewer training trajectories.

preprint2020arXiv

GENIEx: A Generalized Approach to Emulating Non-Ideality in Memristive Xbars using Neural Networks

The analog nature of computing in memristive crossbars poses significant issues due to various non-idealities such as parasitic resistances, non-linear I-V characteristics of the device, etc. The non-idealities can have a detrimental impact on the functionality, i.e., the computational accuracy, of crossbars. Past works have explored modeling the non-idealities using analytical techniques. However, several non-idealities have data-dependent behavior. This cannot be captured using analytical (non data-dependent) models, thereby limiting their suitability in predicting application accuracy. To address this, we propose a Generalized Approach to Emulating Non-Ideality in Memristive Crossbars using Neural Networks (GENIEx), which accurately captures the data-dependent nature of non-idealities. We perform extensive HSPICE simulations of crossbars with different voltage and conductance combinations. Following that, we train a neural network to learn the transfer characteristics of the non-ideal crossbar. Next, we build a functional simulator which includes key architectural facets such as tiling and bit-slicing to analyze the impact of non-idealities on the classification accuracy of large-scale neural networks. We show that GENIEx achieves low root mean square errors (RMSE) of $0.25$ and $0.7$ for low and high voltages, respectively, compared to HSPICE. Additionally, the GENIEx errors are $7\times$ and $12.8\times$ better than an analytical model which can only capture the linear non-idealities. Further, using the functional simulator and GENIEx, we demonstrate that an analytical model can overestimate the degradation in classification accuracy by $\ge 10\%$ on CIFAR-100 and $3.7\%$ on ImageNet datasets compared to GENIEx.

preprint2020arXiv

Hyperparameter Optimization in Binary Communication Networks for Neuromorphic Deployment

Training neural networks for neuromorphic deployment is non-trivial. There have been a variety of approaches proposed to adapt back-propagation or back-propagation-like algorithms appropriate for training. Considering that these networks often have very different performance characteristics than traditional neural networks, it is often unclear how to set either the network topology or the hyperparameters to achieve optimal performance. In this work, we introduce a Bayesian approach for optimizing the hyperparameters of an algorithm for training binary communication networks that can be deployed to neuromorphic hardware. We show that by optimizing the hyperparameters on this algorithm for each dataset, we can achieve improvements in accuracy over the previous state-of-the-art for this algorithm on each dataset (by up to 15 percent). This jump in performance continues to emphasize the potential when converting traditional neural networks to binary communication applicable to neuromorphic hardware.

preprint2020arXiv

IMAC: In-memory multi-bit Multiplication and ACcumulation in 6T SRAM Array

'In-memory computing' is being widely explored as a novel computing paradigm to mitigate the well-known memory bottleneck. This emerging paradigm aims at embedding some aspects of computations inside the memory array, thereby avoiding frequent and expensive movement of data between the compute unit and the storage memory. In-memory computing with respect to silicon memories has been widely explored on various memory bit-cells. Embedding computation inside the 6 transistor (6T) SRAM array is of special interest since it is the most widely used on-chip memory. In this paper, we present a novel in-memory multiplication followed by accumulation operation capable of performing parallel dot products within 6T SRAM without any changes to the standard bitcell. We further study the effect of circuit non-idealities and process variations on the accuracy of the LeNet-5 and VGG neural network architectures against the MNIST and CIFAR-10 datasets, respectively. The proposed in-memory dot-product mechanism achieves 88.8% and 99% accuracy for CIFAR-10 and MNIST, respectively. Compared to the standard von Neumann system, the proposed system is 6.24x better in energy consumption and 9.42x better in delay.

preprint2020arXiv

Inclusive prompt photon-jet correlations as a probe of gluon saturation in electron-nucleus scattering at small $x$

We compute the differential cross-section for inclusive prompt photon$+$quark production in deeply inelastic scattering of electrons off nuclei at small $x$ ($e+A$ DIS) in the framework of the Color Glass Condensate effective field theory. The result is expressed as a convolution of the leading order (in the strong coupling $α_{\mathrm{s}}$) impact factor for the process and universal dipole matrix elements, in the limit of hard photon transverse momentum relative to the nuclear saturation scale $Q_{s,A}(x)$. We perform a numerical study of this process for the kinematics of the Electron-Ion Collider (EIC), exploring in particular the azimuthal angle correlations between the final state photon and quark. We observe a systematic suppression and broadening pattern of the back-to-back peak in the relative azimuthal angle distribution, as the saturation scale is increased by replacing proton targets with gold nuclei. Our results suggest that photon+jet final states in inclusive $e+A$ DIS at high energies are in general a promising channel for exploring gluon saturation that is complementary to inclusive and diffractive dijet production. They also provide a sensitive empirical test of the universality of dipole matrix elements when compared to identical measurements in proton-nucleus collisions. However because photon+jet correlations at small $x$ in EIC kinematics require jet reconstruction at small $k_\perp$, it will be important to study their feasibility relative to photon-hadron correlations.

preprint2020arXiv

Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations

In the recent quest for trustworthy neural networks, we present Spiking Neural Network (SNN) as a potential candidate for inherent robustness against adversarial attacks. In this work, we demonstrate that adversarial accuracy of SNNs under gradient-based attacks is higher than their non-spiking counterparts for CIFAR datasets on deep VGG and ResNet architectures, particularly in blackbox attack scenario. We attribute this robustness to two fundamental characteristics of SNNs and analyze their effects. First, we exhibit that input discretization introduced by the Poisson encoder improves adversarial robustness with reduced number of timesteps. Second, we quantify the amount of adversarial accuracy with increased leak rate in Leaky-Integrate-Fire (LIF) neurons. Our results suggest that SNNs trained with LIF neurons and smaller number of timesteps are more robust than the ones with IF (Integrate-Fire) neurons and larger number of timesteps. Also we overcome the bottleneck of creating gradient-based adversarial inputs in temporal domain by proposing a technique for crafting attacks from SNN

preprint2020arXiv

Relevant-features based Auxiliary Cells for Energy Efficient Detection of Natural Errors

Deep neural networks have demonstrated state-of-the-art performance on many classification tasks. However, they have no inherent capability to recognize when their predictions are wrong. There have been several efforts in the recent past to detect natural errors but the suggested mechanisms pose additional energy requirements. To address this issue, we propose an ensemble of classifiers at hidden layers to enable energy efficient detection of natural errors. In particular, we append Relevant-features based Auxiliary Cells (RACs) which are class specific binary linear classifiers trained on relevant features. The consensus of RACs is used to detect natural errors. Based on combined confidence of RACs, classification can be terminated early, thereby resulting in energy efficient detection. We demonstrate the effectiveness of our technique on various image classification datasets such as CIFAR-10, CIFAR-100 and Tiny-ImageNet.

preprint2020arXiv

RMP-SNN: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Network

Spiking Neural Networks (SNNs) have recently attracted significant research interest as the third generation of artificial neural networks that can enable low-power event-driven data analytics. The best performing SNNs for image recognition tasks are obtained by converting a trained Analog Neural Network (ANN), consisting of Rectified Linear Units (ReLU), to SNN composed of integrate-and-fire neurons with "proper" firing thresholds. The converted SNNs typically incur loss in accuracy compared to that provided by the original ANN and require sizable number of inference time-steps to achieve the best accuracy. We find that performance degradation in the converted SNN stems from using "hard reset" spiking neuron that is driven to fixed reset potential once its membrane potential exceeds the firing threshold, leading to information loss during SNN inference. We propose ANN-SNN conversion using "soft reset" spiking neuron model, referred to as Residual Membrane Potential (RMP) spiking neuron, which retains the "residual" membrane potential above threshold at the firing instants. We demonstrate near loss-less ANN-SNN conversion using RMP neurons for VGG-16, ResNet-20, and ResNet-34 SNNs on challenging datasets including CIFAR-10 (93.63% top-1), CIFAR-100 (70.93% top-1), and ImageNet (73.09% top-1 accuracy). Our results also show that RMP-SNN surpasses the best inference accuracy provided by the converted SNN with "hard reset" spiking neurons using 2-8 times fewer inference time-steps across network architectures and datasets.
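
The difference between the "hard reset" and "soft reset" behavior described above can be seen in a toy integrate-and-fire simulation. This is a minimal sketch, not the paper's implementation: the function name, the constant input sequence, and the choice to fire when the potential reaches (rather than strictly exceeds) the threshold are assumptions made for illustration.

```python
def count_spikes(inputs, threshold=1.0, soft_reset=True):
    """Simulate one integrate-and-fire neuron and return its spike count."""
    potential = 0.0
    spikes = 0
    for current in inputs:
        potential += current
        if potential >= threshold:  # assumption: fire on reaching the threshold
            spikes += 1
            if soft_reset:
                potential -= threshold  # keep the residual above threshold
            else:
                potential = 0.0  # hard reset discards the residual
    return spikes


# Eight steps of constant input 0.75 deliver a total input of 6.0.
inputs = [0.75] * 8
soft_spikes = count_spikes(inputs, soft_reset=True)
hard_spikes = count_spikes(inputs, soft_reset=False)
```

In this toy case the soft-reset neuron emits 6 spikes, matching the total input divided by the threshold, while the hard-reset neuron emits only 4 because the residual 0.5 is thrown away at each spike. That lost residual is the information loss the abstract attributes to hard reset.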

preprint2020arXiv

RxNN: A Framework for Evaluating Deep Neural Networks on Resistive Crossbars

Resistive crossbars designed with non-volatile memory devices have emerged as promising building blocks for Deep Neural Network (DNN) hardware, due to their ability to compactly and efficiently realize vector-matrix multiplication (VMM), the dominant computational kernel in DNNs. However, a key challenge with resistive crossbars is that they suffer from a range of device and circuit level non-idealities such as interconnect parasitics, peripheral circuits, sneak paths, and process variations. These non-idealities can lead to errors in VMMs, eventually degrading the DNN's accuracy. It is therefore critical to study the impact of crossbar non-idealities on the accuracy of large-scale DNNs. However, this is challenging because existing device and circuit models are too slow to use in application-level evaluations. We present RxNN, a fast and accurate simulation framework to evaluate large-scale DNNs on resistive crossbar systems. RxNN splits and maps the computations involved in each DNN layer into crossbar operations, and evaluates them using a Fast Crossbar Model (FCM) that accurately captures the errors arising due to crossbar non-idealities while being four-to-five orders of magnitude faster than circuit simulation. FCM models a crossbar-based VMM operation using three stages - non-linear models for the input and output peripheral circuits (DACs and ADCs), and an equivalent non-ideal conductance matrix for the core crossbar array. We implement RxNN by extending the Caffe machine learning framework and use it to evaluate a suite of six large-scale DNNs developed for the ImageNet Challenge. Our experiments reveal that resistive crossbar non-idealities can lead to significant accuracy degradations (9.6%-32%) for these large-scale DNNs. To the best of our knowledge, this work is the first quantitative evaluation of the accuracy of large-scale DNNs on resistive crossbar based hardware.

preprint2020arXiv

sBSNN: Stochastic-Bits Enabled Binary Spiking Neural Network with On-Chip Learning for Energy Efficient Neuromorphic Computing at the Edge

In this work, we propose stochastic Binary Spiking Neural Network (sBSNN) composed of stochastic spiking neurons and binary synapses (stochastic only during training) that computes probabilistically with one-bit precision for power-efficient and memory-compressed neuromorphic computing. We present an energy-efficient implementation of the proposed sBSNN using 'stochastic bit' as the core computational primitive to realize the stochastic neurons and synapses, which are fabricated in 90nm CMOS process, to achieve efficient on-chip training and inference for image recognition tasks. The measured data shows that the 'stochastic bit' can be programmed to mimic spiking neurons, and stochastic Spike Timing Dependent Plasticity (or sSTDP) rule for training the binary synaptic weights without expensive random number generators. Our results indicate that the proposed sBSNN realization offers possibility of up to 32x neuronal and synaptic memory compression compared to full precision (32-bit) SNN and energy efficiency of 89.49 TOPS/Watt for two-layer fully-connected SNN.

preprint2020arXiv

Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks

Event-based cameras display great potential for a variety of tasks such as high-speed motion detection and navigation in low-light environments where conventional frame-based cameras suffer critically. This is attributed to their high temporal resolution, high dynamic range, and low-power consumption. However, conventional computer vision methods as well as deep Analog Neural Networks (ANNs) are not suited to work well with the asynchronous and discrete nature of event camera outputs. Spiking Neural Networks (SNNs) serve as ideal paradigms to handle event camera outputs, but deep SNNs suffer in terms of performance due to the spike vanishing phenomenon. To overcome these issues, we present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs without sacrificing the performance. The network is end-to-end trained with self-supervised learning on Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Spike-FlowNet outperforms its corresponding ANN-based method in terms of the optical flow prediction capability while providing significant computational efficiency.

preprint2020arXiv

Towards Understanding the Effect of Leak in Spiking Neural Networks

Spiking Neural Networks (SNNs) are being explored to emulate the astounding capabilities of human brain that can learn and compute functions robustly and efficiently with noisy spiking activities. A variety of spiking neuron models have been proposed to resemble biological neuronal functionalities. With varying levels of bio-fidelity, these models often contain a leak path in their internal states, called membrane potentials. While the leaky models have been argued as more bioplausible, a comparative analysis between models with and without leak from a purely computational point of view demands attention. In this paper, we investigate the questions regarding the justification of leak and the pros and cons of using leaky behavior. Our experimental results reveal that leaky neuron model provides improved robustness and better generalization compared to models with no leak. However, leak decreases the sparsity of computation contrary to the common notion. Through a frequency domain analysis, we demonstrate the effect of leak in eliminating the high-frequency components from the input, thus enabling SNNs to be more robust against noisy spike-inputs.

preprint2020arXiv

Vec2Face: Unveil Human Faces from their Blackbox Features in Face Recognition

Unveiling face images of a subject given his/her high-level representations extracted from a blackbox Face Recognition engine is extremely challenging. This is because of the limited information accessible from that engine, including its structure and uninterpretable extracted features. This paper presents a novel generative structure with Bijective Metric Learning, namely Bijective Generative Adversarial Networks in a Distillation framework (DiBiGAN), for synthesizing faces of an identity given that person's features. In order to effectively address this problem, this work firstly introduces a bijective metric so that the distance measurement and metric learning process can be directly adopted in the image domain for an image reconstruction task. Secondly, a distillation process is introduced to maximize the information exploited from the blackbox face recognition engine. Then a Feature-Conditional Generator Structure with Exponential Weighting Strategy is presented for a more robust generator that can synthesize realistic faces with ID preservation. Results on several benchmarking datasets including CelebA, LFW, AgeDB, CFP-FP against matching engines have demonstrated the effectiveness of DiBiGAN on both image realism and ID preservation properties.

preprint2019arXiv

Constructing Energy-efficient Mixed-precision Neural Networks through Principal Component Analysis for Edge Intelligence

The 'Internet of Things' has brought increased demand for AI-based edge computing in applications ranging from healthcare monitoring systems to autonomous vehicles. Quantization is a powerful tool to address the growing computational cost of such applications, and yields significant compression over full-precision networks. However, quantization can result in substantial loss of performance for complex image classification tasks. To address this, we propose a Principal Component Analysis (PCA) driven methodology to identify the important layers of a binary network, and design mixed-precision networks. The proposed Hybrid-Net achieves a more than 10% improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on CIFAR-100 and ImageNet datasets while still achieving up to 94% of the energy-efficiency of XNOR-Nets. This work furthers the feasibility of using highly compressed neural networks for energy-efficient neural computing in edge devices.

preprint2019arXiv

Controlled Forgetting: Targeted Stimulation and Dopaminergic Plasticity Modulation for Unsupervised Lifelong Learning in Spiking Neural Networks

Stochastic gradient descent requires that training samples be drawn from a uniformly random distribution of the data. For a deployed system that must learn online from an uncontrolled and unknown environment, the ordering of input samples often fails to meet this criterion, making lifelong learning a difficult challenge. We exploit the locality of the unsupervised Spike Timing Dependent Plasticity (STDP) learning rule to target local representations in a Spiking Neural Network (SNN) to adapt to novel information while protecting essential information in the remainder of the SNN from catastrophic forgetting. In our Controlled Forgetting Networks (CFNs), novel information triggers stimulated firing and heterogeneously modulated plasticity, inspired by biological dopamine signals, to cause rapid and isolated adaptation in the synapses of neurons associated with outlier information. This targeting controls the forgetting process in a way that reduces the degradation of accuracy for older tasks while learning new tasks. Our experimental results on the MNIST dataset validate the capability of CFNs to learn successfully over time from an unknown, changing environment, achieving 95.36% accuracy, which we believe is the best unsupervised accuracy ever achieved by a fixed-size, single-layer SNN on a completely disjoint MNIST dataset.

preprint2019arXiv

Extracting many-body correlators of saturated gluons with precision from inclusive photon+dijet final states in deeply inelastic scattering

We highlight the principal results of a computation in the Color Glass Condensate effective field theory (CGC EFT) of the next-to-leading order (NLO) impact factor for inclusive photon+dijet production at Bjorken $x_{\rm Bj} \ll 1$ in deeply inelastic electron-nucleus (e+A DIS) collisions. When combined with extant results for next-to-leading log $x_{\rm Bj}$ JIMWLK renormalization group (RG) evolution of gauge invariant two-point ("dipole") and four-point ("quadrupole") correlators of light-like Wilson lines, the inclusive photon+dijet e+A DIS cross-section can be determined to $\sim 10$\% accuracy. Our computation simultaneously provides the ingredients to compute fully inclusive DIS, inclusive photon, inclusive dijet and inclusive photon+jet channels to the same accuracy. This makes feasible quantitative extraction of many-body correlators of saturated gluons and precise determination of the saturation scale $Q_{S,A}(x_{\rm Bj})$ at a future Electron-Ion Collider. An interesting feature of our NLO result is the structure of the violation of the soft gluon theorem in the Regge limit. Another is the appearance in gluon emission of time-like non-global logs which also satisfy JIMWLK RG evolution.

preprint2019arXiv

Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing

Deep convolutional neural network (DCNN) based supervised learning is a widely practiced approach for large-scale image classification. However, retraining these large networks to accommodate new, previously unseen data demands high computational time and energy requirements. Also, previously seen training samples may not be available at the time of retraining. We propose an efficient training methodology and an incrementally growing DCNN to learn new tasks while sharing part of the base network. Our proposed methodology is inspired by transfer learning techniques, although it does not forget previously learned tasks. An updated network for learning a new set of classes is formed using previously learned convolutional layers (shared from the initial part of the base network) with the addition of a few newly added convolutional kernels included in the later layers of the network. We employed a 'clone-and-branch' technique which allows the network to learn new tasks one after another without any performance loss in old tasks. We evaluated the proposed scheme on several recognition applications. The classification accuracy achieved by our approach is comparable to the regular incremental learning approach (where networks are updated with new training samples only, without any network sharing), while achieving energy efficiency and reductions in storage requirements, memory access and training time.

preprint2019arXiv

NLO impact factor for inclusive photon$+$dijet production in $e+A$ DIS at small $x$

We compute the next-to-leading order (NLO) impact factor for inclusive photon$+$dijet production in electron-nucleus (e+A) deeply inelastic scattering (DIS) at small $x$. An important ingredient in our computation is the simple structure of "shock wave" fermion and gluon propagators. This allows one to employ standard momentum space Feynman diagram techniques for higher order computations in the Regge limit of fixed $Q^2\gg Λ_{\rm QCD}^2$ and $x\rightarrow 0$. Our computations in the Color Glass Condensate (CGC) effective field theory include the resummation of all-twist power corrections $Q_s^2/Q^2$, where $Q_s$ is the saturation scale in the nucleus. We discuss the structure of ultraviolet, collinear and soft divergences in the CGC, and extract the leading logs in $x$; the structure of the corresponding rapidity divergences gives a nontrivial first principles derivation of the JIMWLK renormalization group evolution equation for multiparton lightlike Wilson line correlators. Explicit expressions are given for the $x$-independent $O(α_s)$ contributions that constitute the NLO impact factor. These results, combined with extant results on NLO JIMWLK evolution, provide the ingredients to compute the inclusive photon$+$dijet cross-section at small $x$ to $O(α_s^3 \ln(x))$. First results for the NLO impact factor in inclusive dijet production are recovered in the soft photon limit. A byproduct of our computation is the LO photon$+$3 jet (quark-antiquark-gluon) cross-section.

preprint2014arXiv

Design and Synthesis of Ultra Low Energy Spin-Memristor Threshold Logic

A threshold logic gate (TLG) performs a weighted sum of multiple inputs and compares the sum with a threshold. We propose Spin-Memristor Threshold Logic (SMTL) gates, which employ a memristive cross-bar array (MCA) to perform current-mode summation of binary inputs, whereas the low-voltage, fast-switching spintronic threshold devices (STD) carry out the threshold operation in an energy-efficient manner. Field programmable SMTL gate arrays can operate at a small terminal voltage of ~50mV, resulting in ultra-low power consumption in gates as well as programmable interconnect networks. We evaluate the performance of SMTL using threshold logic synthesis. Results for common benchmarks show that SMTL based programmable logic hardware can be more than 100x more energy efficient than state-of-the-art CMOS FPGA.
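
The threshold logic operation defined in the first sentence of this abstract can be written in a few lines. The sketch below is a purely functional model with hypothetical names; it says nothing about the memristive or spintronic hardware, and the choice of "greater than or equal" for the comparison is an assumption, since the abstract only states that the sum is compared with a threshold.

```python
def threshold_gate(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def majority3(a, b, c):
    """3-input majority: unit weights, threshold 2."""
    return threshold_gate([a, b, c], [1, 1, 1], 2)


def and2(a, b):
    """2-input AND: unit weights, threshold 2."""
    return threshold_gate([a, b], [1, 1], 2)


def or2(a, b):
    """2-input OR: unit weights, threshold 1."""
    return threshold_gate([a, b], [1, 1], 1)
```

Changing only the weights and the threshold turns the same gate into AND, OR, or majority, which is why a single programmable threshold element can stand in for several conventional logic gates.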

preprint2014arXiv

Laser Induced Magnetization Reversal for Detection in Optical Interconnects

Optical interconnect has emerged as the front-runner to replace electrical interconnect, especially for off-chip communication. However, a major drawback with optical interconnects is the need for photodetectors and amplifiers at the receiver, implemented usually by direct bandgap semiconductors and analog CMOS circuits, leading to large energy consumption and slow operating time. In this article, we propose a new optical interconnect architecture that uses a magnetic tunnel junction (MTJ) at the receiver side that is switched by femtosecond laser pulses. The state of the MTJ can be sensed using simple digital CMOS latches, resulting in significant improvement in energy consumption. Moreover, magnetization in the MTJ can be switched on the picosecond time-scale and our design can operate at a speed of 5 Gbits/sec for a single link.

preprint2013arXiv

Boolean and Non-Boolean Computation With Spin Devices

Recently, several device and circuit design techniques have been explored for applying nano-magnets and spin torque devices like spin valves and domain wall magnets in computational hardware. However, most of them have been focused on digital logic, and their benefits over robust and high performance CMOS remain debatable. Ultra-low voltage, current-switching operation of magneto-metallic spin torque devices can potentially be more suitable for non-Boolean computation schemes that can exploit current-mode analog processing. Device circuit co-design for different classes of non-Boolean architectures using spin-torque based neuron models in spin-CMOS hybrid circuits shows that the spin-based non-Boolean designs can achieve 15X-100X lower computation energy for applications like image-processing, data-conversion, cognitive-computing, pattern matching and programmable-logic, as compared to state-of-the-art CMOS designs.

preprint2013arXiv

DSTT-MRAM: Differential Spin Hall MRAM for On-chip Memories

A new device structure for spin transfer torque based magnetic random access memory is proposed for on-chip memory applications. Our device structure exploits spin Hall effect to create a differential memory cell that exhibits fast and energy-efficient write operation. Moreover, due to inherently differential device structure, fast and reliable read operation can be performed. Our simulation study shows 10X improvement in write energy over the standard 1T1R STT-MRAM memory cell, and 1.6X faster read operation compared to single-ended sensing (as in standard 1T1R STT-MRAMs). The bit-cell characteristics are promising for high performance on-chip memory applications.

preprint2013arXiv

Energy-Efficient and Robust Associative Computing with Electrically Coupled Dual Pillar Spin-Torque Oscillators

Dynamics of coupled spin-torque oscillators (STOs) can be exploited for non-Boolean information processing. However, the feasibility of coupling a large number of STOs with energy-efficiency and sufficient robustness towards parameter-variation and thermal-noise may be critical for such computing applications. In this work, the impacts of parameter-variation and thermal-noise on two different coupling mechanisms for STOs, namely, magnetic-coupling and electrical-coupling, are analyzed. Magnetic coupling is simulated using dipolar-field interactions. For electrical-coupling we employed global RF-injection. In this method, multiple STOs are phase-locked to a common RF-signal that is injected into the STOs along with the DC bias. Results for variation and noise analysis indicate that electrical-coupling can be significantly more robust as compared to magnetic-coupling. For room-temperature simulations, appreciable phase-lock was retained among tens of electrically coupled STOs for up to 20% 3σ random variations in critical device parameters. The magnetic-coupling technique however failed to retain locking beyond ~3% 3σ parameter-variations, even for small-size STO clusters with near-neighborhood connectivity. We propose and analyze Dual-Pillar STO (DP-STO) for low-power computing using the proposed electrical coupling method. We observed that DP-STO can better exploit the electrical-coupling technique due to separation between the biasing RF signal and its own RF output.

preprint2013arXiv

Exploring Boolean and Non-Boolean Computing Applications of Spin Torque Devices

In this paper we discuss the potential of emerging spin-torque devices for computing applications. Recent proposals for spin-based computing schemes may be differentiated as all-spin vs. hybrid, programmable vs. fixed, and Boolean vs. non-Boolean. All-spin logic-styles may offer high area-density due to the small form-factor of nano-magnetic devices. However, circuit and system-level design techniques need to be explored that leverage the specific spin-device characteristics to achieve energy-efficiency, performance and reliability comparable to those of CMOS. The non-volatility of nanomagnets can be exploited in the design of energy and area-efficient programmable logic. In such logic-styles, spin-devices may play the dual-role of computing as well as memory-elements that provide field-programmability. Spin-based threshold logic design is presented as an example (dynamic resistive threshold logic and magnetic threshold logic). Emerging spintronic phenomena may lead to ultra-low-voltage, current-mode, spin-torque switches that can offer attractive computing capabilities, beyond digital switches. Such devices may be suitable for non-Boolean data-processing applications which involve analog processing. Integration of such spin-torque devices with charge-based devices like CMOS and resistive memory can lead to highly energy-efficient information processing hardware for applications like pattern-matching, neuromorphic-computing, image-processing and data-conversion. Towards the end, we discuss the possibility of applying emerging spin-torque switches in the design of energy-efficient global interconnects for future chip multiprocessors.

preprint2013arXiv

Exploring Ultra Low-Power on-Chip Clocking Using Functionality Enhanced Spin-Torque Switches

Emerging spin-torque (ST) phenomena may lead to ultra-low-voltage, high-speed nano-magnetic switches. Such current-based switches can be attractive for designing low-swing global-interconnects, such as clocking-networks and data buses. In this work we present the basic idea of using such ST-switches for low-power on-chip clocking. For clocking-networks, the Spin-Hall-Effect (SHE) can be used to produce an assist-field for fast ST-switching using a global-mesh-clock with less than 100mV swing. The ST-switch acts as a compact latch, written by ultra-low-voltage input-pulses. The data is read using a high-resistance tunnel-junction. The clock-driven SHE write-assist can be shared among a large number of ST-latches, thereby reducing the load-capacitance for clock-distribution. The SHE assist can be activated by a low-swing clock (~150mV) and hence can facilitate ultra-low-voltage clock-distribution. Owing to reduced clock-load and low-voltage operation, the proposed scheme can achieve 97% lower power for on-chip clocking as compared to the state-of-the-art CMOS design. Rigorous device-circuit simulations and system-level modelling for the proposed scheme will be addressed in future work.

preprint2013arXiv

Spintronic Switches for Ultra Low Energy On-Chip and Inter-Chip Current-Mode Interconnects

Energy-efficiency and design-complexity of high-speed on-chip and inter-chip data-interconnects have emerged as the major bottleneck for high-performance computing-systems. As a solution, we propose an ultra-low energy interconnect design-scheme using nano-scale spin-torque switches. In the proposed method, data is transmitted in the form of current-pulses, with amplitude of the order of a few micro-amperes, that flow across a small terminal-voltage of less than 50mV. Sub-nanosecond spin-torque switching of scaled nano-magnets can be used to receive and convert such a high-speed current-mode signal into binary voltage-levels using a magnetic-tunnel-junction (MTJ), with the help of a simple CMOS inverter. As a result of low-voltage, low-current signaling and minimal signal-conversion overhead, the proposed technique can facilitate highly compact and simplified designs for multi-gigahertz inter-chip and on-chip data-communication links. Such links can achieve more than ~100x higher energy-efficiency, as compared to state-of-the-art CMOS interconnects.

preprint2013arXiv

Ultra Low Power Associative Computing with Spin Neurons and Resistive Crossbar Memory

Emerging resistive-crossbar memory (RCM) technology can be promising for computationally-expensive analog pattern-matching tasks. However, the use of CMOS analog-circuits with RCM would result in large power-consumption and poor scalability, thereby negating the benefits of RCM-based computation. We propose the use of low-voltage, fast-switching, magneto-metallic spin-neurons for ultra low-power non-Boolean computing with RCM. We present the design of an analog associative memory for face recognition using RCM, where substituting conventional analog circuits with spin-neurons can achieve ~100x lower power. This makes the proposed design ~1000x more energy-efficient than a 45nm-CMOS digital ASIC, thereby significantly enhancing the prospects of RCM-based computational hardware.

preprint2013arXiv

Ultra-High Density, High-Performance and Energy-Efficient All Spin Logic

All Spin Logic (ASL) gates employ multiple nano-magnets interacting through spin-torque using non-magnetic channels. Compactness, non-volatility and ultra-low voltage operation are some of the attractive features of ASL, while low switching-speed (of nano-magnets as compared to CMOS gates) and static-power dissipation can be identified as the major bottlenecks. In this work we explore design techniques that leverage the specific device characteristics of ASL to overcome the inefficiencies and to enhance the merits of this technology, for a given set of device parameters. We exploit the non-volatility of nano-magnets to model fully-pipelined ASL that can achieve higher performance. Clocking of the power supply in pipelined ASL would require CMOS transistors that may consume significantly more voltage headroom and area than the nano-magnets. We show that the use of leaky transistors can significantly mitigate such bottlenecks, without sacrificing energy-efficiency and robustness. Exploiting the inherent isolation between the biasing charge current and spin-current paths in ASL, we propose to stack multiple ASL metal layers, leading to ultra-high-density and energy-efficient 3-D computation blocks. Results for the design of an FIR filter show that ASL can achieve performance and power consumption comparable to CMOS, while the ultra-high density of ASL can be projected as its main advantage over CMOS.

preprint2013arXiv

Ultra-low Energy, High Performance and Programmable Magnetic Threshold Logic

We propose a magnetic threshold-logic (MTL) design based on non-volatile spin-torque switches. A threshold logic gate (TLG) performs summation of multiple inputs multiplied by a fixed set of weights and compares the sum with a threshold. MTL employs resistive states of magnetic tunnel junctions as programmable input weights, while a low-voltage domain-wall shift based spin-torque switch is used for the thresholding operation. The resulting MTL gate acts as a low-power, configurable logic unit and can be used to build fully pipelined, high-performance programmable computing blocks. Multiple stages in such an MTL design can be connected using energy-efficient ultra-low swing programmable interconnect networks based on resistive switches. Owing to memory-based compact logic and interconnect design and low-voltage, high-speed spin-torque based threshold operation, MTL can achieve more than two orders of magnitude improvement in energy-delay product as compared to look-up table based CMOS FPGA.
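
The threshold logic gate described in this abstract (a weighted sum of inputs compared with a threshold) can be illustrated with a minimal software sketch. This is only a behavioral model and not the authors' device or circuit implementation; the function name `threshold_gate`, the integer weights and the `>=` comparison convention are assumptions made for illustration.

```python
from typing import Sequence


def threshold_gate(inputs: Sequence[int],
                   weights: Sequence[float],
                   threshold: float) -> int:
    """Behavioral model of a threshold logic gate (hypothetical helper).

    Returns 1 when the weighted sum of the binary inputs reaches the
    threshold, otherwise 0. The '>=' convention is an assumption.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


# Illustrative configuration: a 3-input majority function uses unit
# weights and a threshold of 2.
MAJORITY_WEIGHTS = (1, 1, 1)
MAJORITY_THRESHOLD = 2

majority_of_110 = threshold_gate((1, 1, 0), MAJORITY_WEIGHTS, MAJORITY_THRESHOLD)
majority_of_100 = threshold_gate((1, 0, 0), MAJORITY_WEIGHTS, MAJORITY_THRESHOLD)
```

Changing only the weights and the threshold turns the same gate into other functions (for example, unit weights with a threshold of 3 give a 3-input AND), which is the sense in which such a gate is configurable.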

preprint2013arXiv

Ultra-low Energy, High-Performance Dynamic Resistive Threshold Logic

We propose a dynamic resistive threshold-logic (DRTL) design based on non-volatile resistive memory. A threshold logic gate (TLG) performs summation of multiple inputs multiplied by a fixed set of weights and compares the sum with a threshold. DRTL employs resistive memory elements to implement the weights and the thresholds, while a compact dynamic CMOS latch is used for the comparison operation. The resulting DRTL gate acts as a low-power, configurable dynamic logic unit and can be used to build fully pipelined, high-performance programmable computing blocks. Multiple stages in such a DRTL design can be connected using energy-efficient low-swing programmable interconnect networks based on resistive switches. Owing to memory-based compact logic and interconnect design and high-speed dynamic-pipelined operation, DRTL can achieve more than two orders of magnitude improvement in energy-delay product as compared to look-up table based CMOS FPGA.

preprint2012arXiv

Proposal For Neuromorphic Hardware Using Spin Devices

We present a design-scheme for ultra-low power neuromorphic hardware using emerging spin-devices. We propose device models for a 'neuron', based on lateral spin valves and domain wall magnets, that can operate at an ultra-low terminal voltage of ~20 mV, resulting in small computation energy. Magnetic tunnel junctions are employed for interfacing the spin-neurons with charge-based devices like CMOS, for large-scale networks. A device-circuit co-simulation framework is used for simulating such hybrid designs, in order to evaluate system-level performance. We present the design of different classes of neuromorphic architectures using the proposed scheme that can be suitable for different applications like analog-data-sensing, data-conversion, cognitive-computing, associative memory, programmable-logic and analog and digital signal processing. We show that the spin-based neuromorphic designs can achieve 15X-300X lower computation energy for these applications, as compared to state-of-the-art CMOS designs.

preprint2012arXiv

Spin-Based Neuron Model with Domain Wall Magnets as Synapse

We present an artificial neural network design using spin devices that achieves ultra low voltage operation, low power consumption, high speed, and high integration density. We employ spin torque switched nano-magnets for modelling neurons and domain wall magnets for compact, programmable synapses. The spin based neuron-synapse units operate locally at an ultra low supply voltage of 30mV, resulting in low computation power. CMOS based inter-neuron communication is employed to realize network-level functionality. We corroborate circuit operation with physics based models developed for the spin devices. Simulation results for character recognition as a benchmark application show 95% lower power consumption as compared to a 45nm CMOS design.

preprint2011arXiv

Thermoelectric Spin-Transfer Torque MRAM with Sub-Nanosecond Bi-Directional Writing using Magnonic Current

A new genre of Spin-Transfer Torque (STT) MRAM is proposed, in which bi-directional writing is achieved using thermoelectrically controlled magnonic current as an alternative to conventional electric current. The device uses a magnetic tunnel junction (MTJ), which is adjacent to a non-magnetic metallic film and a ferrite film. This film stack is heated or cooled by a Peltier element, which creates a bi-directional magnonic pulse in the ferrite film. Conversion of magnons to spin current occurs at the ferrite-metal interface, and the resulting spin-transfer torque is used to achieve sub-nanosecond precessional switching of the ferromagnetic free layer in the MTJ. Compared to electric current driven STT-MRAM with perpendicular magnetic anisotropy (PMA), thermoelectric STT-MRAM reduces the overall magnetization switching energy by more than 40% for nano-second switching, combined with a write error rate (WER) of less than $10^{-9}$ and a lifetime of 10 years or higher. The combination of higher thermal activation energy, sub-nanosecond read/write speed, improved tunneling magneto-resistance (TMR) and tunnel barrier reliability makes thermoelectric STT-MRAM a promising choice for future non-volatile memory applications.

preprint2010arXiv

Improved current saturation and shifted switching threshold voltage in In2O3 nanowire based, fully transparent NMOS inverters via femtosecond laser annealing

Transistors based on various types of non-silicon nanowires have shown great potential for a variety of applications, especially for those that require transparency and low-temperature substrates. However, critical requirements for circuit functionality, such as saturated source-drain current and matched threshold voltages of individual nanowire transistors, have not been achieved in a way that is compatible with low-temperature substrates. Here we show that femtosecond laser pulses can anneal individual transistors based on In2O3 nanowires, improve the saturation of the source-drain current, and permanently shift the threshold voltage in the positive direction. We applied this technique and successfully shifted the switching threshold voltages of NMOS based inverters and improved their noise margin, in both depletion and enhancement modes. Our demonstration provides a method to trim the parameters of individual nanowire transistors, and suggests potential for large-scale integration of nanowire-based circuit blocks and systems.

preprint2007arXiv

Modeling and Analysis of Loading Effect in Leakage of Nano-Scaled Bulk-CMOS Logic Circuits

In nanometer-scaled CMOS devices, a significant increase in the subthreshold, the gate and the reverse-biased junction band-to-band-tunneling (BTBT) leakage results in a large increase in the total leakage power of a logic circuit. Leakage components interact with each other at the device level (through device geometry and doping profile) and also at the circuit level (through node voltages). Due to the circuit-level interaction of the different leakage components, the leakage of a logic gate strongly depends on the circuit topology, i.e. the number and nature of the other logic gates connected to its input and output. In this paper, for the first time, we have analyzed the loading effect on leakage and proposed a method to accurately estimate the total leakage in a logic circuit from its logic-level description, considering the impact of loading and transistor stacking.

preprint2007arXiv

Statistical Modeling of Pipeline Delay and Design of Pipeline under Process Variation to Enhance Yield in sub-100nm Technologies

The operating frequency of a pipelined circuit is determined by the delay of the slowest pipeline stage. However, under statistical delay variation in the sub-100nm technology regime, the slowest stage is not readily identifiable and the estimation of the pipeline yield with respect to a target delay is a challenging problem. We have proposed analytical models to estimate yield for a pipelined design based on delay distributions of individual pipe stages. Using the proposed models, we have shown that a change in logic depth and imbalance between the stage delays can improve the yield of a pipeline. A statistical methodology has been developed to optimally design a pipeline circuit for enhancing yield. Optimization results show that proper imbalance among the stage delays in a pipeline improves design yield by 9% for the same area and performance (and reduces area by about 8.4% under a yield constraint) over a balanced design.
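
Because the pipeline delay is set by the slowest stage, the yield with respect to a target delay is the probability that every stage meets that target. The sketch below illustrates this under two simplifying assumptions that are not taken from the abstract: stage delays are independent, and each is Gaussian. The paper's own analytical models may differ; the function names `normal_cdf` and `pipeline_yield` are hypothetical.

```python
import math
from typing import Sequence


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a Gaussian random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def pipeline_yield(stage_means: Sequence[float],
                   stage_stds: Sequence[float],
                   target_delay: float) -> float:
    """Probability that the slowest stage meets the target delay.

    Assumes independent Gaussian stage delays (an illustrative
    assumption), so P(max_i D_i <= T) is the product of the
    per-stage probabilities P(D_i <= T).
    """
    if len(stage_means) != len(stage_stds):
        raise ValueError("stage_means and stage_stds must have the same length")
    yield_estimate = 1.0
    for mean, std in zip(stage_means, stage_stds):
        yield_estimate *= normal_cdf(target_delay, mean, std)
    return yield_estimate


# Illustrative numbers only: three stages with mean delay 1.0 and
# standard deviation 0.05 (arbitrary units), target delay 1.1.
example_yield = pipeline_yield([1.0, 1.0, 1.0], [0.05, 0.05, 0.05], 1.1)
```

With per-stage means and standard deviations as inputs, a function of this kind can be used to compare candidate stage-delay allocations against the same target delay.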