Researcher profile

Massoud Pedram

Massoud Pedram contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
19works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

19 published item(s)

preprint2022arXiv

A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness

Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency which can prove costly for resource-limited or real-time applications. In this paper, we present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, we add configurable scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance. Extensive experiments show that FLOAT can yield SOTA performance improving both clean and perturbed image classification by up to ~6% and ~10%, respectively. Moreover, real hardware measurement shows that FLOAT can reduce the training time by up to 1.43x with fewer model parameters of up to 1.47x on iso-hyperparameter settings compared to the FiLM-based alternatives. Additionally, to further improve memory efficiency we introduce FLOAT sparse (FLOATS), a form of non-iterative model pruning and provide detailed empirical analysis to provide a three way accuracy-robustness-complexity trade-off for these new class of pruned conditionally trained models.

preprint2022arXiv

A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks

In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique results in elimination of a large number of unnecessary computations in extracting the correct answer. In addition, to further lower computations in A2P-MANN, we suggest pruning weights of the final FC (fully-connected) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed by using the twenty question-answering (QA) tasks of bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on the CPU and GPU platforms, up to 43% runtime reduction is achieved.

preprint2022arXiv

AMR-MUL: An Approximate Maximally Redundant Signed Digit Multiplier

In this paper, we present an energy-efficient, yet high-speed approximate maximally redundant signed digit (MRSD) multiplier (called AMR-MUL) based on a parallel structure. For the reduction stage, we suggest several approximate Full-Adder (FA) reduction cells with average positive and negative errors obtained by simplifying the structure of an exact FA cell. The optimum selection of these cells for each partial product reduction stage provides the lowest possible error, turning this task into a design space exploration problem. We also provide a branch-and-bound design space exploration algorithm to find the optimal assignment of reduction cells based on a predefined constraint (i.e., the width of the approximate part) by the user. The effectiveness of the proposed (Radix-16) multiplier design is assessed under different digit counts and approximate border column. The results show that the energy consumption of the MRSD multiplier is reduced by 7x at the cost of a 1.6% accuracy loss.

preprint2022arXiv

DigiQ: A Scalable Digital Controller for Quantum Computers Using SFQ Logic

The control of cryogenic qubits in today's superconducting quantum computer prototypes presents significant scalability challenges due to the massive costs of generating/routing the analog control signals that need to be sent from a classical controller at room temperature to the quantum chip inside the dilution refrigerator. Thus, researchers in industry and academia have focused on designing in-fridge classical controllers in order to mitigate these challenges. Superconducting Single Flux Quantum (SFQ) is a classical logic family proposed for large-scale in-fridge controllers. SFQ logic has the potential to maximize scalability thanks to its ultra-high speed and very low power consumption. However, architecture design for SFQ logic poses challenges due to its unconventional pulse-driven nature and lack of dense memory and logic. Thus, research at the architecture level is essential to guide architects to design SFQ-based classical controllers for large-scale quantum machines. In this paper, we present DigiQ, the first system-level design of a Noisy Intermediate Scale Quantum (NISQ)-friendly SFQ-based classical controller. We perform a design space exploration of SFQ-based controllers and co-design the quantum gate decompositions and SFQ-based implementation of those decompositions to find an optimal SFQ-friendly design point that trades area and power for latency and control while ensuring good quantum algorithmic performance. Our co-design results in a single instruction, multiple data (SIMD) controller architecture, which has high scalability (>42,000-qubit scales), but imposes new challenges on the calibration of control pulses. We present software-level solutions to address these challenges, which if unaddressed would degrade quantum circuit fidelity given the imperfections of qubit hardware.

preprint2022arXiv

Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis

Recent efforts for improving the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on Field-programmable gate arrays (FPGAs) needs a novel framework considering the structure and the reconfigurability of DSP blocks during this process. The proposed methodology in this paper maps the fixed function combinational logic blocks to a set of Boolean functions where Boolean operations corresponding to each function are mapped to DSP devices rather than look-up tables (LUTs) on the FPGAs to take advantage of the high performance, low latency, and parallelism of DSP blocks. % This paper also presents an innovative design and optimization methodology for compilation and mapping of NNs, utilizing fixed function combinational logic to DSPs on FPGAs employing high-level synthesis flow. % Our experimental evaluations across several \REVone{datasets} and selected NNs demonstrate the comparable performance of our framework in terms of the inference latency and output accuracy compared to prior art FPGA-based NN accelerators employing DSPs.

preprint2022arXiv

Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators

This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49x energy efficiency gain while lowering storage requirements by 3.67x for total weight storage (non-pruned weights plus indexing) and 22,044x for indexing memory.

preprint2021arXiv

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low-power and high-speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed under some benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, e.g., compared to a recently published work in this field, the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.

preprint2020arXiv

CSM-NN: Current Source Model Based Logic Circuit Simulation -- A Neural Network Approach

The miniaturization of transistors down to 5nm and beyond, plus the increasing complexity of integrated circuits, significantly aggravate short channel effects, and demand analysis and optimization of more design corners and modes. Simulators need to model output variables related to circuit timing, power, noise, etc., which exhibit nonlinear behavior. The existing simulation and sign-off tools, based on a combination of closed-form expressions and lookup tables are either inaccurate or slow, when dealing with circuits with more than billions of transistors. In this work, we present CSM-NN, a scalable simulation framework with optimized neural network structures and processing algorithms. CSM-NN is aimed at optimizing the simulation time by accounting for the latency of the required memory query and computation, given the underlying CPU and GPU parallel processing capabilities. Experimental results show that CSM-NN reduces the simulation time by up to $6\times$ compared to a state-of-the-art current source model based simulator running on a CPU. This speedup improves by up to $15\times$ when running on a GPU. CSM-NN also provides high accuracy levels, with less than $2\%$ error, compared to HSPICE.

preprint2020arXiv

Deep-PowerX: A Deep Learning-Based Framework for Low-Power Approximate Logic Synthesis

This paper aims at integrating three powerful techniques namely Deep Learning, Approximate Computing, and Low Power Design into a strategy to optimize logic at the synthesis level. We utilize advances in deep learning to guide an approximate logic synthesis engine to minimize the dynamic power consumption of a given digital CMOS circuit, subject to a predetermined error rate at the primary outputs. Our framework, Deep-PowerX, focuses on replacing or removing gates on a technology-mapped network and uses a Deep Neural Network (DNN) to predict error rates at primary outputs of the circuit when a specific part of the netlist is approximated. The primary goal of Deep-PowerX is to reduce the dynamic power whereas area reduction serves as a secondary objective. Using the said DNN, Deep-PowerX is able to reduce the exponential time complexity of standard approximate logic synthesis to linear time. Experiments are done on numerous open source benchmark circuits. Results show significant reduction in power and area by up to 1.47 times and 1.43 times compared to exact solutions and by up to 22% and 27% compared to state-of-the-art approximate logic synthesis tools while having orders of magnitudes lower run-time.

preprint2020arXiv

Efficient Training of Deep Convolutional Neural Networks by Augmentation in Embedding Space

Recent advances in the field of artificial intelligence have been made possible by deep neural networks. In applications where data are scarce, transfer learning and data augmentation techniques are commonly used to improve the generalization of deep learning models. However, fine-tuning a transfer model with data augmentation in the raw input space has a high computational cost to run the full network for every augmented input. This is particularly critical when large models are implemented on embedded devices with limited computational and energy resources. In this work, we propose a method that replaces the augmentation in the raw input space with an approximate one that acts purely in the embedding space. Our experimental results show that the proposed method drastically reduces the computation, while the accuracy of models is negligibly compromised.

preprint2020arXiv

HIPE-MAGIC: A Technology-Aware Synthesis and Mapping Flow for HIghly Parallel Execution of Memristor-Aided LoGIC

Recent efforts for finding novel computing paradigms that meet today's design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of the memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during the logic synthesis, mainly targeting benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework to maximize the performance and area-efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as the conventional CPU computing.

preprint2020arXiv

JointDNN: An Efficient Training and Inference Engine for Intelligent Mobile Cloud Computing Services

Deep learning models are being deployed in many mobile intelligent applications. End-side services, such as intelligent personal assistants, autonomous cars, and smart home services often employ either simple local models on the mobile or complex remote models on the cloud. However, recent studies have shown that partitioning the DNN computations between the mobile and cloud can increase the latency and energy efficiencies. In this paper, we propose an efficient, adaptive, and practical engine, JointDNN, for collaborative computation between a mobile device and cloud for DNNs in both inference and training phase. JointDNN not only provides an energy and performance efficient method of querying DNNs for the mobile side but also benefits the cloud server by reducing the amount of its workload and communications compared to the cloud-only approach. Given the DNN architecture, we investigate the efficiency of processing some layers on the mobile device and some layers on the cloud server. We provide optimization formulations at layer granularity for forward- and backward-propagations in DNNs, which can adapt to mobile battery limitations and cloud server load constraints and quality of service. JointDNN achieves up to 18 and 32 times reductions on the latency and mobile energy consumption of querying DNNs compared to the status-quo approaches, respectively.

preprint2020arXiv

Logic Verification of Ultra-Deep Pipelined Beyond-CMOS Technologies

Traditional logical equivalence checking (LEC) which plays a major role in entire chip design process faces challenges of meeting the requirements demanded by the many emerging technologies that are based on logic models different from standard complementary metal oxide semiconductor (CMOS). In this paper, we propose a LEC framework to be employed in the verification process of beyond-CMOS circuits. Our LEC framework is compatible with existing CMOS technologies, but, also able to check features and capabilities that are unique to beyond-CMOS technologies. For instance, the performance of some emerging technologies benefits from ultra-deep pipelining and verification of such circuits requires new models and algorithms. We, therefore, present the Multi-Cycle Input Dependency (MCID) circuit model which is a novel model representation of design to explicitly capture the dependency of primary outputs of the circuit on sequences of internal signals and inputs. Embedding the proposed circuit model and several structural checking modules, the process of verification can be independent of the underlying technology and signaling. We benchmark the proposed framework on post-synthesis rapid single-flux-quantum (RSFQ) netlists. Results show a comparative verification time of RSFQ circuit benchmark including 32-bit Kogge-Stone adder, 16-bit integer divider, and ISCAS'85 circuits with respect to ABC tool for similar CMOS circuits.

preprint2020arXiv

NISQ+: Boosting quantum computing power by approximating quantum error correction

Quantum computers are growing in size, and design decisions are being made now that attempt to squeeze more computation out of these machines. In this spirit, we design a method to boost the computational power of near-term quantum computers by adapting protocols used in quantum error correction to implement "Approximate Quantum Error Correction (AQEC)." By approximating fully-fledged error correction mechanisms, we can increase the compute volume (qubits $\times$ gates, or "Simple Quantum Volume (SQV)") of near-term machines. The crux of our design is a fast hardware decoder that can approximately decode detected error syndromes rapidly. Specifically, we demonstrate a proof-of-concept that approximate error decoding can be accomplished online in near-term quantum systems by designing and implementing a novel algorithm in Single-Flux Quantum (SFQ) superconducting logic technology. This avoids a critical decoding backlog, hidden in all offline decoding schemes, that leads to idle time exponential in the number of T gates in a program. Our design utilizes one SFQ processing module per physical qubit. Employing state-of-the-art SFQ synthesis tools, we show that the circuit area, power, and latency are within the constraints of contemporary quantum system designs. Under pure dephasing error models, the proposed accelerator and AQEC solution is able to expand SQV by factors between 3,402 and 11,163 on expected near-term machines. The decoder achieves a $5\%$ accuracy-threshold and pseudo-thresholds of $\sim$ $5\%, 4.75\%, 4.5\%,$ and $3.5\%$ physical error-rates for code distances $3, 5, 7,$ and $9$. Decoding solutions are achieved in a maximum of $\sim 20$ nanoseconds on the largest code distances studied. By avoiding the exponential idle time in offline decoders, we achieve a $10$x reduction in required code distances to achieve the same logical performance as alternative designs.

preprint2020arXiv

NN-PARS: A Parallelized Neural Network Based Circuit Simulation Framework

The shrinking of transistor geometries as well as the increasing complexity of integrated circuits, significantly aggravate nonlinear design behavior. This demands accurate and fast circuit simulation to meet the design quality and time-to-market constraints. The existing circuit simulators which utilize lookup tables and/or closed-form expressions are either slow or inaccurate in analyzing the nonlinear behavior of designs with billions of transistors. To address these shortcomings, we present NN-PARS, a neural network (NN) based and parallelized circuit simulation framework with optimized event-driven scheduling of simulation tasks to maximize concurrency, according to the underlying GPU parallel processing capabilities. NN-PARS replaces the required memory queries in traditional techniques with parallelized NN-based computation tasks. Experimental results show that compared to a state-of-the-art current-based simulation method, NN-PARS reduces the simulation time by over two orders of magnitude in large circuits. NN-PARS also provides high accuracy levels in signal waveform calculations, with less than $2\%$ error compared to HSPICE.

preprint2020arXiv

Pre-defined Sparsity for Low-Complexity Convolutional Neural Networks

The high energy cost of processing deep convolutional neural networks impedes their ubiquitous deployment in energy-constrained platforms such as embedded systems and IoT devices. This work introduces convolutional layers with pre-defined sparse 2D kernels that have support sets that repeat periodically within and across filters. Due to the efficient storage of our periodic sparse kernels, the parameter savings can translate into considerable improvements in energy efficiency due to reduced DRAM accesses, thus promising significant improvements in the trade-off between energy consumption and accuracy for both training and inference. To evaluate this approach, we performed experiments with two widely accepted datasets, CIFAR-10 and Tiny ImageNet in sparse variants of the ResNet18 and VGG16 architectures. Compared to baseline models, our proposed sparse variants require up to 82% fewer model parameters with 5.6times fewer FLOPs with negligible loss in accuracy for ResNet18 on CIFAR-10. For VGG16 trained on Tiny ImageNet, our approach requires 5.8times fewer FLOPs and up to 83.3% fewer model parameters with a drop in top-5 (top-1) accuracy of only 1.2% (2.1%). We also compared the performance of our proposed architectures with that of ShuffleNet andMobileNetV2. Using similar hyperparameters and FLOPs, our ResNet18 variants yield an average accuracy improvement of 2.8%.

preprint2020arXiv

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz.However, every SFQ gate is clocked creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by re-designing the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side which can be feedback earlier for use in subsequent data-dependent operations increasing their throughput. In particular,we propose to group the bits into 4-bit blocks that are operatedon concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we developa block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs respectively.

preprint2020arXiv

Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference

We propose a learning algorithm to design a light-weight neural multiplexer that given the input and computational resource requirements, calls the model that will consume the minimum compute resources for a successful inference. Mobile devices can use the proposed algorithm to offload the hard inputs to the cloud while inferring the easy ones locally. Besides, in the large scale cloud-based intelligent applications, instead of replicating the most-accurate model, a range of small and large models can be multiplexed from depending on the input's complexity which will save the cloud's computational resources. The input complexity or hardness is determined by the number of models that can predict the correct label. For example, if no model can predict the label correctly, then the input is considered as the hardest. The proposed algorithm allows the mobile device to detect the inputs that can be processed locally and the ones that require a larger model and should be sent a cloud server. Therefore, the mobile user benefits from not only the local processing but also from an accurate model hosted on a cloud server. Our experimental results show that the proposed algorithm improves mobile's model accuracy by 8.52% which is because of those inputs that are properly selected and offloaded to the cloud server. In addition, it saves the cloud providers' compute resources by a factor of 2.85x as small models are chosen for easier inputs.

preprint2020arXiv

SynergicLearning: Neural Network-Based Feature Extraction for Highly-Accurate Hyperdimensional Learning

Machine learning models differ in terms of accuracy, computational/memory complexity, training time, and adaptability among other characteristics. For example, neural networks (NNs) are well-known for their high accuracy due to the quality of their automatic feature extraction while brain-inspired hyperdimensional (HD) learning models are famous for their quick training, computational efficiency, and adaptability. This work presents a hybrid, synergic machine learning model that excels at all the said characteristics and is suitable for incremental, on-line learning on a chip. The proposed model comprises an NN and a classifier. The NN acts as a feature extractor and is specifically trained to work well with the classifier that employs the HD computing framework. This work also presents a parameterized hardware implementation of the said feature extraction and classification components while introducing a compiler that maps any arbitrary NN and/or classifier to the aforementioned hardware. The proposed hybrid machine learning model has the same level of accuracy (i.e. $\pm$1%) as NNs while achieving at least 10% improvement in accuracy compared to HD learning models. Additionally, the end-to-end hardware realization of the hybrid model improves power efficiency by 1.60x compared to state-of-the-art, high-performance HD learning implementations while improving latency by 2.13x. These results have profound implications for the application of such synergic models in challenging cognitive tasks.