Source author record

Mehdi Kamal

Mehdi Kamal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Hardware Architecture Machine Learning Computation and Language Computational Complexity

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks

In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique results in elimination of a large number of unnecessary computations in extracting the correct answer. In addition, to further lower computations in A2P-MANN, we suggest pruning weights of the final FC (fully-connected) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed by using the twenty question-answering (QA) tasks of bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on the CPU and GPU platforms, up to 43% runtime reduction is achieved.

preprint2022arXiv

AMR-MUL: An Approximate Maximally Redundant Signed Digit Multiplier

In this paper, we present an energy-efficient, yet high-speed approximate maximally redundant signed digit (MRSD) multiplier (called AMR-MUL) based on a parallel structure. For the reduction stage, we suggest several approximate Full-Adder (FA) reduction cells with average positive and negative errors obtained by simplifying the structure of an exact FA cell. The optimum selection of these cells for each partial product reduction stage provides the lowest possible error, turning this task into a design space exploration problem. We also provide a branch-and-bound design space exploration algorithm to find the optimal assignment of reduction cells based on a predefined constraint (i.e., the width of the approximate part) by the user. The effectiveness of the proposed (Radix-16) multiplier design is assessed under different digit counts and approximate border column. The results show that the energy consumption of the MRSD multiplier is reduced by 7x at the cost of a 1.6% accuracy loss.

preprint2022arXiv

Heterogeneous Multi-core Array-based DNN Accelerator

In this article, we investigate the impact of architectural parameters of array-based DNN accelerators on accelerator's energy consumption and performance in a wide variety of network topologies. For this purpose, we have developed a tool that simulates the execution of neural networks on array-based accelerators and has the capability of testing different configurations for the estimation of energy consumption and processing latency. Based on our analysis of the behavior of benchmark networks under different architectural parameters, we offer a few recommendations for having an efficient yet high performance accelerator design. Next, we propose a heterogeneous multi-core chip scheme for deep neural network execution. The evaluations of a selective small search space indicate that the execution of neural networks on their near-optimal core configuration can save up to 36% and 67% of energy consumption and energy-delay product respectively. Also, we suggest an algorithm to distribute the processing of network's layers across multiple cores of the same type in order to speed up the computations through model parallelism. Evaluations on different networks and with the different number of cores verify the effectiveness of the proposed algorithm in speeding up the processing to near-optimal values.

preprint2021arXiv

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low-power and high-speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed under some benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, e.g., compared to a recently published work in this field, the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.

Mehdi Kamal

What is connected

Connect this record

See the researcher in context

Building this map preview

4 published item(s)

A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks

AMR-MUL: An Approximate Maximally Redundant Signed Digit Multiplier

Heterogeneous Multi-core Array-based DNN Accelerator

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification