Researcher profile

Ahmed M. Eltawil

Ahmed M. Eltawil contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, unclaimed author
13 works
0 followers
11 topics
4 close collaborators

Published work

13 published items

Preprint, 2026, arXiv

Sparsity-Aware Streaming SNN Accelerator with Output-Channel Dataflow for Automatic Modulation Classification

The rapid advancement of wireless communication technologies, including 5G, emerging 6G networks, and the large-scale deployment of the Internet of Things (IoT), has intensified the need for efficient spectrum utilization. Automatic modulation classification (AMC) plays a vital role in cognitive radio systems by enabling real-time identification of modulation schemes for dynamic spectrum access and interference mitigation. While deep neural networks (DNNs) offer high classification accuracy, their computational and energy demands pose challenges for real-time edge deployment. Spiking neural networks (SNNs), with their event-driven nature, offer inherent energy efficiency, but achieving both high throughput and low power under constrained hardware resources remains challenging. This work proposes a sparsity-aware SNN streaming accelerator optimized for AMC tasks. Unlike traditional systolic arrays that exploit sparsity but suffer from low throughput, or streaming architectures that achieve high throughput but cannot fully utilize input and weight sparsity, our design integrates both advantages. By leveraging the fixed nature of kernels during inference, we apply the gated one-to-all product (GOAP) algorithm to compute only on non-zero input-weight intersections. Extra or empty iterations are precomputed and embedded into the inference dataflow, eliminating dynamic data fetches and enabling fully pipelined, control-free inter-layer execution. Implemented on an FPGA, our sparsity-aware output-channel dataflow streaming (SAOCDS) accelerator achieves 23.5 MS/s (approximately double the baseline throughput) on the RadioML 2016 dataset, while reducing dynamic power and maintaining comparable classification accuracy. These results demonstrate strong potential for real-time, low-power deployment in edge cognitive radio systems.
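The abstract above describes computing only on non-zero input-weight intersections. The Python sketch below illustrates that general zero-skipping idea for a single output accumulation; it is not the GOAP algorithm or the SAOCDS dataflow from the paper, and the function name `sparse_accumulate` and the example values are assumptions made purely for illustration.

```python
def sparse_accumulate(spikes, weights):
    """Accumulate weights only where the input spiked AND the weight is non-zero.

    Hypothetical helper for illustration; returns (sum, number of additions performed).
    """
    # Weights are fixed during inference, so their non-zero positions
    # can be determined once, ahead of time.
    nonzero_weight_idx = [i for i, w in enumerate(weights) if w != 0]

    total = 0
    operations = 0
    for i in nonzero_weight_idx:
        if spikes[i]:  # binary spike: 1 means the input neuron fired
            total += weights[i]  # a spike times a weight is just the weight
            operations += 1
    return total, operations


# Illustrative values (assumed): 6 inputs, only 2 useful intersections.
spikes = [1, 0, 1, 1, 0, 0]
weights = [2, 5, 0, -3, 0, 4]
result, ops = sparse_accumulate(spikes, weights)
dense = sum(s * w for s, w in zip(spikes, weights))
print(result, ops, dense)  # prints: -1 2 -1
```

The sparse result matches the dense dot product while performing two additions instead of six multiply-accumulates, which is the kind of saving a sparsity-aware design aims for.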

Preprint, 2023, arXiv

Performance of RIS-empowered NOMA-based D2D Communication under Nakagami-m Fading

Reconfigurable intelligent surfaces (RISs) have sparked renewed interest in the research community envisioning future wireless communication networks. In this study, we analyze the performance of an RIS-enabled non-orthogonal multiple access (NOMA) based device-to-device (D2D) wireless communication system, where the RIS is partitioned to serve a pair of D2D users. Specifically, closed-form expressions are derived for the upper and lower limits of spectral efficiency (SE) and energy efficiency (EE). In addition, the performance of the proposed NOMA-based system is compared with its orthogonal counterpart. Extensive simulations corroborate the analytical findings. The results demonstrate that RIS significantly enhances the performance of a NOMA-based D2D network.

Preprint, 2022, arXiv

BackLink: Supervised Local Training with Backward Links

Empowered by the backpropagation (BP) algorithm, deep neural networks have dominated the race in solving various cognitive tasks. The restricted training pattern in standard BP requires end-to-end error propagation, causing large memory cost and prohibiting model parallelization. Existing local training methods aim to resolve the training obstacle by completely cutting off the backward path between modules and isolating their gradients to reduce memory cost and accelerate the training process. These methods prevent errors from flowing between modules and hence block information exchange, resulting in inferior performance. This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules. The algorithm allows information to flow backward along the network. To preserve the computational advantage of local training, BackLink restricts the error propagation length within the module. Extensive experiments performed in various deep convolutional neural networks demonstrate that our method consistently improves the classification performance of local training algorithms over other methods. For example, in ResNet32 with 16 local modules, our method surpasses the conventional greedy local training method by 4.00% and a recent work by 1.83% in accuracy on CIFAR10, respectively. Analysis of computational costs reveals that small overheads are incurred in GPU memory costs and runtime on multiple GPUs. Our method can lead to up to a 79% reduction in memory cost and 52% in simulation runtime in ResNet110 compared to standard BP. Therefore, our method could create new opportunities for improving training algorithms towards better efficiency and biological plausibility.

Preprint, 2022, arXiv

Configurable Independent Component Analysis Preprocessing Accelerator

Independent component analysis (ICA) has been used in many applications, including self-interference cancellation for in-band full-duplex wireless systems and anomaly detection in the industrial internet of things. This paper presents a high-throughput and highly efficient configurable preprocessing accelerator for the ICA algorithm. The proposed ICA accelerator has three major blocks that perform data centering, covariance matrix computation, and eigenvalue decomposition (EVD). Specifically, the proposed accelerator is based on a high-performance matrix multiplication array (MMA). The proposed MMA architecture uses time-multiplexed processing so that the efficiency of hardware utilization is greatly enhanced. Furthermore, the processing flow utilizes parallel processing such that the centering, the calculation of the covariance matrix, and EVD are conducted simultaneously and are individually pipelined to maximize throughput. This paper presents the architecture, circuit design, and performance estimates based on post-layout extraction of the proposed preprocessing ICA accelerator. The proposed design achieves a throughput of 40.7 kMatrices per second at a complexity of 73.3 kGE.
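The three preprocessing stages named in the abstract above (centering, covariance matrix computation, and EVD) can be sketched in software as follows. This is a minimal pure-Python reference under stated assumptions, not the accelerator's hardware architecture: the function names are hypothetical, the data layout (rows as samples, columns as channels) is assumed, and the eigenvalue step handles only the symmetric 2x2 case in closed form to keep the example short.

```python
import math


def center(samples):
    """Subtract the per-channel mean (rows = samples, columns = channels)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in samples]


def covariance(centered):
    """Sample covariance (1 / (n - 1)) * X^T X of already-centered data."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def symmetric_2x2_eigenvalues(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mid = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mid + radius, mid - radius


# Illustrative two-channel data (assumed values).
samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
cov = covariance(center(samples))
print(cov)                             # prints: [[4.0, 8.0], [8.0, 16.0]]
print(symmetric_2x2_eigenvalues(cov))  # prints: (20.0, 0.0)
```

For larger matrices, a general symmetric eigensolver would replace the 2x2 closed form; the example data are perfectly correlated, which is why one eigenvalue is zero.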

Preprint, 2022, arXiv

DNA Pattern Matching Acceleration with Analog Resistive CAM

DNA pattern matching is essential for many widely used bioinformatics applications. Disease diagnosis is one of these applications, since analyzing changes in DNA sequences can increase our understanding of possible genetic diseases. The remarkable growth in the size of DNA datasets has resulted in challenges in discovering DNA patterns efficiently in terms of run time and power consumption. In this paper, we propose an efficient hardware and software codesign that determines the chance of the occurrence of repeat-expansion diseases using DNA pattern matching. The proposed design parallelizes the DNA pattern matching task using associative memory realized with analog content-addressable memory and implements an algorithm that returns the maximum number of consecutive occurrences of a specific pattern within a DNA sequence. We fully implement all the required hardware circuits with PTM 45-nm technology, and we evaluate the proposed architecture on a practical human DNA dataset. The results show that our design is energy-efficient and significantly accelerates the DNA pattern matching task compared to previous approaches described in the literature.
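The abstract above mentions an algorithm that returns the maximum number of consecutive occurrences of a specific pattern within a DNA sequence. A plain software version of that computation might look like the sketch below; it is a sequential illustration only, not the analog-CAM-parallelized design from the paper, and the function name and example sequence are assumptions.

```python
def max_consecutive_repeats(sequence: str, pattern: str) -> int:
    """Return the longest run of back-to-back copies of `pattern` in `sequence`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    n, m = len(sequence), len(pattern)
    # runs[i] = number of consecutive copies of `pattern` starting at index i.
    # Extra trailing zeros let runs[i + m] be read safely near the end.
    runs = [0] * (n + m + 1)
    best = 0
    # Scan right to left so runs[i + m] is already known when index i is visited.
    for i in range(n - m, -1, -1):
        if sequence[i:i + m] == pattern:
            runs[i] = runs[i + m] + 1
            best = max(best, runs[i])
    return best


# Illustrative sequence (assumed): three back-to-back CAG repeats, then an isolated one.
print(max_consecutive_repeats("AACAGCAGCAGTTCAG", "CAG"))  # prints: 3
```

Scanning from the right means each position is compared once, so the cost is proportional to the sequence length times the pattern length.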

Preprint, 2022, arXiv

Efficient Analog CAM Design

Content Addressable Memories (CAMs) are considered a key enabler for in-memory computing (IMC). IMC shows order-of-magnitude improvements in energy efficiency and throughput compared to traditional computing techniques. Recently, analog CAMs (aCAMs) were proposed as a means to improve storage density and energy efficiency. In this work, we propose two new aCAM cells to improve data encoding and robustness as compared to existing aCAM cells. We propose a methodology to choose the margin and interval width for data encoding. In addition, we perform a comprehensive comparison against prior work in terms of the number of intervals, noise sensitivity, dynamic range, energy, latency, area, and probability of failure.

Preprint, 2022, arXiv

In-memory Associative Processors: Tutorial, Potential, and Challenges

In-memory computing is an emerging computing paradigm that overcomes the limitations of existing von Neumann computing architectures, such as the memory-wall bottleneck. In such a paradigm, computations are performed directly on the data stored in the memory, which greatly reduces memory-processor communication during computation. Hence, significant speedup and energy savings can be achieved, especially with data-intensive applications. Associative processors (APs) were proposed in the seventies and were recently revived thanks to high-density memories. In this tutorial brief, we overview the functionalities and recent trends of APs in addition to the implementation of each content-addressable memory with different technologies. The AP operations and runtime complexity are also summarized. We also explain and explore the possible applications that can benefit from APs. Finally, the AP limitations, challenges, and future directions are discussed.

Preprint, 2021, arXiv

Deep Learning Based Frequency-Selective Channel Estimation for Hybrid mmWave MIMO Systems

Millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems typically employ hybrid mixed signal processing to avoid expensive hardware and high training overheads. However, the lack of fully digital beamforming at mmWave bands imposes additional challenges in channel estimation. Prior art on hybrid architectures has mainly focused on greedy optimization algorithms to estimate frequency-flat narrowband mmWave channels, despite the fact that in practice, the large bandwidth associated with mmWave channels results in frequency-selective channels. In this paper, we consider a frequency-selective wideband mmWave system and propose two deep learning (DL) compressive sensing (CS) based algorithms for channel estimation. The proposed algorithms learn critical a priori information from training data to provide highly accurate channel estimates with low training overhead. In the first approach, a DL-CS based algorithm simultaneously estimates the channel supports in the frequency domain, which are then used for channel reconstruction. The second approach exploits the estimated supports to apply a low-complexity multi-resolution fine-tuning method to further enhance the estimation performance. Simulation results demonstrate that the proposed DL-based schemes significantly outperform conventional orthogonal matching pursuit (OMP) techniques in terms of the normalized mean-squared error (NMSE), computational complexity, and spectral efficiency, particularly in the low signal-to-noise ratio regime. When compared to OMP approaches that achieve an NMSE gap of 4-10 dB with respect to the Cramér-Rao Lower Bound (CRLB), the proposed algorithms reduce the CRLB gap to only 1-1.5 dB, while significantly reducing complexity by two orders of magnitude.
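The NMSE metric reported in the abstract above is commonly defined as the squared estimation error divided by the squared norm of the true channel, expressed in dB. The sketch below shows that common definition for a single channel realization; the paper's exact normalization (for example, averaging over many realizations) is not stated here, so treat the function name `nmse_db` and the sample values as assumptions.

```python
import math


def nmse_db(true_channel, estimate):
    """NMSE in dB for one realization: 10 * log10(||h - h_hat||^2 / ||h||^2)."""
    error_energy = sum(abs(h - g) ** 2 for h, g in zip(true_channel, estimate))
    channel_energy = sum(abs(h) ** 2 for h in true_channel)
    return 10 * math.log10(error_energy / channel_energy)


# Illustrative complex channel taps (assumed values).
h_true = [1 + 0j, 0 + 1j]
h_est = [0.9 + 0j, 0 + 1j]
print(round(nmse_db(h_true, h_est), 2))  # prints: -23.01
```

A more negative value means a more accurate estimate, so narrowing the gap to the CRLB corresponds to pushing this number down toward the bound.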

Preprint, 2021, arXiv

Efficient Training of Spiking Neural Networks with Temporally-Truncated Local Backpropagation through Time

Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from a large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised training methods. This work proposes an efficient and direct training algorithm for SNNs that integrates a locally-supervised training method with a temporally-truncated BPTT algorithm. The proposed algorithm explores both temporal and spatial locality in BPTT and contributes to a significant reduction in computational cost, including GPU memory utilization, main memory access, and arithmetic operations. We thoroughly explore the design space concerning temporal truncation length and local training block size and benchmark their impact on the classification accuracy of different networks running different types of tasks. The results reveal that temporal truncation has a negative effect on the accuracy of classifying frame-based datasets, but leads to improvement in accuracy on dynamic-vision-sensor (DVS) recorded datasets. In spite of the resulting information loss, local training is capable of alleviating overfitting. The combined effect of temporal truncation and local training can slow the accuracy drop and even improve accuracy. In addition, training deep SNN models such as AlexNet on the CIFAR10-DVS dataset leads to a 7.26% increase in accuracy, an 89.94% reduction in GPU memory, a 10.79% reduction in memory access, and a 99.64% reduction in MAC operations compared to standard end-to-end BPTT.

Preprint, 2020, arXiv

A Non-Ideal NOMA-based mmWave D2D Networks with Hardware and CSI Imperfections

This letter investigates a non-orthogonal multiple access (NOMA) assisted millimeter-wave device-to-device (D2D) network practically limited by multiple interference noises, transceiver hardware impairments, imperfect successive interference cancellation, and channel state information mismatch. Generalized outage probability expressions for NOMA-D2D users are derived, and the results, validated by Monte Carlo simulations, are compared with orthogonal multiple access to show the superior performance of the proposed network model.

Preprint, 2020, arXiv

Hardware and Interference Limited Cooperative CR-NOMA Networks under Imperfect SIC and CSI

The combination of cognitive radio (CR) and nonorthogonal multiple access (NOMA) concepts is a promising approach to fulfil the massive connectivity goals of future networks given the spectrum scarcity. Accordingly, this letter investigates the outage performance of imperfect cooperative CR-NOMA networks under hardware impairments and interference. Our analysis derives the end-to-end outage probability (OP) for secondary NOMA users by accounting for imperfect channel state information (CSI), as well as the residual interference caused by successive interference cancellation (SIC) errors and coexisting primary/secondary users. The numerical results, validated by Monte Carlo simulations, show that the CR-NOMA network provides superior outage performance over orthogonal multiple access. As imperfections become more significant, CR-NOMA is observed to deliver relatively poor outage performance.

Preprint, 2020, arXiv

UAV-Assisted Cooperative & Cognitive NOMA: Deployment, Clustering, and Resource Allocation

Cooperative and cognitive non-orthogonal multiple access (CCR-NOMA) has been recognized as a promising technique to overcome issues of spectrum scarcity and support the massive connectivity envisioned in next-generation wireless networks. In this paper, we investigate the deployment of an unmanned aerial vehicle (UAV) as a relay that fairly serves a large number of secondary users in a hot-spot region. The UAV deployment algorithm must jointly account for user clustering, channel assignment, and resource allocation sub-problems. We propose a solution methodology that obtains user clustering and channel assignment based on the optimal resource allocations for a given UAV location. To this end, we derive closed-form optimal power and time allocations and show that they deliver optimal max-min fair throughput while consuming less energy and time than geometric programming. Based on optimal resource allocation, the optimal coverage probability is also provided in closed form, which takes channel estimation errors, hardware impairments, and primary network interference into account. The optimal coverage probabilities are used by the proposed max-min fair user clustering and channel assignment approaches. The results show that the proposed method achieves 100% accuracy in more than five orders of magnitude less time than the optimal benchmark.

Preprint, 2019, arXiv

Performance Analysis and Enhancements for In-Band Full-Duplex Wireless Local Area Networks

In-Band Full-Duplex (IBFD) is a technique that enables a wireless node to simultaneously transmit a signal and receive another on the same assigned frequency. Thus, IBFD wireless systems can provide up to twice the channel capacity of conventional Half-Duplex (HD) systems. In order to study the feasibility of IBFD networks, reliable models are needed to capture the anticipated benefits of IBFD above the physical layer (PHY). In this paper, an accurate analytical model based on Discrete-Time Markov Chain (DTMC) analysis for the IEEE 802.11 Distributed Coordination Function (DCF) with IBFD capabilities is proposed. The model captures all parameters necessary to calculate important performance metrics that quantify the enhancements introduced by IBFD solutions. Additionally, two frame aggregation schemes for Wireless Local Area Networks (WLANs) with IBFD features are proposed to increase the efficiency of data transmission. Matching analytical and simulation results, with less than 1% average error, confirm that the proposed frame aggregation schemes further improve the overall throughput by up to 24% and reduce latency by up to 47% in practical IBFD-WLANs. More importantly, the results show that IBFD transmission alone can only reduce latency to a suboptimal point in WLANs, and that frame aggregation is necessary to minimize it.