Source author record

Ahmed M. Eltawil

Ahmed M. Eltawil appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Information Theory, math.IT, Hardware Architecture, eess.SP, Emerging Technologies, Machine Learning, Performance, Artificial Intelligence, Genomics, Networking and Internet Architecture, Neural and Evolutionary Computing

Catalog footprint

What is connected

19 works

11 topics

4 close collaborators


preprint 2026 arXiv

Sparsity-Aware Streaming SNN Accelerator with Output-Channel Dataflow for Automatic Modulation Classification

The rapid advancement of wireless communication technologies, including 5G, emerging 6G networks, and the large-scale deployment of the Internet of Things (IoT), has intensified the need for efficient spectrum utilization. Automatic modulation classification (AMC) plays a vital role in cognitive radio systems by enabling real-time identification of modulation schemes for dynamic spectrum access and interference mitigation. While deep neural networks (DNNs) offer high classification accuracy, their computational and energy demands pose challenges for real-time edge deployment. Spiking neural networks (SNNs), with their event-driven nature, offer inherent energy efficiency, but achieving both high throughput and low power under constrained hardware resources remains challenging. This work proposes a sparsity-aware SNN streaming accelerator optimized for AMC tasks. Unlike traditional systolic arrays that exploit sparsity but suffer from low throughput, or streaming architectures that achieve high throughput but cannot fully utilize input and weight sparsity, our design integrates both advantages. By leveraging the fixed nature of kernels during inference, we apply the gated one-to-all product (GOAP) algorithm to compute only on non-zero input-weight intersections. Extra or empty iterations are precomputed and embedded into the inference dataflow, eliminating dynamic data fetches and enabling fully pipelined, control-free inter-layer execution. Implemented on an FPGA, our sparsity-aware output-channel dataflow streaming (SAOCDS) accelerator achieves 23.5 MS/s (approximately double the baseline throughput) on the RadioML 2016 dataset, while reducing dynamic power and maintaining comparable classification accuracy. These results demonstrate strong potential for real-time, low-power deployment in edge cognitive radio systems.
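The idea of computing only on non-zero input-weight intersections can be illustrated in software. The following is a minimal sketch, not the GOAP algorithm or the SAOCDS dataflow from the paper: the function name `sparse_conv1d`, the one-dimensional layout, and the list-based data are all assumptions made for illustration. It relies only on the abstract's observation that kernels are fixed during inference, so the positions of non-zero weights can be collected once ahead of time.

```python
def sparse_conv1d(spikes, kernel):
    """Valid-mode 1D correlation that visits only non-zero spike/weight pairs.

    Computes out[o] = sum_k kernel[k] * spikes[o + k], skipping every
    term in which either the spike or the weight is zero.
    """
    out_len = len(spikes) - len(kernel) + 1
    if out_len <= 0:
        return []
    out = [0.0] * out_len

    # Kernels are fixed at inference time, so this list could be built offline.
    nonzero_weights = [(k, w) for k, w in enumerate(kernel) if w != 0]
    # Input sparsity is only known at run time (event-driven spikes).
    nonzero_spikes = [t for t, s in enumerate(spikes) if s != 0]

    for t in nonzero_spikes:
        for k, w in nonzero_weights:
            o = t - k  # output position that receives this product
            if 0 <= o < out_len:
                out[o] += w * spikes[t]
    return out
```

The work done is proportional to the number of non-zero spikes times the number of non-zero weights rather than to the dense input and kernel sizes, which is the general motivation for sparsity-aware designs.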

preprint 2023 arXiv

Performance of RIS-empowered NOMA-based D2D Communication under Nakagami-m Fading

Reconfigurable intelligent surfaces (RISs) have sparked renewed interest in the research community envisioning future wireless communication networks. In this study, we analyze the performance of an RIS-enabled non-orthogonal multiple access (NOMA) based device-to-device (D2D) wireless communication system, where the RIS is partitioned to serve a pair of D2D users. Specifically, closed-form expressions are derived for the upper and lower limits of spectral efficiency (SE) and energy efficiency (EE). In addition, the performance of the proposed NOMA-based system is compared with that of its orthogonal counterpart. Extensive simulations corroborate the analytical findings. The results demonstrate that the RIS significantly enhances the performance of a NOMA-based D2D network.

preprint 2022 arXiv

BackLink: Supervised Local Training with Backward Links

Empowered by the backpropagation (BP) algorithm, deep neural networks have dominated the race in solving various cognitive tasks. The restricted training pattern in the standard BP requires end-to-end error propagation, causing large memory cost and prohibiting model parallelization. Existing local training methods aim to resolve the training obstacle by completely cutting off the backward path between modules and isolating their gradients to reduce memory cost and accelerate the training process. These methods prevent errors from flowing between modules and hence block information exchange, resulting in inferior performance. This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules. The algorithm allows information to flow backward along the network. To preserve the computational advantage of local training, BackLink restricts the error propagation length within the module. Extensive experiments performed on various deep convolutional neural networks demonstrate that our method consistently improves the classification performance of local training algorithms over other methods. For example, in ResNet32 with 16 local modules, our method surpasses the conventional greedy local training method by 4.00% and a recent work by 1.83% in accuracy on CIFAR10, respectively. Analysis of computational costs reveals that small overheads are incurred in GPU memory costs and runtime on multiple GPUs. Our method can lead to up to a 79% reduction in memory cost and a 52% reduction in simulation runtime in ResNet110 compared to the standard BP. Therefore, our method could create new opportunities for improving training algorithms towards better efficiency and biological plausibility.

preprint 2022 arXiv

Configurable Independent Component Analysis Preprocessing Accelerator

Independent component analysis (ICA) has been used in many applications, including self-interference cancellation for in-band full-duplex wireless systems and anomaly detection in the industrial internet of things. This paper presents a high-throughput and highly efficient configurable preprocessing accelerator for the ICA algorithm. The proposed ICA accelerator has three major blocks that perform data centering, covariance matrix computation, and eigenvalue decomposition (EVD). Specifically, the proposed accelerator is based on a high-performance matrix multiplication array (MMA). The proposed MMA architecture uses time-multiplexed processing so that the efficiency of hardware utilization is greatly enhanced. Furthermore, the processing flow utilizes parallel processing such that the centering, the calculation of the covariance matrix, and EVD are conducted simultaneously and are individually pipelined to maximize throughput. This paper presents the architecture, circuit design, and performance estimates based on post-layout extraction of the proposed preprocessing ICA accelerator. The proposed design achieves a throughput of 40.7 kMatrices per second at a complexity of 73.3 kGE.
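The three preprocessing stages named in the abstract (centering, covariance computation, and EVD) can be sketched in software as a reference. This is a minimal illustration under stated assumptions, not the accelerator's architecture: the function names are hypothetical, rows are assumed to be signals and columns samples, the covariance uses the unbiased 1/(n-1) normalization, and the EVD step is limited to the closed-form eigenvalues of a 2x2 symmetric matrix.

```python
import math


def center(data):
    """Subtract each row's mean (rows are signals, columns are samples)."""
    centered = []
    for row in data:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def covariance(centered):
    """Sample covariance matrix of already-centered rows (1/(n-1) scaling)."""
    n = len(centered[0])
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centered]
        for ri in centered
    ]


def eigenvalues_2x2_symmetric(c):
    """Closed-form eigenvalues of a 2x2 symmetric matrix, largest first."""
    trace = c[0][0] + c[1][1]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    # Clamp at zero to guard against tiny negative values from rounding.
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return ((trace + disc) / 2.0, (trace - disc) / 2.0)


if __name__ == "__main__":
    signals = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
    cov = covariance(center(signals))
    print(cov, eigenvalues_2x2_symmetric(cov))
```

A general EVD for larger matrices would need an iterative method, which is omitted here to keep the sketch short.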

preprint 2022 arXiv

DNA Pattern Matching Acceleration with Analog Resistive CAM

DNA pattern matching is essential for many widely used bioinformatics applications. Disease diagnosis is one of these applications, since analyzing changes in DNA sequences can increase our understanding of possible genetic diseases. The remarkable growth in the size of DNA datasets has resulted in challenges in discovering DNA patterns efficiently in terms of run time and power consumption. In this paper, we propose an efficient hardware and software codesign that determines the chance of the occurrence of repeat-expansion diseases using DNA pattern matching. The proposed design parallelizes the DNA pattern matching task using associative memory realized with analog content-addressable memory and implements an algorithm that returns the maximum number of consecutive occurrences of a specific pattern within a DNA sequence. We fully implement all the required hardware circuits with PTM 45-nm technology, and we evaluate the proposed architecture on a practical human DNA dataset. The results show that our design is energy-efficient and significantly accelerates the DNA pattern matching task compared to previous approaches described in the literature.
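The quantity the abstract describes, the maximum number of consecutive occurrences of a pattern within a DNA sequence, can be computed in software as follows. This is a plain sequential sketch for reference only, not the paper's analog-CAM implementation; the function name `max_consecutive_repeats` and the example sequence are assumptions, and "CAG" is used purely as an illustrative pattern.

```python
def max_consecutive_repeats(sequence: str, pattern: str) -> int:
    """Return the longest run of back-to-back copies of `pattern` in `sequence`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    n, m = len(sequence), len(pattern)

    # runs[i] holds the number of consecutive copies of `pattern` starting at i.
    # Extra trailing zeros let runs[i + m] be read safely at the sequence end.
    runs = [0] * (n + m + 1)
    best = 0

    # Scan right to left so runs[i + m] is already known when visiting i.
    for i in range(n - m, -1, -1):
        if sequence[i:i + m] == pattern:
            runs[i] = runs[i + m] + 1
            best = max(best, runs[i])
    return best


if __name__ == "__main__":
    print(max_consecutive_repeats("ACAGCAGCAGTTCAG", "CAG"))
```

The scan performs one substring comparison per position, whereas the paper's design performs the matching in parallel in associative memory.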

preprint 2022 arXiv

Efficient Analog CAM Design

Content-addressable memories (CAMs) are considered a key enabler for in-memory computing (IMC). IMC shows order-of-magnitude improvements in energy efficiency and throughput compared to traditional computing techniques. Recently, analog CAMs (aCAMs) were proposed as a means to improve storage density and energy efficiency. In this work, we propose two new aCAM cells that improve data encoding and robustness compared to existing aCAM cells. We propose a methodology to choose the margin and interval width for data encoding. In addition, we perform a comprehensive comparison against prior work in terms of the number of intervals, noise sensitivity, dynamic range, energy, latency, area, and probability of failure.

preprint 2022 arXiv

In-memory Associative Processors: Tutorial, Potential, and Challenges

In-memory computing is an emerging computing paradigm that overcomes the limitations of existing von Neumann computing architectures, such as the memory-wall bottleneck. In such a paradigm, the computations are performed directly on the data stored in the memory, which greatly reduces memory-processor communication during computation. Hence, significant speedup and energy savings could be achieved, especially with data-intensive applications. Associative processors (APs) were proposed in the seventies and were recently revived thanks to high-density memories. In this tutorial brief, we overview the functionalities and recent trends of APs in addition to the implementation of each content-addressable memory with different technologies. The AP operations and runtime complexity are also summarized. We also explain and explore the possible applications that can benefit from APs. Finally, the AP limitations, challenges, and future directions are discussed.

preprint 2021 arXiv

Deep Learning Based Frequency-Selective Channel Estimation for Hybrid mmWave MIMO Systems

Millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems typically employ hybrid mixed signal processing to avoid expensive hardware and high training overheads. However, the lack of fully digital beamforming at mmWave bands imposes additional challenges in channel estimation. Prior art on hybrid architectures has mainly focused on greedy optimization algorithms to estimate frequency-flat narrowband mmWave channels, despite the fact that in practice, the large bandwidth associated with mmWave channels results in frequency-selective channels. In this paper, we consider a frequency-selective wideband mmWave system and propose two deep learning (DL) compressive sensing (CS) based algorithms for channel estimation. The proposed algorithms learn critical a priori information from training data to provide highly accurate channel estimates with low training overhead. In the first approach, a DL-CS based algorithm simultaneously estimates the channel supports in the frequency domain, which are then used for channel reconstruction. The second approach exploits the estimated supports to apply a low-complexity multi-resolution fine-tuning method to further enhance the estimation performance. Simulation results demonstrate that the proposed DL-based schemes significantly outperform conventional orthogonal matching pursuit (OMP) techniques in terms of the normalized mean-squared error (NMSE), computational complexity, and spectral efficiency, particularly in the low signal-to-noise ratio regime. When compared to OMP approaches that achieve an NMSE gap of 4-10 dB with respect to the Cramér-Rao Lower Bound (CRLB), the proposed algorithms reduce the CRLB gap to only 1-1.5 dB, while significantly reducing complexity by two orders of magnitude.

preprint 2021 arXiv

Efficient Training of Spiking Neural Networks with Temporally-Truncated Local Backpropagation through Time

Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from a large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised training methods. This work proposes an efficient and direct training algorithm for SNNs that integrates a locally-supervised training method with a temporally-truncated BPTT algorithm. The proposed algorithm explores both temporal and spatial locality in BPTT and contributes to a significant reduction in computational cost, including GPU memory utilization, main memory access, and arithmetic operations. We thoroughly explore the design space concerning temporal truncation length and local training block size and benchmark their impact on the classification accuracy of different networks running different types of tasks. The results reveal that temporal truncation has a negative effect on the accuracy of classifying frame-based datasets, but leads to improvement in accuracy on dynamic-vision-sensor (DVS) recorded datasets. In spite of the resulting information loss, local training is capable of alleviating overfitting. The combined effect of temporal truncation and local training can slow the accuracy drop and even improve accuracy. In addition, training deep SNN models such as AlexNet on the CIFAR10-DVS dataset leads to a 7.26% increase in accuracy, an 89.94% reduction in GPU memory, a 10.79% reduction in memory access, and a 99.64% reduction in MAC operations compared to the standard end-to-end BPTT.

preprint 2020 arXiv

A Non-Ideal NOMA-based mmWave D2D Networks with Hardware and CSI Imperfections

This letter investigates a non-orthogonal multiple access (NOMA) assisted millimeter-wave device-to-device (D2D) network practically limited by multiple interference noises, transceiver hardware impairments, imperfect successive interference cancellation, and channel state information mismatch. Generalized outage probability expressions for NOMA-D2D users are derived, and the results, validated by Monte Carlo simulations, are compared with orthogonal multiple access to show the superior performance of the proposed network model.

preprint 2020 arXiv

Hardware and Interference Limited Cooperative CR-NOMA Networks under Imperfect SIC and CSI

The conflation of cognitive radio (CR) and non-orthogonal multiple access (NOMA) concepts is a promising approach to fulfil the massive connectivity goals of future networks given the spectrum scarcity. Accordingly, this letter investigates the outage performance of imperfect cooperative CR-NOMA networks under hardware impairments and interference. Our analysis derives the end-to-end outage probability (OP) for secondary NOMA users by accounting for imperfect channel state information (CSI), as well as the residual interference caused by successive interference cancellation (SIC) errors and coexisting primary/secondary users. The numerical results, validated by Monte Carlo simulations, show that the CR-NOMA network provides superior outage performance over orthogonal multiple access. As imperfections become more significant, however, CR-NOMA is observed to deliver relatively poor outage performance.

preprint 2020 arXiv

UAV-Assisted Cooperative & Cognitive NOMA: Deployment, Clustering, and Resource Allocation

Cooperative and cognitive non-orthogonal multiple access (CCR-NOMA) has been recognized as a promising technique to overcome issues of spectrum scarcity and support the massive connectivity envisioned in next-generation wireless networks. In this paper, we investigate the deployment of an unmanned aerial vehicle (UAV) as a relay that fairly serves a large number of secondary users in a hot-spot region. The UAV deployment algorithm must jointly account for user clustering, channel assignment, and resource allocation sub-problems. We propose a solution methodology that obtains user clustering and channel assignment based on the optimal resource allocations for a given UAV location. To this end, we derive closed-form optimal power and time allocations and show that they deliver optimal max-min fair throughput while consuming less energy and time than geometric programming. Based on optimal resource allocation, the optimal coverage probability is also provided in closed form, which takes channel estimation errors, hardware impairments, and primary network interference into account. The optimal coverage probabilities are used by the proposed max-min fair user clustering and channel assignment approaches. The results show that the proposed method achieves 100% accuracy in more than five orders of magnitude less time than the optimal benchmark.

preprint 2019 arXiv

Performance Analysis and Enhancements for In-Band Full-Duplex Wireless Local Area Networks

In-Band Full-Duplex (IBFD) is a technique that enables a wireless node to simultaneously transmit a signal and receive another on the same assigned frequency. Thus, IBFD wireless systems can provide up to twice the channel capacity compared to conventional Half-Duplex (HD) systems. In order to study the feasibility of IBFD networks, reliable models are needed to capture anticipated benefits of IBFD above the physical layer (PHY). In this paper, an accurate analytical model based on Discrete-Time Markov Chain (DTMC) analysis for IEEE 802.11 Distributed Coordination Function (DCF) with IBFD capabilities is proposed. The model captures all parameters necessary to calculate important performance metrics which quantify enhancements introduced as a result of IBFD solutions. Additionally, two frame aggregation schemes for Wireless Local Area Networks (WLANs) with IBFD features are proposed to increase the efficiency of data transmission. Matching analytical and simulation results with less than 1% average errors confirm that the proposed frame aggregation schemes further improve the overall throughput by up to 24% and reduce latency by up to 47% in practical IBFD-WLANs. More importantly, the results assert that IBFD transmission can only reduce latency to a suboptimal point in WLANs, but frame aggregation is necessary to minimize it.

preprint 2014 arXiv

All-Digital Self-interference Cancellation Technique for Full-duplex Systems

Full-duplex systems are expected to double the spectral efficiency compared to conventional half-duplex systems if the self-interference signal can be significantly mitigated. Digital cancellation is one of the lowest complexity self-interference cancellation techniques in full-duplex systems. However, its mitigation capability is very limited, mainly due to transmitter and receiver circuit impairments. In this paper, we propose a novel digital self-interference cancellation technique for full-duplex systems. The proposed technique is shown to significantly mitigate the self-interference signal as well as the associated transmitter and receiver impairments. In the proposed technique, an auxiliary receiver chain is used to obtain a digital-domain copy of the transmitted Radio Frequency (RF) self-interference signal. The self-interference copy is then used in the digital domain to cancel out both the self-interference signal and the associated impairments. Furthermore, to alleviate the receiver phase noise effect, a common oscillator is shared between the auxiliary and ordinary receiver chains. A thorough analytical and numerical analysis of the effect of the transmitter and receiver impairments on the cancellation capability of the proposed technique is presented. Finally, the overall performance is numerically investigated, showing that with the proposed technique the self-interference signal could be mitigated to ~3 dB above the receiver noise floor, which results in up to 76% rate improvement compared to conventional half-duplex systems at a transmit power of 20 dBm.

preprint 2014 arXiv

Full-Duplex Systems Using Multi-Reconfigurable Antennas

Full-duplex systems are expected to achieve 100% rate improvement over half-duplex systems if the self-interference signal can be significantly mitigated. In this paper, we propose the first full-duplex system utilizing a Multi-Reconfigurable Antenna (MRA), with approximately 90% rate improvement compared to half-duplex systems. The MRA is a dynamically reconfigurable antenna structure that is capable of changing its properties according to certain input configurations. A comprehensive experimental analysis is conducted to characterize the system performance in typical indoor environments. The experiments are performed using a fabricated MRA that has 4096 configurable radiation patterns. The achieved MRA-based passive self-interference suppression is investigated, with detailed analysis of the MRA training overhead. In addition, a heuristic-based approach is proposed to reduce the MRA training overhead. The results show that at 1% training overhead, a total of 95 dB self-interference cancellation is achieved in typical indoor environments. The 95 dB self-interference cancellation is experimentally shown to be sufficient for 90% full-duplex rate improvement compared to half-duplex systems.

preprint 2014 arXiv

On Phase Noise Suppression in Full-Duplex Systems

Oscillator phase noise has been shown to be one of the main performance limiting factors in full-duplex systems. In this paper, we consider the problem of self-interference cancellation with phase noise suppression in full-duplex systems. The feasibility of performing phase noise suppression in full-duplex systems in terms of both complexity and achieved gain is analytically and experimentally investigated. First, the effect of phase noise on full-duplex systems and the possibility of performing phase noise suppression are studied. Two different phase noise suppression techniques with a detailed complexity analysis are then proposed. For each suppression technique, both free-running and phase locked loop based oscillators are considered. Due to the fact that full-duplex system performance highly depends on hardware impairments, experimental analysis is essential for reliable results. In this paper, the performance of the proposed techniques is experimentally investigated in a typical indoor environment. The experimental results are shown to confirm the results obtained from numerical simulations on two different experimental research platforms. At the end, the tradeoff between the required complexity and the gain achieved using phase noise suppression is discussed.

preprint 2014 arXiv

State Dependent Statistical Timing Model for Voltage Scaled Circuits

This paper presents a novel statistical state-dependent timing model for voltage over-scaled (VoS) logic circuits that accurately and rapidly finds the timing distribution of output bits. Using this model, erroneous VoS circuits can be represented as error-free circuits combined with an error injector. A case study of a two-point DFT unit employing the proposed model is presented and compared to HSPICE circuit simulation. Results show an accurate match, with significant speedup gains.

preprint 2013 arXiv

Self-Interference Cancellation with Nonlinear Distortion Suppression for Full-Duplex Systems

In full-duplex systems, due to the strong self-interference signal, system nonlinearities become a significant limiting factor that bounds the possible cancellable self-interference power. In this paper, a self-interference cancellation scheme for full-duplex orthogonal frequency division multiplexing systems is proposed. The proposed scheme increases the amount of cancellable self-interference power by suppressing the distortion caused by the transmitter and receiver nonlinearities. An iterative technique is used to jointly estimate the self-interference channel and the nonlinearity coefficients required to suppress the distortion signal. The performance is numerically investigated, showing that the proposed scheme comes within 0.5 dB of the performance of a linear full-duplex system.

preprint 2013 arXiv

Self-Interference Cancellation with Phase Noise Induced ICI Suppression for Full-Duplex Systems

One of the main bottlenecks in practical full-duplex systems is the oscillator phase noise, which bounds the possible cancellable self-interference power. In this paper, a digital-domain self-interference cancellation scheme for full-duplex orthogonal frequency division multiplexing systems is proposed. The proposed scheme increases the amount of cancellable self-interference power by suppressing the effect of both transmitter and receiver oscillator phase noise. The proposed scheme consists of two main phases, an estimation phase and a cancellation phase. In the estimation phase, the minimum mean square error estimator is used to jointly estimate the transmitter and receiver phase noise associated with the incoming self-interference signal. In the cancellation phase, the estimated phase noise is used to suppress the intercarrier interference caused by the phase noise associated with the incoming self-interference signal. The performance of the proposed scheme is numerically investigated under different operating conditions. It is demonstrated that the proposed scheme could achieve up to 9 dB more self-interference cancellation than the existing digital-domain cancellation schemes that ignore the intercarrier interference suppression.

