Source author record

Wassim Hamidouche

Wassim Hamidouche appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

eess.IV; Computer Vision; eess.SP; Cryptography and Security; Neural and Evolutionary Computing; Machine Learning; Computation and Language; Distributed, Parallel, and Cluster Computing; Multimedia; Neurons and Cognition; Social and Information Networks

Catalog footprint

What is connected

20 works

11 topics

4 close collaborators


preprint, 2026, arXiv

BYOL: Bring Your Own Language Into LLMs

Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol .
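The abstract mentions weight-space model merging without giving the recipe. As a minimal sketch, assuming the simplest variant (linear interpolation between two checkpoints that share parameter names), the idea can be illustrated as follows. The function name `merge_weights`, the coefficient `alpha`, and the toy checkpoints are hypothetical and are not taken from the BYOL codebase.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(base: Weights, adapted: Weights, alpha: float) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter names.

    alpha = 0 returns the base model, alpha = 1 returns the adapted model.
    This is an assumed, simplest-case merging rule, not the paper's method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share parameter names")
    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base[name], adapted[name])
        ]
    return merged


# Toy checkpoints (hypothetical values) standing in for real model weights.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
adapted = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_weights(base, adapted, alpha=0.5)
```

With `alpha` between 0 and 1, the merged model sits between the multilingual base and the language-specific model in weight space, which is one common way to trade new-language gains against retained capabilities.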

preprint, 2026, arXiv

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

preprint, 2022, arXiv

Adversarial Example Detection for DNN Models: A Review and Experimental Comparison

Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications can only be deployed successfully once they are able to overcome safety-critical challenges. Among these challenges are the defense against and/or the detection of adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise, called perturbations, to be added to a clean image to generate an AE. The aim of an AE is to fool the DL model, which makes AEs a potential risk for DL applications. Many test-time evasion attacks and countermeasures, i.e., defense or detection methods, have been proposed in the literature. Moreover, a few reviews and surveys have been published that present theoretical taxonomies of the threats and the countermeasure methods, with little focus on AE detection methods. In this paper, we focus on the image classification task and attempt to provide a survey of detection methods for test-time evasion attacks on neural network classifiers. A detailed discussion of such methods is provided, with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also provide potential challenges and future perspectives for this research direction.
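To make the notion of a perturbation concrete, here is a minimal sketch of one well-known test-time evasion attack, the fast gradient sign method (FGSM), applied to a toy logistic-regression classifier. FGSM is chosen here only as an illustration; the abstract does not say which attacks the survey evaluates. The weights, input, and `eps` value are hypothetical.

```python
import math
from typing import List


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One FGSM step: move each input value by eps in the direction
    that increases the loss, then clip to the valid range [0, 1]."""
    p = predict(w, b, x)
    # Gradient of binary cross-entropy w.r.t. the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]


# Hypothetical model and clean input with true label 1.
w, b = [2.0, -1.0], 0.0
x_clean, y_true = [0.5, 0.2], 1
x_adv = fgsm(w, b, x_clean, y_true, eps=0.3)

p_clean = predict(w, b, x_clean)  # above 0.5: classified as class 1
p_adv = predict(w, b, x_adv)      # below 0.5: classified as class 0
```

Each input value moves by at most `eps`, yet the predicted class flips. A detector of the kind surveyed in the paper would aim to flag `x_adv` before it reaches the classifier.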

preprint, 2022, arXiv

CAESR: Conditional Autoencoder and Super-Resolution for Learned Spatial Scalability

In this paper, we present CAESR, a hybrid learning-based coding approach for spatial scalability based on the versatile video coding (VVC) standard. Our framework considers a low-resolution signal encoded with VVC intra-mode as a base-layer (BL), and a deep conditional autoencoder with hyperprior (AE-HP) as an enhancement-layer (EL) model. The EL encoder takes as inputs both the upscaled BL reconstruction and the original image. Our approach relies on conditional coding that learns the optimal mixture of the source and the upscaled BL image, enabling better performance than residual coding. On the decoder side, a super-resolution (SR) module is used to recover high-resolution details and invert the conditional coding process. Experimental results show that our solution is competitive with VVC full-resolution intra coding while being scalable.

preprint, 2022, arXiv

Deep-based Film Grain Removal and Synthesis

In this paper, deep learning-based techniques for film grain removal and synthesis that can be applied in video coding are proposed. Film grain is inherent in analog film content because of the physical process of capturing images and video on film. It can also be present in digital content, where it is purposely added to reflect the era of analog film and to evoke certain emotions in the viewer or enhance the perceived quality. In the context of video coding, the random nature of film grain makes it both difficult to preserve and very expensive to compress. To better preserve it while compressing the content efficiently, film grain is removed and modeled before video encoding and then restored after video decoding. In this paper, a film grain removal model based on an encoder-decoder architecture and a film grain synthesis model based on a conditional generative adversarial network (cGAN) are proposed. Both models are trained on a large dataset of pairs of clean (grain-free) and grainy images. Quantitative and qualitative evaluations of the developed solutions showed that the proposed film grain removal model is effective in filtering film grain at different intensity levels using two configurations: 1) a non-blind configuration, where the film grain level of the grainy input is known and provided as input, and 2) a blind configuration, where the film grain level is unknown. As for the film grain synthesis task, the experimental results show that the proposed model is able to reproduce realistic film grain with a controllable intensity level specified as input.

preprint, 2022, arXiv

Federated Adversarial Training with Transformers

Federated learning (FL) has emerged to enable global model training over distributed clients' data while preserving its privacy. However, the globally trained model is vulnerable to evasion attacks, especially adversarial examples (AEs), which are carefully crafted samples that yield false classifications. Adversarial training (AT) has been found to be the most promising approach against evasion attacks, and it is widely studied for convolutional neural networks (CNNs). Recently, vision transformers have been found to be effective in many computer vision tasks. To the best of the authors' knowledge, no prior work has studied the feasibility of AT in an FL process for vision transformers. This paper investigates such feasibility with different federated model aggregation methods and different vision transformer models with different tokenization and classification head techniques. In order to improve the robust accuracy of the models under non-independent and identically distributed (Non-IID) data, we propose an extension to the FedAvg aggregation method, called FedWAvg. By measuring the similarities between the last layer of the global model and the last layer of the client updates, FedWAvg calculates the weights used to aggregate the local model updates. The experiments show that FedWAvg improves the robust accuracy when compared with other state-of-the-art aggregation methods.
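The aggregation idea described above (weights derived from last-layer similarity) can be sketched as follows. The abstract does not specify the similarity measure or how similarities become weights, so cosine similarity and a softmax are assumptions here, as are the function names and the flat-list model representation. This is an illustration in the spirit of FedWAvg, not the authors' implementation.

```python
import math
from typing import List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax; outputs are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_weighted_average(
    global_model: List[float],
    client_models: List[List[float]],
    last_layer_size: int,
) -> Tuple[List[float], List[float]]:
    """Aggregate flat client parameter vectors, weighting each client by how
    similar its last layer is to the global model's last layer (assumed rule)."""
    g_last = global_model[-last_layer_size:]
    sims = [cosine_similarity(g_last, m[-last_layer_size:]) for m in client_models]
    weights = softmax(sims)
    aggregated = [
        sum(w * m[i] for w, m in zip(weights, client_models))
        for i in range(len(global_model))
    ]
    return aggregated, weights


# Hypothetical flat models; the last two entries play the role of the last layer.
global_model = [0.0, 1.0, 0.0]
clients = [[1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [3.0, -1.0, 0.0]]
aggregated, weights = similarity_weighted_average(global_model, clients, last_layer_size=2)
```

Plain FedAvg would instead weight clients by their sample counts. The sketch only replaces that weighting rule, which is the part the abstract attributes to FedWAvg.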

preprint, 2022, arXiv

OpenVVC: a Lightweight Software Decoder for the Versatile Video Coding Standard

In recent years, user requirements for higher resolution, coupled with the emergence of new multimedia applications, have created the need for a new video coding standard. The new generation video coding standard, called Versatile Video Coding (VVC), has been developed by the Joint Video Experts Team, and offers coding capability beyond the previous generation High Efficiency Video Coding (HEVC) standard. Due to the incorporation of more advanced and complex tools, the decoding complexity of the VVC standard has approximately doubled compared to HEVC. This complexity increase raises new research challenges for achieving live software decoding. In this context, we developed OpenVVC, an open-source software decoder that supports a broad range of VVC functionalities. This paper presents the OpenVVC software architecture and its parallelism strategy, as well as a detailed set of experimental results. By combining extensive data-level parallelism with frame-level parallelism, OpenVVC achieves real-time decoding of UHD video content. Moreover, the memory required by OpenVVC is remarkably low, which is a great advantage for its integration on embedded platforms with low memory resources. The code of the OpenVVC decoder is publicly available at https://github.com/OpenVVC/OpenVVC

preprint, 2022, arXiv

Performance Analysis of Optimized Versatile Video Coding Software Decoders on Embedded Platforms

In recent years, the global demand for high-resolution videos and the emergence of new multimedia applications have created the need for a new video coding standard. Hence, in July 2020 the Versatile Video Coding (VVC) standard was released, providing up to 50% bit-rate saving for the same video quality compared to its predecessor High Efficiency Video Coding (HEVC). However, this bit-rate saving comes at the cost of high computational complexity, particularly for live applications and on resource-constrained embedded devices. This paper presents two optimized VVC software decoders, named OpenVVC and Versatile Video deCoder (VVdeC), designed for low-resource platforms. They exploit optimization techniques such as data-level parallelism using Single Instruction Multiple Data (SIMD) instructions and functional-level parallelism using frame, tile and slice-based parallelism. Furthermore, a comparison of the two decoders in terms of decoding run time, energy and memory consumption is presented, targeting two different resource-constrained embedded devices. The results showed that both decoders achieve real-time decoding of Full High Definition (FHD) resolution on the first platform using 8 cores, and real-time High Definition (HD) decoding on the second platform using only 4 cores, with comparable average consumed energy: around 26 J and 15 J for the 8-core and 4-core embedded platforms, respectively. Regarding memory usage, OpenVVC showed better results, with a lower average maximum memory consumption during run time than VVdeC.

preprint, 2022, arXiv

Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images

The quality of patient care associated with diagnostic radiology is proportionate to a physician's workload. Segmentation is a fundamental limiting precursor to both diagnostic and therapeutic procedures. Advances in machine learning (ML) aim to increase diagnostic efficiency by replacing a single application with generalized algorithms. The goal of unsupervised anomaly detection (UAD) is to identify potential anomalous regions unseen during training, where convolutional neural network (CNN) based autoencoders (AEs) and variational autoencoders (VAEs) are considered a de facto approach for reconstruction-based anomaly segmentation. The restricted receptive field of CNNs limits their ability to model the global context. Hence, if the anomalous regions cover large parts of the image, CNN-based AEs are not capable of bringing a semantic understanding of the image. Meanwhile, vision transformers (ViTs) have emerged as a competitive alternative to CNNs. They rely on the self-attention mechanism, which can relate image patches to each other. In this paper, we investigate Transformer capabilities in building AEs for the reconstruction-based UAD task, in order to reconstruct a coherent and more realistic image. We focus on anomaly segmentation for brain magnetic resonance imaging (MRI) and present five Transformer-based models that achieve segmentation performance comparable or superior to state-of-the-art (SOTA) models. The source code is made publicly available on GitHub: https://github.com/ahmedgh970/Transformers_Unsupervised_Anomaly_Segmentation.git.

preprint, 2021, arXiv

Light Field Image Coding Using VVC standard and View Synthesis based on Dual Discriminator GAN

Light field (LF) technology is considered a promising way of providing high-quality virtual reality (VR) content. However, such an imaging technology produces a large amount of data, requiring efficient LF image compression solutions. In this paper, we propose an LF image coding method based on view synthesis and view quality enhancement techniques. Instead of transmitting all the LF views, only a sparse set of reference views is encoded and transmitted, while the remaining views are synthesized at the decoder side. The transmitted views are encoded using the versatile video coding (VVC) standard and are used as reference views to synthesize the dropped views. The selection of non-reference dropped views is performed using a rate-distortion optimization based on the VVC temporal scalability. The dropped views are reconstructed using the LF dual discriminator GAN (LF-D2GAN) model. In addition, to ensure that the quality of the views is consistent, a quality enhancement procedure is performed at the decoder on the reconstructed views, allowing smooth navigation across views. Experimental results show that the proposed method provides high coding performance and outperforms the state-of-the-art LF image compression methods, with -36.22% in terms of BD-BR and 1.35 dB in BD-PSNR. The web page of this work is available at https://naderbakir79.github.io/LFD2GAN.html.

preprint, 2021, arXiv

Selective Encryption of the Versatile Video Coding Standard

Versatile video coding (VVC) is the next generation video coding standard developed by the joint video experts team (JVET) and released in July 2020. VVC introduces several new coding tools providing a significant coding gain over the high efficiency video coding (HEVC) standard. It is well known that increasing the coding efficiency adds more dependencies in the video bitstream, making format-compliant encryption more challenging. In this paper, we tackle the problem of selective encryption of VVC bitstreams that is both format-compliant and constant-bitrate. These two constraints ensure that the encrypted bitstream can be decoded by any VVC decoder while the bitrate remains unchanged by the encryption. The selective encryption of all possible VVC syntax elements is investigated. A new algorithm is proposed to encrypt, in a format-compliant and constant-bitrate manner, the transform coefficients (TCs) together with other syntax elements at the level of the entropy encoder. The proposed solution was integrated and assessed under the VVC reference software model version 6.0. Experimental results showed that the encryption drastically decreases the video quality while remaining robust against several types of attacks. The encryption space is estimated to be in the range of 15% to 26% of the bitstream size, resulting in a lightweight encryption process. The web page of this work is available at https://gugautie.github.io/sevvc/.
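The two constraints can be illustrated with a toy example that toggles the signs of non-zero transform coefficients under a keystream. This is an assumption made for illustration: sign flags are a common target in selective encryption because each one occupies a single bin, so flipping it changes neither the syntax nor the length of the coded data. The abstract does not state which syntax elements the proposed algorithm encrypts or how. The function names are hypothetical, and `random.Random` stands in for a real stream cipher; it is not cryptographically secure.

```python
import random
from typing import List


def toggle_signs(coeffs: List[int], key: int) -> List[int]:
    """Flip the sign of each non-zero coefficient when its keystream bit is 1.

    Zero coefficients carry no sign, so they consume no keystream bit. The
    position of zeros and all magnitudes are unchanged, which is what keeps
    this toy scheme 'format-compliant' and 'constant-size'.
    """
    # Stand-in keystream generator (NOT secure); a real system would use a cipher.
    rng = random.Random(key)
    out = []
    for c in coeffs:
        if c != 0 and rng.getrandbits(1):
            out.append(-c)
        else:
            out.append(c)
    return out


# Hypothetical block of quantized transform coefficients.
coeffs = [12, -3, 0, 5, 0, -1, 2, 0]
encrypted = toggle_signs(coeffs, key=2020)
decrypted = toggle_signs(encrypted, key=2020)  # same keystream undoes the flips
```

Because the zero pattern and magnitudes survive encryption, the same keystream bits line up with the same coefficients at the decoder, so applying the function twice with the same key restores the original block.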

preprint, 2020, arXiv

A Fixation-based 360° Benchmark Dataset for Salient Object Detection

Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications. However, another issue within the field of visual saliency, salient object detection (SOD), has seldom been explored in 360° (or omnidirectional) images due to the lack of datasets representative of real scenes with pixel-level annotations. Toward this end, we collect 107 equirectangular panoramas with challenging scenes and multiple object classes. Based on the consistency between FP and explicit saliency judgements, we further manually annotate 1,165 salient objects over the collected images with precise masks under the guidance of real human eye fixation maps. Six state-of-the-art SOD models are then benchmarked on the proposed fixation-based 360° image dataset (F-360iSOD), by applying a multiple cubic projection-based fine-tuning method. Experimental results show a limitation of the current methods when used for SOD in panoramic images, which indicates that the proposed dataset is challenging. Key issues for 360° SOD are also discussed. The proposed dataset is available at https://github.com/PanoAsh/F-360iSOD.

preprint, 2020, arXiv

Binary Probability Model for Learning Based Image Compression

In this paper, we propose to enhance learned image compression systems with a richer probability model for the latent variables. Previous works model the latents with a Gaussian or a Laplace distribution. Inspired by binary arithmetic coding, we propose to signal the latents with three binary values and one integer, with different probability models. A relaxation method is designed to perform gradient-based training. The richer probability model results in better entropy coding, leading to a lower rate. Experiments under the Challenge on Learned Image Compression (CLIC) test conditions demonstrate that this method achieves 18% rate saving compared to Gaussian or Laplace models.

preprint, 2020, arXiv

Extending 2D Saliency Models for Head Movement Prediction in 360-degree Images using CNN-based Fusion

Saliency prediction can be of great benefit for 360-degree image/video applications, including compression, streaming, rendering and viewpoint guidance. It is therefore quite natural to adapt 2D saliency prediction methods to 360-degree images. To achieve this, it is necessary to project the 360-degree image onto a 2D plane. However, the existing projection techniques introduce different distortions, which lead to poor results and make the direct application of 2D saliency prediction models to 360-degree content inefficient. Consequently, in this paper, we propose a new framework for effectively applying any 2D saliency prediction method to 360-degree images. In particular, the proposed framework includes a novel convolutional neural network based fusion approach that provides more accurate saliency prediction while avoiding the introduction of distortions. The proposed framework has been evaluated with five 2D saliency prediction methods, and the experimental results showed the superiority of our approach over weighted sum and pixel-wise maximum fusion methods.
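For reference, the two baseline fusion rules named at the end of the abstract can be sketched in a few lines. This assumes the maps being fused are same-sized 2D saliency maps with values in [0, 1], for example predictions obtained from different projections of the same content. The function names and toy maps are hypothetical, and the learned CNN-based fusion itself is not reproduced here.

```python
from typing import List

Map2D = List[List[float]]


def weighted_sum_fusion(maps: List[Map2D], weights: List[float]) -> Map2D:
    """Per-pixel weighted average of several saliency maps."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) / total for c in range(cols)]
        for r in range(rows)
    ]


def pixelwise_max_fusion(maps: List[Map2D]) -> Map2D:
    """Per-pixel maximum over several saliency maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[max(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


# Two hypothetical 2x2 saliency maps of the same region.
map_a = [[0.2, 0.8], [0.0, 0.4]]
map_b = [[0.6, 0.4], [1.0, 0.0]]

fused_sum = weighted_sum_fusion([map_a, map_b], weights=[0.5, 0.5])
fused_max = pixelwise_max_fusion([map_a, map_b])
```

Both rules are fixed and content-independent, which is the limitation a learned fusion network is meant to address.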

preprint, 2020, arXiv

Light Field Image Coding Using Dual Discriminator Generative Adversarial Network and VVC Temporal Scalability

Light field (LF) technology represents a viable path for providing high-quality VR content. However, such an imaging system generates a large amount of data, leading to an urgent need for LF image compression solutions. In this paper, we propose an efficient LF image coding scheme based on view synthesis. Instead of transmitting all the LF views, only some of them are coded and transmitted, while the remaining views are dropped. The transmitted views are coded using Versatile Video Coding (VVC) and used as reference views to synthesize the missing views at the decoder side. The dropped views are generated using an efficient dual discriminator GAN model. The selection of reference/dropped views is performed using a rate-distortion optimization based on the VVC temporal scalability. Experimental results show that the proposed method provides high coding performance and outperforms the state-of-the-art LF image compression solutions.

preprint, 2020, arXiv

ModeNet: Mode Selection Network For Learned Video Coding

In this paper, a mode selection network (ModeNet) is proposed to enhance deep learning-based video compression. Inspired by traditional video coding, the purpose of ModeNet is to enable competition among several coding modes. The proposed ModeNet learns and conveys a pixel-wise partitioning of the frame, used to assign each pixel to the most suited coding mode. ModeNet is trained alongside the different coding modes to minimize a rate-distortion cost. It is a flexible component which can be generalized to other systems to allow competition between different coding tools. The benefit of ModeNet is studied on a P-frame coding task, where it is used to design a method for coding a frame given its prediction. ModeNet-based systems achieve compelling performance when evaluated under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding track conditions.

preprint, 2020, arXiv

Optical Flow and Mode Selection for Learning-based Video Coding

This paper introduces a new method for inter-frame coding based on two complementary autoencoders: MOFNet and CodecNet. MOFNet aims at computing and conveying the Optical Flow and a pixel-wise coding Mode selection. The optical flow is used to perform a prediction of the frame to code. The coding mode selection enables competition between direct copy of the prediction and transmission through CodecNet. The proposed coding scheme is assessed under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding conditions, where it is shown to perform on par with the state-of-the-art video codec ITU/MPEG HEVC. Moreover, the possibility of copying the prediction makes it possible to learn the optical flow in an end-to-end fashion, i.e., without relying on pre-training and/or a dedicated loss term.
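The pixel-wise competition between the two modes can be sketched as a per-pixel blend controlled by a mask. The blending rule, the name `alpha`, and the toy frames below are assumptions made for illustration; the abstract states only that each pixel is assigned either to a copy of the prediction or to transmission through CodecNet.

```python
from typing import List

Frame = List[List[float]]


def select_modes(prediction: Frame, codec_output: Frame, alpha: Frame) -> Frame:
    """Combine two candidate reconstructions pixel by pixel.

    alpha[r][c] = 1 copies the prediction (nothing sent for that pixel);
    alpha[r][c] = 0 uses the value transmitted through the codec.
    Intermediate values give a soft blend, which keeps the choice
    differentiable during training (an assumed design, not from the source).
    """
    return [
        [a * p + (1.0 - a) * q for p, q, a in zip(p_row, q_row, a_row)]
        for p_row, q_row, a_row in zip(prediction, codec_output, alpha)
    ]


# Hypothetical 1x2 frame: copy the first pixel, transmit the second.
prediction = [[10.0, 20.0]]
codec_output = [[30.0, 40.0]]
alpha = [[1.0, 0.0]]
reconstruction = select_modes(prediction, codec_output, alpha)
```

Pixels where the prediction is already good cost no residual rate, which is the motivation for letting the modes compete per pixel.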

preprint, 2020, arXiv

Quality-Driven Dynamic VVC Frame Partitioning for Efficient Parallel Processing

VVC is the next generation video coding standard, offering coding capability beyond the HEVC standard. The high computational complexity of the latest video coding standards requires high-level parallelism techniques in order to achieve real-time and low-latency encoding and decoding. HEVC and VVC include tile grid partitioning, which allows rectangular regions of a frame to be processed simultaneously by independent threads. The tile grid may be further partitioned into a horizontal sub-grid of Rectangular Slices (RSs), increasing the partitioning flexibility. The dynamic Tile and Rectangular Slice (TRS) partitioning solution proposed in this paper benefits from this flexibility. The TRS partitioning is carried out at the frame level, taking into account both the spatial texture of the content and the encoding times of previously encoded frames. The proposed solution searches for the partitioning configuration that offers the best trade-off between multi-thread encoding time and encoding quality loss. Experiments show that the proposed solution, compared to uniform TRS partitioning, significantly decreases multi-thread encoding time, with slightly better encoding quality.

preprint, 2020, arXiv

Quality-driven Variable Frame-Rate for Green Video Coding in Broadcast Applications

The Digital Video Broadcasting (DVB) project has proposed to introduce Ultra-High Definition services in three phases: UHD-1 phase 1, UHD-1 phase 2 and UHD-2. The UHD-1 phase 2 specification includes several new features such as High Dynamic Range (HDR) and High Frame-Rate (HFR). It has been shown in several studies that HFR (100+ fps) enhances the perceptual quality and that this quality enhancement is content-dependent. On the other hand, HFR brings several challenges to the transmission chain, including increased codec complexity and bit-rate overhead, which may delay or even prevent its deployment in the broadcast ecosystem. In this paper, we propose a Variable Frame Rate (VFR) solution to determine the minimum (critical) frame-rate that preserves the perceived video quality of HFR video. The frame-rate determination is modeled as a 3-class classification problem, which consists in dynamically and locally selecting one frame-rate among three: 30, 60 and 120 frames per second. Two random forest classifiers are trained with a ground truth carefully built by experts for this purpose. Subjective tests conducted on ten HFR video contents, not included in the training set, clearly show the efficiency of the proposed solution, which locally determines the lowest possible frame-rate while preserving the quality of the HFR content. Moreover, our VFR solution enables significant bit-rate savings and complexity reductions at both encoder and decoder sides.

preprint, 2020, arXiv

Versatile video coding and super-resolution for efficient delivery of 8K video with 4K backward-compatibility

In this paper, we compare and evaluate, through an objective study, the performance of different coding approaches allowing the delivery of an 8K video signal with 4K backward-compatibility on broadcast networks. The presented approaches include simulcast of 8K and 4K single-layer signals encoded using the High-Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards, spatial scalability using SHVC with a 4K base layer (BL) and an 8K enhancement-layer (EL), and super-resolution applied to the 4K VVC signal after decoding to reach 8K resolution. For up-scaling, we selected the deep-learning-based super-resolution method called Super-Resolution with Feedback Network (SRFBN) and the Lanczos interpolation filter. We show that the deep-learning-based approach achieves a visual quality gain over simulcast, especially at bit-rates lower than 30 Mb/s, with average gains of 0.77 dB, 0.015, and 7.97 for PSNR, SSIM, and VMAF, respectively, and outperforms the Lanczos filter with an average BD-rate saving of 29%.
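As a reference point for the Lanczos baseline, here is a minimal 1D sketch of the standard Lanczos kernel and a 2x upsampler built on it. The kernel definition is the textbook one. The window size `a=3`, the sample alignment (output sample i placed at input coordinate i/2), the edge clamping, and the weight normalization are assumptions of this sketch; the abstract does not give the filter configuration used in the study.

```python
import math
from typing import List


def lanczos_kernel(x: float, a: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def upsample_2x(signal: List[float], a: int = 3) -> List[float]:
    """Double the sampling rate of a 1D signal with Lanczos interpolation."""
    n = len(signal)
    out = []
    for i in range(2 * n):
        center = i / 2.0                     # position in input coordinates
        first = math.floor(center) - a + 1   # leftmost tap of the 2a-wide window
        acc, weight_sum = 0.0, 0.0
        for k in range(first, first + 2 * a):
            w = lanczos_kernel(center - k, a)
            sample = signal[min(max(k, 0), n - 1)]  # clamp indices at the edges
            acc += w * sample
            weight_sum += w
        out.append(acc / weight_sum)         # normalize so flat areas stay flat
    return out


# A constant signal should remain constant after upsampling.
flat = upsample_2x([5.0, 5.0, 5.0, 5.0])
ramp = upsample_2x([0.0, 1.0, 2.0, 3.0])
```

A 2D image upscaler applies the same filter separably along rows and then columns. A learned method such as SRFBN replaces this fixed kernel with a trained network.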
