Source author record

Victor Sanchez

Victor Sanchez appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, Multimedia, eess.IV, Machine Learning, Artificial Intelligence, Data Structures and Algorithms, eess.AS, Sound

Catalog footprint

What is connected

15 works

8 topics

4 close collaborators

preprint · 2026 · arXiv

OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and showcase its value through an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

preprint · 2026 · arXiv

TAPM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection

Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion from word-level and sentence-level embeddings. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.

preprint · 2024 · arXiv

Cross-Age Contrastive Learning for Age-Invariant Face Recognition

Cross-age facial images are typically challenging and expensive to collect, making noise-free age-oriented datasets relatively small compared to widely-used large-scale facial datasets. Additionally, in real scenarios, images of the same subject at different ages are usually hard or even impossible to obtain. Both of these factors lead to a lack of supervised data, which limits the versatility of supervised methods for age-invariant face recognition, a critical task in applications such as security and biometrics. To address this issue, we propose a novel semi-supervised learning approach named Cross-Age Contrastive Learning (CACon). Thanks to the identity-preserving power of recent face synthesis models, CACon introduces a new contrastive learning method that leverages an additional synthesized sample from the input image. We also propose a new loss function in association with CACon to perform contrastive learning on a triplet of samples. We demonstrate that our method not only achieves state-of-the-art performance in homogeneous-dataset experiments on several age-invariant face recognition benchmarks but also outperforms other methods by a large margin in cross-dataset experiments.

preprint · 2022 · arXiv

Spectral-PQ: A Novel Spectral Sensitivity-Orientated Perceptual Compression Technique for RGB 4:4:4 Video Data

There exists an intrinsic relationship between the spectral sensitivity of the Human Visual System (HVS) and colour perception; these intertwined phenomena are often overlooked in perceptual compression research. In general, most previously proposed visually lossless compression techniques exploit luminance (luma) masking including luma spatiotemporal masking, luma contrast masking and luma texture/edge masking. The perceptual relevance of color in a picture is often overlooked, which constitutes a gap in the literature. With regard to the spectral sensitivity phenomenon of the HVS, the color channels of raw RGB 4:4:4 data contain significant color-based psychovisual redundancies. These perceptual redundancies can be quantized via color channel-level perceptual quantization. In this paper, we propose a novel spatiotemporal visually lossless coding method named Spectral Perceptual Quantization (Spectral-PQ). With application for RGB 4:4:4 video data, Spectral-PQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; the proposed method operates at the Coding Block (CB) level and the Prediction Unit (PU) level in the HEVC standard. Spectral-PQ perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), Spectral-PQ considerably reduces bitrates with a maximum reduction of approximately 81%. The Mean Opinion Scores (MOS) in the subjective evaluations show that Spectral-PQ successfully achieves perceptually lossless quality.
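As a rough illustration of the CB-level adjustment described in this abstract, the sketch below raises the quantization step size when a block has high spatial variance or when the associated motion is large. The QP-to-QStep relation is the standard HEVC one; the function names, the two thresholds and the scaling factor are hypothetical values chosen for illustration and are not taken from the paper.

```python
import numpy as np


def qstep_from_qp(qp: float) -> float:
    """Standard HEVC relation: the step size doubles every 6 QP units."""
    return 2.0 ** ((qp - 4.0) / 6.0)


def adjust_qstep(block, mv_magnitude: float, base_qp: int,
                 var_threshold: float = 100.0,   # hypothetical threshold
                 mv_threshold: float = 8.0,      # hypothetical threshold
                 scale: float = 1.25) -> float:  # hypothetical scaling factor
    """Coarsen QStep for blocks with high variance and/or high motion."""
    qstep = qstep_from_qp(base_qp)
    # High spatial variance is assumed to mask quantization noise.
    if float(np.var(np.asarray(block, dtype=np.float64))) > var_threshold:
        qstep *= scale
    # Large motion vectors are assumed to provide temporal masking.
    if mv_magnitude > mv_threshold:
        qstep *= scale
    return qstep


flat = np.full((8, 8), 128.0)
busy = np.tile([[0.0, 255.0], [255.0, 0.0]], (4, 4))
print(adjust_qstep(flat, 0.0, 22), adjust_qstep(busy, 16.0, 22))
```

In a real encoder the adjustment would be applied per color channel (G, B and R) and signalled through the bitstream; this sketch only shows the decision logic for a single block.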

preprint · 2022 · arXiv

Video Anomaly Detection via Prediction Network with Enhanced Spatio-Temporal Memory Exchange

Video anomaly detection is a challenging task because most anomalies are scarce and non-deterministic. Many approaches investigate the reconstruction difference between normal and abnormal patterns, but neglect that anomalies do not necessarily correspond to large reconstruction errors. To address this issue, we design a Convolutional LSTM Auto-Encoder prediction framework with enhanced spatio-temporal memory exchange using bi-directionality and a higher-order mechanism. The bi-directional structure promotes learning the temporal regularity through forward and backward predictions. The unique higher-order mechanism further strengthens spatial information interaction between the encoder and the decoder. Considering the limited receptive fields in Convolutional LSTMs, we also introduce an attention module to highlight informative features for prediction. Anomalies are eventually identified by comparing the frames with their corresponding predictions. Evaluations on three popular benchmarks show that our framework outperforms most existing prediction-based anomaly detection methods.
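The final step of this abstract, comparing frames with their predictions, can be sketched as a per-frame error score. The abstract does not say which error measure the authors use; the mean squared error and the min-max normalisation below are assumptions of this sketch, and `anomaly_scores` is a hypothetical name.

```python
import numpy as np


def anomaly_scores(frames, predictions):
    """Per-frame anomaly score in [0, 1] from prediction error.

    frames, predictions: arrays of shape (T, H, W) or (T, H, W, C).
    """
    frames = np.asarray(frames, dtype=np.float64)
    predictions = np.asarray(predictions, dtype=np.float64)
    # Mean squared error per frame, averaged over all non-time axes.
    spatial_axes = tuple(range(1, frames.ndim))
    errors = np.mean((frames - predictions) ** 2, axis=spatial_axes)
    span = errors.max() - errors.min()
    if span == 0.0:
        # All frames are predicted equally well: nothing stands out.
        return np.zeros_like(errors)
    # Min-max normalise so the worst-predicted frame scores 1.
    return (errors - errors.min()) / span


frames = np.zeros((3, 4, 4))
predictions = np.zeros((3, 4, 4))
predictions[2] += 0.5  # the third frame is predicted poorly
print(anomaly_scores(frames, predictions))
```

A frame would then be flagged as anomalous when its score exceeds a threshold chosen on validation data.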

preprint · 2022 · arXiv

Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Many existing multimodal approaches process each modality using modality-specific models and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent detailed information about the underlying signal. We use heterogeneous graphs to address the task of visually-aware acoustic event classification, as they offer a compact, efficient and scalable way to represent data. Through heterogeneous graphs, we efficiently model intra- and inter-modality relationships at both spatial and temporal scales. Our model can easily be adapted to different scales of events through relevant hyperparameters. Experiments on AudioSet, a large benchmark, show that our model achieves state-of-the-art performance.

preprint · 2021 · arXiv

HVS-Based Perceptual Color Compression of Image Data

In perceptual image coding applications, the main objective is to decrease, as much as possible, Bits Per Pixel (BPP) while avoiding noticeable distortions in the reconstructed image. In this paper, we propose a novel perceptual image coding technique, named Perceptual Color Compression (PCC). PCC is based on a novel model related to Human Visual System (HVS) spectral sensitivity and CIELAB Just Noticeable Color Difference (JNCD). We utilize this modeling to capitalize on the inability of the HVS to perceptually differentiate photons in very similar wavelength bands (e.g., distinguishing very similar shades of a particular color or different colors that look similar). The proposed PCC technique can be used with RGB (4:4:4) image data of various bit depths and spatial resolutions. In the evaluations, we compare the proposed PCC technique with a set of reference methods including Versatile Video Coding (VVC) and High Efficiency Video Coding (HEVC) in addition to two other recently proposed algorithms. Our PCC method attains considerable BPP reductions compared with all four reference techniques including, on average, 52.6% BPP reductions compared with VVC (VVC in All Intra still image coding mode). Regarding image perceptual reconstruction quality, PCC achieves a score of SSIM = 0.99 in all tests in addition to a score of MS-SSIM = 0.99 in all but one test. Moreover, MOS = 5 is attained in 75% of subjective evaluation assessments conducted.
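To make the idea of a just noticeable color difference concrete, the sketch below computes the CIE76 color difference, the Euclidean distance between two CIELAB colors, and compares it with a threshold. This is not the PCC model from the abstract, which is not specified here; the threshold of 2.3 is a commonly quoted rule of thumb used as an assumption, and both function names are hypothetical.

```python
import math


def delta_e_cie76(lab1, lab2) -> float:
    """CIE76 color difference: Euclidean distance in (L*, a*, b*)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))


def is_noticeable(lab1, lab2, jncd: float = 2.3) -> bool:
    """True if the difference exceeds the assumed JNCD threshold."""
    return delta_e_cie76(lab1, lab2) > jncd


reference = (50.0, 0.0, 0.0)
print(delta_e_cie76(reference, (50.0, 3.0, 4.0)))  # distance of a 3-4-5 triangle
print(is_noticeable(reference, (50.0, 1.0, 1.0)))
```

Under this reading, a codec could replace a color with any nearby color whose difference stays below the threshold without a visible change, which is the kind of redundancy the abstract refers to.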

preprint · 2021 · arXiv

Improving Face-Based Age Estimation with Attention-Based Dynamic Patch Fusion

With the increasing popularity of convolutional neural networks (CNNs), recent works on face-based age estimation employ these networks as the backbone. However, state-of-the-art CNN-based methods treat each facial region equally, thus entirely ignoring the importance of some facial patches that may contain rich age-specific information. In this paper, we propose a face-based age estimation framework, called Attention-based Dynamic Patch Fusion (ADPF). In ADPF, two separate CNNs are implemented, namely the AttentionNet and the FusionNet. The AttentionNet dynamically locates and ranks age-specific patches by employing a novel Ranking-guided Multi-Head Hybrid Attention (RMHHA) mechanism. The FusionNet uses the discovered patches along with the facial image to predict the age of the subject. Since the proposed RMHHA mechanism ranks the discovered patches based on their importance, the length of the learning path of each patch in the FusionNet is proportional to the amount of information it carries (the longer, the more important). ADPF also introduces a novel diversity loss to guide the training of the AttentionNet and reduce the overlap among patches so that the diverse and important patches are discovered. Through extensive experiments, we show that our proposed framework outperforms state-of-the-art methods on several age estimation benchmark datasets.

preprint · 2020 · arXiv

Ensemble Network for Ranking Images Based on Visual Appeal

We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that appeal to humans are an image's emotional content, aesthetics and image quality. We propose a network which is an ensemble of three information channels, each predicting a score corresponding to one of the three visual appeal factors. For group emotion estimation, we propose a convolutional neural network (CNN) based architecture for predicting group emotion from images. This new architecture forces the network to emphasize the important regions in the images, and achieves results comparable to the state of the art. Next, we develop a network for the image ranking task that combines group emotion, aesthetics and image quality scores. Owing to the unavailability of suitable databases, we created a new database of manually annotated group photos taken during various social events. We present experimental results on this database and other benchmark databases whenever available. Overall, our experiments show that the proposed framework can reliably predict the overall appeal of images with results closely corresponding to human ranking.

preprint · 2020 · arXiv

Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras

We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target multi-camera tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of multi-camera pedestrian trajectories from a network of 15 synchronized cameras. To accurately label this large dataset (600 hours of video footage), we also develop a semi-automated annotation method. An effective MCTF model should proactively anticipate where and when a person will re-appear in the camera network. In this paper, we consider the task of predicting the next camera in which a pedestrian will re-appear after leaving the view of another camera, and present several baseline approaches for this. The labeled database is available online: https://github.com/olly-styles/Multi-Camera-Trajectory-Forecasting.

preprint · 2020 · arXiv

Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments

This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on object trajectory forecasting which primarily consider the problem from a bird's-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. Citywalks comprises footage recorded in 21 cities from 10 European countries in a variety of weather conditions and over 3.5k unique pedestrian trajectories. For evaluation, we adapt existing trajectory forecasting methods for MOF and confirm cross-dataset generalizability on the MOT-17 dataset without fine-tuning. Finally, we present STED, a novel encoder-decoder architecture for MOF. STED combines visual and temporal features to model both object-motion and ego-motion, and outperforms existing approaches for MOF. Code & dataset link: https://github.com/olly-styles/Multiple-Object-Forecasting

preprint · 2020 · arXiv

Spatiotemporal Adaptive Quantization for the Perceptual Video Coding of RGB 4:4:4 Data

Due to the spectral sensitivity phenomenon of the Human Visual System (HVS), the color channels of raw RGB 4:4:4 sequences contain significant psychovisual redundancies; these redundancies can be perceptually quantized. The default quantization systems in the HEVC standard are known as Uniform Reconstruction Quantization (URQ) and Rate Distortion Optimized Quantization (RDOQ); URQ and RDOQ are not perceptually optimized for the coding of RGB 4:4:4 video data. In this paper, we propose a novel spatiotemporal perceptual quantization technique named SPAQ. With application for RGB 4:4:4 video data, SPAQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; SPAQ operates at the Coding Block (CB) level and the Prediction Unit (PU) level. The proposed technique perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), SPAQ considerably reduces bitrates with a maximum reduction of approximately 80%. The Mean Opinion Scores (MOS) in the subjective evaluations, together with the SSIM scores, show that SPAQ successfully achieves perceptually lossless compression compared with anchors.

preprint · 2016 · arXiv

Color-Based Coding Unit Level Adaptive Quantization for HEVC

HEVC HM 16 includes a Coding Unit (CU) level perceptual quantization technique named AdaptiveQP. AdaptiveQP adjusts the Quantization Parameter (QP) at the CU level based on the spatial activity of samples in the four constituent NxN sub-blocks of the luma Coding Block (CB), which is contained within a 2Nx2N CU. In this paper, we propose C-BAQ, which, in contrast to AdaptiveQP, adjusts the CU level QP according to the spatial activity of samples in the four constituent NxN sub-blocks of both the luma and chroma CBs. By computing the sum of luma, chroma Cb and chroma Cr spatial activity in a CU, a richer reflection of spatial activity in the CU is attained. Therefore, a more appropriate CU level QP can be selected, thus leading to important improvements in terms of coding efficiency. We evaluate the proposed technique in HEVC HM 16.7 using 4:4:4, 4:2:2 and 4:2:0 YCbCr sequences. Both subjective and objective evaluations are undertaken during which we compare C-BAQ with AdaptiveQP. The objective evaluation reveals that C-BAQ attains a maximum BD-Rate reduction of 15.9% (Y), 13.1% (Cr) and 16.1% (Cb) in addition to a maximum decoding time reduction of 11.0%.
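The difference between a luma-only activity measure and the summed luma and chroma measure described in this abstract can be sketched as follows. The abstract says activity is computed over the four constituent NxN sub-blocks of each CB but does not give the formula; taking one plus the minimum sub-block variance is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np


def block_activity(block) -> float:
    """Spatial activity of one 2Nx2N coding block from its four NxN sub-blocks.

    Assumption: activity = 1 + minimum variance over the four sub-blocks.
    """
    block = np.asarray(block, dtype=np.float64)
    half_h, half_w = block.shape[0] // 2, block.shape[1] // 2
    sub_blocks = [
        block[:half_h, :half_w], block[:half_h, half_w:],
        block[half_h:, :half_w], block[half_h:, half_w:],
    ]
    return 1.0 + min(float(np.var(s)) for s in sub_blocks)


def cu_activity(luma, chroma_cb, chroma_cr) -> float:
    """Sum of luma, chroma Cb and chroma Cr activity for one CU."""
    return (block_activity(luma)
            + block_activity(chroma_cb)
            + block_activity(chroma_cr))


flat_luma = np.full((8, 8), 128.0)
textured_chroma = np.tile([[0.0, 2.0], [2.0, 0.0]], (4, 4))
print(block_activity(flat_luma))                               # luma only
print(cu_activity(flat_luma, textured_chroma, textured_chroma))  # luma + chroma
```

In this toy case the luma block is flat, so a luma-only measure sees no activity, while the summed measure picks up the chroma texture. An encoder would then map the activity to a CU-level QP offset, a step this sketch leaves out.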

preprint · 2016 · arXiv

Context Trees: Augmenting Geospatial Trajectories with Context

Exposing latent knowledge in geospatial trajectories has the potential to provide a better understanding of the movements of individuals and groups. Motivated by such a desire, this work presents the context tree, a new hierarchical data structure that summarises the context behind user actions in a single model. We propose a method for context tree construction that augments geospatial trajectories with land usage data to identify such contexts. Through evaluation of the construction method and analysis of the properties of generated context trees, we demonstrate the foundation that context trees afford for understanding and modelling behaviour. Summarising user contexts into a single data structure gives easy access to information that would otherwise remain latent, providing the basis for better understanding and predicting the actions and behaviours of individuals and groups. Finally, we also present a method for pruning context trees, for use in applications where it is desirable to reduce the size of the tree while retaining useful information.

preprint · 2016 · arXiv

Minimizing Compression Artifacts for High Resolutions with Adaptive Quantization Matrices for HEVC

Visual Display Units (VDUs), capable of displaying video data at High Definition (HD) and Ultra HD (UHD) resolutions, are frequently employed in a variety of technological domains. Quantization-induced video compression artifacts, which are usually unnoticeable in low resolution environments, are typically conspicuous on high resolution VDUs and video data. The default quantization matrices (QMs) in HEVC do not take into account specific display resolutions of VDUs or video data to determine the appropriate levels of quantization required to reduce unwanted compression artifacts. Therefore, we propose a novel, adaptive quantization matrix technique for the HEVC standard including Scalable HEVC (SHVC). Our technique, which is based on a refinement of the current QM technique in HEVC, takes into consideration specific display resolutions of the target VDUs in order to minimize compression artifacts. We undertake a thorough evaluation of the proposed technique by utilizing SHVC SHM 9.0 (two-layered bit-stream) and the BD-Rate and SSIM metrics. For the BD-Rate evaluation, the proposed method achieves maximum BD-Rate reductions of 56.5% in the enhancement layer. For the SSIM evaluation, our technique achieves a maximum structural improvement of 0.8660 vs. 0.8538.
