Source author record

Dick Botteldooren

Dick Botteldooren appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Sound, Artificial Intelligence, eess.AS, eess.IV, Machine Learning, Multimedia, Neural and Evolutionary Computing, Neurons and Cognition

Catalog footprint

What is connected

4 works

8 topics

4 close collaborators

Preprint · 2026 · arXiv

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings, where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
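The gating idea in this abstract, invoking the expensive model only when energy departs from a learned baseline, can be illustrated with a much simpler stand-in. The sketch below is not OWM: it assumes a plain exponential moving mean and variance of per-frame energy, and the function name, `alpha`, `k` and `warmup` are all hypothetical choices made for illustration.

```python
def salience_gate(energies, alpha=0.1, k=3.0, warmup=5):
    """Return indices of frames whose energy departs sharply from a running baseline.

    Hypothetical stand-in for salience gating, not the OWM from the abstract.
    Assumes `energies` is a non-empty list of per-frame energy values.
    """
    mean = energies[0]   # exponential moving average of frame energy
    var = 0.0            # exponential moving variance around that average
    triggered = []
    for i in range(1, len(energies)):
        deviation = energies[i] - mean
        # The threshold adapts: it scales with the recent spread of the signal.
        if i >= warmup and abs(deviation) > k * var ** 0.5:
            triggered.append(i)  # a real system would invoke the ALM here
        # Update the baseline after the check so the frame is judged against the past.
        mean += alpha * deviation
        var = (1.0 - alpha) * (var + alpha * deviation ** 2)
    return triggered


# Repeating low-energy background with one loud frame at index 20.
energies = [1.0, 1.1, 0.9] * 10
energies[20] = 5.0
hits = salience_gate(energies)
```

In this toy input the regular background fluctuation stays under the adaptive threshold, so only the loud frame is passed on for further processing.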

Preprint · 2022 · arXiv

Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion

Previous work on scene classification is mainly based on either audio or visual signals, whereas humans perceive environmental scenes through multiple senses. Recent studies on audio-visual scene classification (AVSC) separately fine-tune large-scale pre-trained audio and image models on the target dataset, then either fuse the intermediate representations of the audio and visual models or fuse the coarse-grained decisions of both models at the clip level. Such methods ignore the detailed audio events and visual objects in audio-visual scenes (AVS), even though humans often identify scenes through the audio events and visual objects within them and the congruence between the two. To exploit the fine-grained information of audio events and visual objects in AVS, and to coordinate the implicit relationship between them, this paper proposes a multi-branch model equipped with contrastive event-object alignment (CEOA) and semantic-based fusion (SF) for AVSC. CEOA aims to align the learned embeddings of audio events and visual objects by comparing the difference between audio-visual event-object pairs. Then, visual objects associated with certain audio events, and vice versa, are accentuated by cross-attention and undergo SF for semantic-level fusion. Experiments show that: 1) the proposed AVSC model equipped with CEOA and SF outperforms audio-only and visual-only models, i.e., the audio-visual results are better than those from a single modality; 2) CEOA aligns the embeddings of audio events and related visual objects at a fine-grained level, and SF effectively integrates both; 3) compared with other large-scale integrated systems, the proposed model shows competitive performance, even without additional datasets or data augmentation tricks.
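The abstract does not give the CEOA objective, so the following is only a generic illustration of contrastive alignment between paired embeddings. It swaps in a symmetric InfoNCE-style loss, a common choice for this kind of alignment, and assumes that the i-th audio-event embedding is paired with the i-th visual-object embedding. The function names and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_alignment_loss(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of audio_emb matches pair i of visual_emb."""
    sim = [[cosine(a, v) / temperature for v in visual_emb] for a in audio_emb]

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_denominator = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += log_denominator - row[i]  # negative log-probability of the true pair
        return total / len(matrix)

    transposed = [list(column) for column in zip(*sim)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))


audio = [[1.0, 0.0], [0.0, 1.0]]
visual_matched = [[0.9, 0.1], [0.1, 0.9]]
visual_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = contrastive_alignment_loss(audio, visual_matched)
loss_swapped = contrastive_alignment_loss(audio, visual_swapped)
```

With these toy vectors the loss is lower when each audio embedding sits close to its own visual partner than when the partners are swapped, which is the behavior an alignment objective of this kind is meant to reward.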

Preprint · 2022 · arXiv

Axonal Delay As a Short-Term Memory for Feed Forward Deep Spiking Neural Networks

Information in spiking neural networks (SNNs) is propagated between adjacent biological neurons by spikes, which provides a computing paradigm with the promise of simulating the human brain. Recent studies have found that the time delay of neurons plays an important role in the learning process. Therefore, configuring the precise timing of spikes is a promising direction for understanding and improving the transmission of temporal information in SNNs. However, most existing learning methods for spiking neurons focus on adjusting synaptic weights, while very little research has addressed axonal delay. In this paper, we verify the effectiveness of integrating time delay into supervised learning and propose a module that modulates the axonal delay through short-term memory. To this end, a rectified axonal delay (RAD) module is integrated with the spiking model to align spike timing and thus improve the representation learning of temporal features. Experiments on three neuromorphic benchmark datasets (NMNIST, DVS Gesture, and N-TIDIGITS18) show that the proposed method achieves state-of-the-art performance while using the fewest parameters.
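As a rough illustration of what a per-neuron delay does to spike timing, the sketch below shifts each binary spike train later in time by its own delay. It is not the RAD module: it assumes discrete time steps, treats "rectified" as simply clipping each delay to the range 0 to `max_delay`, and uses a hypothetical function name. How the delays are learned is outside its scope.

```python
def apply_axonal_delay(spike_trains, delays, max_delay):
    """Shift each binary spike train later in time by its own clipped delay.

    Hypothetical illustration: delays are clipped to [0, max_delay] and rounded
    to whole time steps; spikes pushed past the end of the window are dropped.
    """
    delayed = []
    for train, delay in zip(spike_trains, delays):
        steps = len(train)
        shift = int(round(min(max(delay, 0.0), max_delay)))  # rectify, then discretize
        if shift >= steps:
            delayed.append([0] * steps)
        else:
            delayed.append([0] * shift + train[:steps - shift])
    return delayed


# Neuron 0 fires one step earlier than neuron 1; a delay of 1 on neuron 0 aligns them.
trains = [[1, 0, 0, 0], [0, 1, 0, 0]]
aligned = apply_axonal_delay(trains, delays=[1.0, -2.0], max_delay=2)
```

The negative delay on the second neuron is clipped to zero, so after the shift both spikes fall on the same time step.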

Preprint · 2022 · arXiv

Event-related data conditioning for acoustic event classification

Models based on diverse attention mechanisms have recently excelled in tasks related to acoustic event classification (AEC). Among them, self-attention is often used in audio-only tasks to help the model recognize different acoustic events. Self-attention relies on the similarity between time frames and uses global information from the whole segment to highlight specific features within a frame. In real life, information related to acoustic events attenuates over time, which means that the frames around an event deserve more attention than temporally distant global information that may be unrelated to the event. This paper shows that self-attention may over-enhance certain segments of audio representations and smooth out the boundaries between event representations and background noise. Hence, this paper proposes event-related data conditioning (EDC) for AEC. EDC operates directly on spectrograms. The idea of EDC is to adaptively select the frame-related attention range based on acoustic features and to gather the event-related local information to represent the frame. Experiments show that: 1) EDC outperforms spectrogram-based data augmentation methods, trainable feature weighting, and self-attention in both the original-size mode and the augmented mode; 2) EDC effectively gathers event-related local information and enhances the boundaries between events and backgrounds, improving the performance of AEC.
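The contrast between a global summary and an adaptively chosen local range can be shown with a toy example. The sketch below is not EDC as published: it assumes that the range around each frame grows while neighboring frames stay similar to the center frame (cosine similarity at or above a threshold) and that the frame is then represented by the mean over that range. The function names, `threshold` and `max_radius` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def local_conditioning(frames, threshold=0.8, max_radius=3):
    """Represent each frame by the mean of an adaptively sized local window.

    Hypothetical sketch: the window extends left and right while neighboring
    frames remain similar to the center frame, up to max_radius frames per side.
    """
    conditioned = []
    last = len(frames) - 1
    for t, center in enumerate(frames):
        lo = t
        while lo > 0 and t - lo < max_radius and cosine(frames[lo - 1], center) >= threshold:
            lo -= 1
        hi = t
        while hi < last and hi - t < max_radius and cosine(frames[hi + 1], center) >= threshold:
            hi += 1
        window = frames[lo:hi + 1]
        conditioned.append([sum(column) / len(window) for column in zip(*window)])
    return conditioned


# Two frames of one event followed by two frames of a different one.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
conditioned = local_conditioning(frames)
```

Because each window stops at the point where the features change, the boundary between the two events stays sharp here, whereas averaging over the whole segment would blend every frame toward `[0.5, 0.5]`.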
