Researcher profile

Maoxun Yuan

Maoxun Yuan contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 13 - UnverifiedVerification L1Unclaimed author
2works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

2 published item(s)

preprint2026arXiv

Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection

Infrared small target detection (IRSTD) faces significant challenges due to the low signal-to-noise ratio (SNR), small target size, and complex cluttered backgrounds. Although recent DETR-based detectors benefit from global context modeling, they exhibit notable performance degradation on IRSTD. We revisit this phenomenon and reveal that the target-relevant embeddings of IRST are inevitably overwhelmed by dominant background features due to the self-attention mechanism, leading to unreliable query initialization and inaccurate target localization. To address this issue, we propose SEF-DETR, a novel framework that refines query initialization for IRSTD. Specifically, SEF-DETR consists of three components: Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF). The FPS module leverages the Fourier spectrum of local patches to construct a target-relevant density map, suppressing background-dominated features. DEE strengthens multi-scale representations in a target-aware manner, while RCF further refines object queries by enforcing spatial-frequency consistency and reliability. Extensive experiments on three public IRSTD datasets demonstrate that SEF-DETR achieves superior detection performance compared to state-of-the-art methods, delivering a robust and efficient solution for infrared small target detection task.

preprint2025arXiv

RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.