Researcher profile

Junzhi Yu

Junzhi Yu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2026arXiv

VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.

preprint2025arXiv

ReSPIRe: Informative and Reusable Belief Tree Search for Robot Probabilistic Search and Tracking in Unknown Environments

Target search and tracking (SAT) is a fundamental problem for various robotic applications such as search and rescue and environmental exploration. This paper proposes an informative trajectory planning approach, namely ReSPIRe, for SAT in unknown cluttered environments under considerably inaccurate prior target information and limited sensing field of view. We first develop a novel sigma point-based approximation approach to fast and accurately estimate mutual information reward under non-Gaussian belief distributions, utilizing informative sampling in state and observation spaces to mitigate the computational intractability of integral calculation. To tackle significant uncertainty associated with inadequate prior target information, we propose the hierarchical particle structure in ReSPIRe, which not only extracts critical particles for global route guidance, but also adjusts the particle number adaptively for planning efficiency. Building upon the hierarchical structure, we develop the reusable belief tree search approach to build a policy tree for online trajectory planning under uncertainty, which reuses rollout evaluation to improve planning efficiency. Extensive simulations and real-world experiments demonstrate that ReSPIRe outperforms representative benchmark methods with smaller MI approximation error, higher search efficiency, and more stable tracking performance, while maintaining outstanding computational efficiency.

preprint2020arXiv

Joint Anchor-Feature Refinement for Real-Time Accurate Object Detection in Images and Videos

Object detection has been vigorously investigated for years but fast accurate detection for real-world scenes remains a very challenging problem. Overcoming drawbacks of single-stage detectors, we take aim at precisely detecting objects for static and temporal scenes in real time. Firstly, as a dual refinement mechanism, a novel anchor-offset detection is designed, which includes an anchor refinement, a feature location refinement, and a deformable detection head. This new detection mode is able to simultaneously perform two-step regression and capture accurate object features. Based on the anchor-offset detection, a dual refinement network (DRNet) is developed for high-performance static detection, where a multi-deformable head is further designed to leverage contextual information for describing objects. As for temporal detection in videos, temporal refinement networks (TRNet) and temporal dual refinement networks (TDRNet) are developed by propagating the refinement information across time. We also propose a soft refinement strategy to temporally match object motion with the previous refinement. Our proposed methods are evaluated on PASCAL VOC, COCO, and ImageNet VID datasets. Extensive comparisons on static and temporal detection verify the superiority of DRNet, TRNet, and TDRNet. Consequently, our developed approaches run in a fairly fast speed, and in the meantime achieve a significantly enhanced detection accuracy, i.e., 84.4% mAP on VOC 2007, 83.6% mAP on VOC 2012, 69.4% mAP on VID 2017, and 42.4% AP on COCO. Ultimately, producing encouraging results, our methods are applied to online underwater object detection and grasping with an autonomous system. Codes are publicly available at https://github.com/SeanChenxy/TDRN.

preprint2020arXiv

Rethinking Temporal Object Detection from Robotic Perspectives

Video object detection (VID) has been vigorously studied for years but almost all literature adopts a static accuracy-based evaluation, i.e., average precision (AP). From a robotic perspective, the importance of recall continuity and localization stability is equal to that of accuracy, but the AP is insufficient to reflect detectors' performance across time. In this paper, non-reference assessments are proposed for continuity and stability based on object tracklets. These temporal evaluations can serve as supplements to static AP. Further, we develop an online tracklet refinement for improving detectors' temporal performance through short tracklet suppression, fragment filling, and temporal location fusion. In addition, we propose a small-overlap suppression to extend VID methods to single object tracking (SOT) task so that a flexible SOT-by-detection framework is then formed. Extensive experiments are conducted on ImageNet VID dataset and real-world robotic tasks, where the superiority of our proposed approaches are validated and verified. Codes will be publicly available.

preprint2020arXiv

Reveal of Domain Effect: How Visual Restoration Contributes to Object Detection in Aquatic Scenes

Underwater robotic perception usually requires visual restoration and object detection, both of which have been studied for many years. Meanwhile, data domain has a huge impact on modern data-driven leaning process. However, exactly indicating domain effect, the relation between restoration and detection remains unclear. In this paper, we generally investigate the relation of quality-diverse data domain to detection performance. In the meantime, we unveil how visual restoration contributes to object detection in real-world underwater scenes. According to our analysis, five key discoveries are reported: 1) Domain quality has an ignorable effect on within-domain convolutional representation and detection accuracy; 2) low-quality domain leads to higher generalization ability in cross-domain detection; 3) low-quality domain can hardly be well learned in a domain-mixed learning process; 4) degrading recall efficiency, restoration cannot improve within-domain detection accuracy; 5) visual restoration is beneficial to detection in the wild by reducing the domain shift between training data and real-world scenes. Finally, as an illustrative example, we successfully perform underwater object detection with an aquatic robot.

preprint2020arXiv

Temporally Identity-Aware SSD with Attentional LSTM

Temporal object detection has attracted significant attention, but most popular detection methods cannot leverage rich temporal information in videos. Very recently, many algorithms have been developed for video detection task, yet very few approaches can achieve \emph{real-time online} object detection in videos. In this paper, based on attention mechanism and convolutional long short-term memory (ConvLSTM), we propose a temporal single-shot detector (TSSD) for real-world detection. Distinct from previous methods, we take aim at temporally integrating pyramidal feature hierarchy using ConvLSTM, and design a novel structure including a low-level temporal unit as well as a high-level one (LH-TU) for multi-scale feature maps. Moreover, we develop a creative temporal analysis unit, namely, attentional ConvLSTM (AC-LSTM), in which a temporal attention mechanism is specially tailored for background suppression and scale suppression while a ConvLSTM integrates attention-aware features across time. An association loss and a multi-step training are designed for temporal coherence. Besides, an online tubelet analysis (OTA) is exploited for identification. Our framework is evaluated on ImageNet VID dataset and 2DMOT15 dataset. Extensive comparisons on the detection and tracking capability validate the superiority of the proposed approach. Consequently, the developed TSSD-OTA achieves a fast speed and an overall competitive performance in terms of detection and tracking. Finally, a real-world maneuver is conducted for underwater object grasping. The source code is publicly available at https://github.com/SeanChenxy/TSSD-OTA.

preprint2020arXiv

Towards Real-Time Advancement of Underwater Visual Quality with GAN

Low visual quality has prevented underwater robotic vision from a wide range of applications. Although several algorithms have been developed, real-time and adaptive methods are deficient for real-world tasks. In this paper, we address this difficulty based on generative adversarial networks (GAN), and propose a GAN-based restoration scheme (GAN-RS). In particular, we develop a multi-branch discriminator including an adversarial branch and a critic branch for the purpose of simultaneously preserving image content and removing underwater noise. In addition to adversarial learning, a novel dark channel prior loss also promotes the generator to produce realistic vision. More specifically, an underwater index is investigated to describe underwater properties, and a loss function based on the underwater index is designed to train the critic branch for underwater noise suppression. Through extensive comparisons on visual quality and feature restoration, we confirm the superiority of the proposed approach. Consequently, the GAN-RS can adaptively improve underwater visual quality in real time and induce an overall superior restoration performance. Finally, a real-world experiment is conducted on the seabed for grasping marine products, and the results are quite promising. The source code is publicly available at https://github.com/SeanChenxy/GAN_RS.