Researcher profile

Pinhao Song

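Pinhao Song's listed work covers underwater object detection, domain generalization, oriented object detection, skeleton-based action recognition, and model-based visual control.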

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
7 works
0 followers
4 topics
4 close collaborators

Published work

7 published items

preprint · 2026 · arXiv

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner's long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.
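
As a rough illustration of the uncertainty-aware lambda-return described in the ELVIS abstract, the Python sketch below computes a lambda-return with a per-step lambda. The abstract does not state the gating rule, so everything specific here is an assumption: the function name `gated_lambda_return`, the parameters `lam_max` and `beta`, the use of the ensemble mean as the bootstrap value, and the gate `lam_max / (1 + beta * std)`, which shortens look-ahead where the critic ensemble disagrees. It is not the authors' implementation.

```python
from statistics import mean, pstdev


def gated_lambda_return(rewards, ensemble_values, gamma=0.99, lam_max=0.95, beta=1.0):
    """Lambda-return with a per-step lambda gated by critic disagreement.

    rewards:         H imagined rewards r_0 .. r_{H-1}
    ensemble_values: H + 1 lists, each holding every critic's value estimate
                     for the states s_0 .. s_H
    Returns (returns, lams): H return targets and the H + 1 per-state lambdas.
    """
    horizon = len(rewards)
    if len(ensemble_values) != horizon + 1:
        raise ValueError("need one value list per state s_0 .. s_H")

    means = [mean(v) for v in ensemble_values]
    stds = [pstdev(v) for v in ensemble_values]

    # Hypothetical gate (an assumption, not taken from the paper):
    # more disagreement -> smaller lambda -> bootstrap sooner.
    lams = [lam_max / (1.0 + beta * s) for s in stds]

    # Backward recursion:
    # G_t = r_t + gamma * ((1 - lam_{t+1}) * V(s_{t+1}) + lam_{t+1} * G_{t+1})
    ret = means[-1]  # G_H is the bootstrap value of the last state
    returns = [0.0] * horizon
    for t in reversed(range(horizon)):
        lam = lams[t + 1]
        ret = rewards[t] + gamma * ((1.0 - lam) * means[t + 1] + lam * ret)
        returns[t] = ret
    return returns, lams
```

With `lam_max=0` every target reduces to a one-step bootstrapped target; with no ensemble disagreement and `lam_max=1` it becomes the discounted reward sum plus the discounted final value.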

preprint · 2023 · arXiv

A Gated Cross-domain Collaborative Network for Underwater Object Detection

Underwater object detection (UOD) plays a significant role in aquaculture and marine environmental protection. Considering the challenges posed by low contrast and low-light conditions in underwater environments, several underwater image enhancement (UIE) methods have been proposed to improve the quality of underwater images. However, using only the enhanced images does not improve the performance of UOD, since it may unavoidably remove or alter critical patterns and details of underwater objects. In contrast, we believe that exploring the complementary information from the two domains is beneficial for UOD. The raw image preserves the natural characteristics of the scene and texture information of the objects, while the enhanced image improves the visibility of underwater objects. Based on this perspective, we propose a Gated Cross-domain Collaborative Network (GCC-Net), comprising three dedicated components, to address the challenges of poor visibility and low contrast in underwater environments. Firstly, a real-time UIE method is employed to generate enhanced images, which can improve the visibility of objects in low-contrast areas. Secondly, a cross-domain feature interaction module is introduced to facilitate the interaction and mine complementary information between raw and enhanced image features. Thirdly, to prevent contamination by unreliable generated results, a gated feature fusion module is proposed to adaptively control the fusion ratio of cross-domain information. Our method presents a new UOD paradigm from the perspective of cross-domain information interaction and fusion. Experimental results demonstrate that the proposed GCC-Net achieves state-of-the-art performance on four underwater datasets.
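
The idea of a gate that adaptively controls the fusion ratio between raw and enhanced features can be sketched in a few lines of Python. This is a minimal per-element illustration, not the GCC-Net module: the function name `gated_fusion`, the scalar parameters `w_raw`, `w_enh`, and `bias`, and the convention that the gate weights the raw feature are all assumptions made for the example. A real detector would compute the gate from feature maps with learned layers.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(raw_feat, enh_feat, w_raw, w_enh, bias):
    """Fuse two equally long feature vectors with a per-element gate.

    gate  = sigmoid(w_raw * raw + w_enh * enh + bias)   (hypothetical form)
    fused = gate * raw + (1 - gate) * enh
    """
    fused, gates = [], []
    for r, e in zip(raw_feat, enh_feat):
        g = sigmoid(w_raw * r + w_enh * e + bias)
        gates.append(g)
        fused.append(g * r + (1.0 - g) * e)
    return fused, gates
```

With all parameters at zero the gate is 0.5 and the output is the plain average; a large positive `bias` pushes the gate toward 1, so the fused feature falls back to the raw input when the enhanced one is not trusted.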

preprint · 2023 · arXiv

Achieving Domain Generalization in Underwater Object Detection by Domain Mixup and Contrastive Learning

The performance of existing underwater object detection methods degrades seriously when facing domain shift caused by complicated underwater environments. Due to the limitation of the number of domains in the dataset, deep detectors easily memorize a few seen domains, which leads to low generalization ability. There are two common ideas to improve the domain generalization performance. First, it can be inferred that the detector trained on as many domains as possible is domain-invariant. Second, for the images with the same semantic content in different domains, their hidden features should be equivalent. This paper further excavates these two ideas and proposes a domain generalization framework (named DMC) that learns how to generalize across domains from Domain Mixup and Contrastive Learning. First, based on the formation of underwater images, an image in one underwater environment can be modeled as a linear transformation of its counterpart in another underwater environment. Thus, a style transfer model, which outputs a linear transformation matrix instead of the whole image, is proposed to transform images from one source domain to another, enriching the domain diversity of the training data. Second, a mixup operation interpolates different domains on the feature level, sampling new domains on the domain manifold. Third, a contrastive loss is selectively applied to features from different domains to force the model to learn domain-invariant features while retaining discriminative capacity. With our method, detectors will be robust to domain shift. Also, a domain generalization benchmark S-UODAC2020 for detection is set up to measure the performance of our method. Comprehensive experiments on S-UODAC2020 and two object recognition benchmarks (PACS and VLCS) demonstrate that the proposed method is able to learn domain-invariant representations, and outperforms other domain generalization methods.
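
Feature-level mixup between two domains, as mentioned in the abstract above, amounts to a convex combination of feature vectors. The sketch below is a generic illustration under stated assumptions: the function name `mixup_features` and its parameters are hypothetical, and drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the common mixup recipe rather than anything stated in the abstract.

```python
import random


def mixup_features(feat_a, feat_b, lam=None, alpha=1.0, rng=random):
    """Interpolate two equally long feature vectors from different domains.

    If lam is not given, it is sampled from Beta(alpha, alpha), the usual
    choice in mixup (an assumption here, not taken from the paper).
    Returns the mixed feature and the coefficient that was used.
    """
    if lam is None:
        lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    return mixed, lam
```

For example, `mixup_features([1.0, 0.0], [0.0, 1.0], lam=0.25)` gives a feature a quarter of the way from the second domain toward the first, which can be read as a sample from a new, intermediate domain.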

preprint · 2022 · arXiv

AO2-DETR: Arbitrary-Oriented Object Detection Transformer

Arbitrary-oriented object detection (AOOD) is a challenging task to detect objects in the wild with arbitrary orientations and cluttered arrangements. Existing approaches are mainly based on anchor-based boxes or dense points, which rely on complicated hand-designed processing steps and inductive bias, such as anchor generation, transformation, and non-maximum suppression reasoning. Recently, the emerging transformer-based approaches view object detection as a direct set prediction problem that effectively removes the need for hand-designed components and inductive biases. In this paper, we propose an Arbitrary-Oriented Object DEtection TRansformer framework, termed AO2-DETR, which comprises three dedicated components. More precisely, an oriented proposal generation mechanism is proposed to explicitly generate oriented proposals, which provides better positional priors for pooling features to modulate the cross-attention in the transformer decoder. An adaptive oriented proposal refinement module is introduced to extract rotation-invariant region features and eliminate the misalignment between region features and objects. Finally, a rotation-aware set matching loss is used to ensure the one-to-one matching process for direct set prediction without duplicate predictions. Our method considerably simplifies the overall pipeline and presents a new AOOD paradigm. Comprehensive experiments on several challenging datasets show that our method achieves superior performance on the AOOD task.

preprint · 2022 · arXiv

Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition

Self-supervised skeleton-based action recognition with contrastive learning has attracted much attention. Recent literature shows that data augmentation and large sets of contrastive pairs are crucial in learning such representations. In this paper, we find that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance, because the contribution of contrastive pairs from the normal data augmentation to the loss gets smaller as training progresses. Therefore, we delve into hard contrastive pairs for contrastive learning. Motivated by the success of mixing augmentation strategies, which improve the performance of many tasks by synthesizing novel samples, we propose SkeleMixCLR: a contrastive learning framework with a spatio-temporal skeleton mixing augmentation (SkeleMix) to complement current contrastive learning approaches by providing hard contrastive samples. First, SkeleMix utilizes the topological information of skeleton data to mix two skeleton sequences by randomly combining the cropped skeleton fragments (the trimmed view) with the remaining skeleton sequences (the truncated view). Second, a spatio-temporal mask pooling is applied to separate these two views at the feature level. Third, we extend contrastive pairs with these two views. SkeleMixCLR leverages the trimmed and truncated views to provide abundant hard contrastive pairs since they involve some context information from each other due to the graph convolution operations, which allows the model to learn better motion representations for action recognition. Extensive experiments on NTU-RGB+D, NTU120-RGB+D, and PKU-MMD datasets show that SkeleMixCLR achieves state-of-the-art performance. Code is available at https://github.com/czhaneva/SkeleMixCLR.
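
The mixing step described for SkeleMix can be pictured as pasting a spatio-temporal fragment of one skeleton sequence into another and keeping a mask of where the fragment landed. The sketch below is a simplified illustration, not the released SkeleMixCLR code: the function name `skelemix`, the arguments `joint_ids`, `t_start`, and `t_len`, and the representation of a sequence as a list of frames, each a list of per-joint values, are assumptions made for the example.

```python
def skelemix(seq_a, seq_b, joint_ids, t_start, t_len):
    """Paste a fragment of seq_b into a copy of seq_a.

    seq_a, seq_b: lists of frames with the same shape; each frame is a list
                  holding one entry per joint
    joint_ids:    joints that make up the cropped fragment (e.g. one body part)
    t_start:      first frame of the fragment
    t_len:        number of frames in the fragment
    Returns (mixed, mask); mask[t][j] is True where the value came from seq_b.
    """
    frames = len(seq_a)
    joints = len(seq_a[0])
    mixed = [list(frame) for frame in seq_a]  # copy, seq_a stays untouched
    mask = [[False] * joints for _ in range(frames)]
    for t in range(t_start, min(t_start + t_len, frames)):
        for j in joint_ids:
            mixed[t][j] = seq_b[t][j]
            mask[t][j] = True
    return mixed, mask
```

The returned mask is the kind of information a mask pooling step would need to separate the two views at the feature level: `True` entries mark the pasted fragment and `False` entries mark what remains of the original sequence.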

preprint · 2022 · arXiv

Excavating RoI Attention for Underwater Object Detection

Self-attention is one of the most successful designs in deep learning, which calculates the similarity of different tokens and reconstructs the feature based on the attention matrix. Originally designed for NLP, self-attention is also popular in computer vision, and can be categorized into pixel-level attention and patch-level attention. In object detection, RoI features can be seen as patches from base feature maps. This paper aims to apply the attention module to RoI features to improve performance. Instead of employing an original self-attention module, we choose the external attention module, a modified self-attention with reduced parameters. With the proposed double head structure and the Positional Encoding module, our method can achieve promising performance in object detection. Comprehensive experiments show that it achieves promising performance, especially on the underwater object detection dataset. The code will be available at https://github.com/zsyasd/Excavating-RoI-Attention-for-Underwater-Object-Detection
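
To make the first sentence of this abstract concrete, the sketch below applies plain scaled dot-product self-attention to a handful of RoI feature vectors: it computes token-to-token similarities, normalizes them with a softmax, and rebuilds each feature as a weighted sum. It deliberately omits learned query, key, and value projections, and it is not the external attention module the paper adopts. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections.

    tokens: list of N feature vectors (e.g. pooled RoI features), each of
            length d
    Returns (out, attn): the N reconstructed features and the N x N
    attention matrix whose rows sum to 1.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    attn = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        attn.append(softmax(scores))
    out = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in attn
    ]
    return out, attn
```

Each output row is a convex combination of the input features, so RoIs that look alike pull each other's representations together while dissimilar ones contribute little.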

preprint · 2020 · arXiv

WQT and DG-YOLO: towards domain generalization in underwater object detection

A General Underwater Object Detector (GUOD) should perform well in most underwater circumstances. However, with limited underwater datasets, conventional object detection methods suffer severely from domain shift. This paper aims to build a GUOD from a small underwater dataset covering limited types of water quality. First, we propose a data augmentation method, Water Quality Transfer (WQT), to increase the domain diversity of the original small dataset. Second, to mine the semantic information from data generated by WQT, DG-YOLO is proposed, which consists of three parts: YOLOv3, DIM and IRM penalty. Finally, experiments on the original and synthetic URPC2019 datasets demonstrate that WQT+DG-YOLO achieves promising domain generalization performance in underwater object detection.