Researcher profile

Cheng-Yen Yang

Cheng-Yen Yang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model is available at https://github.com/hsiangwei0903/CaMo

preprint2026arXiv

Reasoning Matters for 3D Visual Grounding

The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, as a fundamental task in 3D understanding, still remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most of the current methods incorporate a text encoder and visual feature encoder to generate cross-modal fuse features and predict the referring object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focus on scaling synthetic data to train stronger 3D visual grounding LLM, however, the performance gain remains limited and non-proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline, which is capable of automatically synthesizing 3D visual grounding data along with corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms previous LLM-based method 3D-GRAND using only 1.6% of their training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.

preprint2023arXiv

CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations

To improve the generalization of 3D human pose estimators, many existing deep learning based models focus on adding different augmentations to training poses. However, data augmentation techniques are limited to the "seen" pose combinations and hard to infer poses with rare "unseen" joint positions. To address this problem, we present CameraPose, a weakly-supervised framework for 3D human pose estimation from a single image, which can not only be applied on 2D-3D pose pairs but also on 2D alone annotations. By adding a camera parameter branch, any in-the-wild 2D annotations can be fed into our pipeline to boost the training diversity and the 3D poses can be implicitly learned by reprojecting back to 2D. Moreover, CameraPose introduces a refinement network module with confidence-guided loss to further improve the quality of noisy 2D keypoints extracted by 2D pose estimators. Experimental results demonstrate that the CameraPose brings in clear improvements on cross-scenario datasets. Notably, it outperforms the baseline method by 3mm on the most challenging dataset 3DPW. In addition, by combining our proposed refinement network module with existing 3D pose estimators, their performance can be improved in cross-scenario evaluation.

preprint2022arXiv

GaitTAKE: Gait Recognition by Temporal Attention and Keypoint-guided Embedding

Gait recognition, which refers to the recognition or identification of a person based on their body shape and walking styles, derived from video data captured from a distance, is widely used in crime prevention, forensic identification, and social security. However, to the best of our knowledge, most of the existing methods use appearance, posture and temporal feautures without considering a learned temporal attention mechanism for global and local information fusion. In this paper, we propose a novel gait recognition framework, called Temporal Attention and Keypoint-guided Embedding (GaitTAKE), which effectively fuses temporal-attention-based global and local appearance feature and temporal aggregated human pose feature. Experimental results show that our proposed method achieves a new SOTA in gait recognition with rank-1 accuracy of 98.0% (normal), 97.5% (bag) and 92.2% (coat) on the CASIA-B gait dataset; 90.4% accuracy on the OU-MVLP gait dataset.

preprint2022arXiv

NTIRE 2021 Multi-modal Aerial View Object Classification Challenge

In this paper, we introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR. This challenge is composed of two different tracks using EO andSAR imagery. Both EO and SAR sensors possess different advantages and drawbacks. The purpose of this competition is to analyze how to use both sets of sensory information in complementary ways. We discuss the top methods submitted for this competition and evaluate their results on our blind test set. Our challenge results show significant improvement of more than 15% accuracy from our current baselines for each track of the competition

preprint2022arXiv

Unsupervised Domain Adaptation Learning for Hierarchical Infant Pose Recognition with Synthetic Data

The Alberta Infant Motor Scale (AIMS) is a well-known assessment scheme that evaluates the gross motor development of infants by recording the number of specific poses achieved. With the aid of the image-based pose recognition model, the AIMS evaluation procedure can be shortened and automated, providing early diagnosis or indicator of potential developmental disorder. Due to limited public infant-related datasets, many works use the SMIL-based method to generate synthetic infant images for training. However, this domain mismatch between real and synthetic training samples often leads to performance degradation during inference. In this paper, we present a CNN-based model which takes any infant image as input and predicts the coarse and fine-level pose labels. The model consists of an image branch and a pose branch, which respectively generates the coarse-level logits facilitated by the unsupervised domain adaptation and the 3D keypoints using the HRNet with SMPLify optimization. Then the outputs of these branches will be sent into the hierarchical pose recognition module to estimate the fine-level pose labels. We also collect and label a new AIMS dataset, which contains 750 real and 4000 synthetic infants images with AIMS pose labels. Our experimental results show that the proposed method can significantly align the distribution of synthetic and real-world datasets, thus achieving accurate performance on fine-grained infant pose recognition.