Researcher profile

Zhaoxin Fan

Zhaoxin Fan contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
7 works
0 followers
4 topics
4 close collaborators

Published work

7 published items

Preprint, 2026, arXiv

CUBic: Coordinated Unified Bimanual Perception and Control Framework

Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

Preprint, 2026, arXiv

DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage--where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports--remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.

Preprint, 2026, arXiv

State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure implied by continuous dial values. These findings suggest that existing MLLMs largely ignore the intrinsic state geometry of dial measurement tasks and instead rely on superficial appearance cues. Motivated by this diagnosis, we propose TriSCA, a tri-level state-consistent alignment framework for dial-based measurement reading. Specifically, TriSCA consists of state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment. Extensive ablation studies and evaluation experiments on controlled clock and gauge benchmarks, together with evaluation on an external real-world benchmark, demonstrate the effectiveness of our method.

Preprint, 2024, arXiv

FuRPE: Learning Full-body Reconstruction from Part Experts

In the field of full-body reconstruction, the scarcity of annotated data often impedes the efficacy of prevailing methods. To address this issue, we introduce FuRPE, a novel framework that employs part-experts and an ingenious pseudo ground-truth selection scheme to derive high-quality pseudo labels. These labels, central to our approach, equip our network with the capability to efficiently learn from the available data. Integral to FuRPE is a unique exponential moving average training strategy and expert-derived feature distillation strategy. These novel elements of FuRPE not only serve to further refine the model but also to reduce potential biases that may arise from inaccuracies in pseudo labels, thereby optimizing the network's training process and enhancing the robustness of the model. We apply FuRPE to train both two-stage and fully convolutional single-stage full-body reconstruction networks. Our exhaustive experiments on numerous benchmark datasets illustrate a substantial performance boost over existing methods, underscoring FuRPE's potential to reshape the state-of-the-art in full-body reconstruction.

Preprint, 2022, arXiv

Deep Learning on Monocular Object Pose Detection and Tracking: A Comprehensive Overview

Object pose detection and tracking has recently attracted increasing attention due to its wide applications in many areas, such as autonomous driving, robotics, and augmented reality. Among methods for object pose detection and tracking, deep learning is the most promising one that has shown better performance than others. However, survey study about the latest development of deep learning-based methods is lacking. Therefore, this study presents a comprehensive review of recent progress in object pose detection and tracking that belongs to the deep learning technical route. To achieve a more thorough introduction, the scope of this study is limited to methods taking monocular RGB/RGBD data as input and covering three kinds of major tasks: instance-level monocular object pose detection, category-level monocular object pose detection, and monocular object pose tracking. In our work, metrics, datasets, and methods of both detection and tracking are presented in detail. Comparative results of current state-of-the-art methods on several publicly available datasets are also presented, together with insightful observations and inspiring future research directions.

Preprint, 2022, arXiv

Object Level Depth Reconstruction for Category Level 6D Object Pose Estimation From Monocular RGB Image

Recently, RGBD-based category-level 6D object pose estimation has achieved promising improvement in performance, however, the requirement of depth information prohibits broader applications. In order to relieve this problem, this paper proposes a novel approach named Object Level Depth reconstruction Network (OLD-Net) taking only RGB images as input for category-level 6D object pose estimation. We propose to directly predict object-level depth from a monocular RGB image by deforming the category-level shape prior into object-level depth and the canonical NOCS representation. Two novel modules named Normalized Global Position Hints (NGPH) and Shape-aware Decoupled Depth Reconstruction (SDDR) module are introduced to learn high fidelity object-level depth and delicate shape representations. At last, the 6D object pose is solved by aligning the predicted canonical representation with the back-projected object-level depth. Extensive experiments on the challenging CAMERA25 and REAL275 datasets indicate that our model, though simple, achieves state-of-the-art performance.

Preprint, 2022, arXiv

RPR-Net: A Point Cloud-based Rotation-aware Large Scale Place Recognition Network

Point cloud-based large scale place recognition is an important but challenging task for many applications such as Simultaneous Localization and Mapping (SLAM). Taking the task as a point cloud retrieval problem, previous methods have made delightful achievements. However, how to deal with catastrophic collapse caused by rotation problems is still under-explored. In this paper, to tackle the issue, we propose a novel Point Cloud-based Rotation-aware Large Scale Place Recognition Network (RPR-Net). In particular, to solve the problem, we propose to learn rotation-invariant features in three steps. First, we design three kinds of novel Rotation-Invariant Features (RIFs), which are low-level features that can hold the rotation-invariant property. Second, using these RIFs, we design an attentive module to learn rotation-invariant kernels. Third, we apply these kernels to previous point cloud features to generate new features, which is the well-known SO(3) mapping process. By doing so, high-level scene-specific rotation-invariant features can be learned. We call the above process an Attentive Rotation-Invariant Convolution (ARIConv). To achieve the place recognition goal, we build RPR-Net, which takes ARIConv as a basic unit to construct a dense network architecture. Then, powerful global descriptors used for retrieval-based place recognition can be sufficiently extracted from RPR-Net. Experimental results on prevalent datasets show that our method achieves comparable results to existing state-of-the-art place recognition models and significantly outperforms other rotation-invariant baseline models when solving rotation problems.