Researcher profile

Haojie Zhang

Haojie Zhang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.

preprint2026arXiv

MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.

preprint2026arXiv

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.

preprint2024arXiv

Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Multi-view surface reconstruction is an ill-posed, inverse problem in 3D vision research. It involves modeling the geometry and appearance with appropriate surface representations. Most of the existing methods rely either on explicit meshes, using surface rendering of meshes for reconstruction, or on implicit field functions, using volume rendering of the fields for reconstruction. The two types of representations in fact have their respective merits. In this work, we propose a new hybrid representation, termed Sur2f, aiming to better benefit from both representations in a complementary manner. Technically, we learn two parallel streams of an implicit signed distance field and an explicit surrogate surface Sur2f mesh, and unify volume rendering of the implicit signed distance function (SDF) and surface rendering of the surrogate mesh with a shared, neural shader; the unified shading promotes their convergence to the same, underlying surface. We synchronize learning of the surrogate mesh by driving its deformation with functions induced from the implicit SDF. In addition, the synchronized surrogate mesh enables surface-guided volume sampling, which greatly improves the sampling efficiency per ray in volume rendering. We conduct thorough experiments showing that Sur$^2$f outperforms existing reconstruction methods and surface representations, including hybrid ones, in terms of both recovery quality and recovery efficiency.

preprint2022arXiv

High-Order Continuous-Variable Coherence of Phase-Dependent Squeezed State

We study continuous variable coherence of phase-dependent squeezed state based on an extended Hanbury Brown-Twiss scheme. High-order coherence is continuously varied by adjusting squeezing parameter $r$, displacement $α$, and squeezing phase $θ$. We also analyze effects of background noise $γ$ and detection efficiency $η$ on the measurements. As the squeezing phase shifts from 0 to $π$, the photon statistics of the squeezed state continuously change from the anti-bunching ($g^{(n)}<1$) to super-bunching ($g^{(n)}>n!$) which shows a transition from particle nature to wave nature. The experiment feasibility is also examined. It provides a practical method to generate phase-dependent squeezed states with high-order continuous-variable coherence by tuning squeezing phase $θ$. The controllable coherence source can be applied to sensitivity improvement in gravitational wave detection and quantum imaging.

preprint2020arXiv

Hybrid Robot-assisted Frameworks for Endomicroscopy Scanning in Retinal Surgeries

High-resolution real-time intraocular imaging of retina at the cellular level is very challenging due to the vulnerable and confined space within the eyeball as well as the limited availability of appropriate modalities. A probe-based confocal laser endomicroscopy (pCLE) system, can be a potential imaging modality for improved diagnosis. The ability to visualize the retina at the cellular level could provide information that may predict surgical outcomes. The adoption of intraocular pCLE scanning is currently limited due to the narrow field of view and the micron-scale range of focus. In the absence of motion compensation, physiological tremors of the surgeons&#39; hand and patient movements also contribute to the deterioration of the image quality. Therefore, an image-based hybrid control strategy is proposed to mitigate the above challenges. The proposed hybrid control strategy enables a shared control of the pCLE probe between surgeons and robots to scan the retina precisely, with the absence of hand tremors and with the advantages of an image-based auto-focus algorithm that optimizes the quality of pCLE images. The hybrid control strategy is deployed on two frameworks - cooperative and teleoperated. Better image quality, smoother motion, and reduced workload are all achieved in a statistically significant manner with the hybrid control frameworks.