Researcher profile

Xu Cao

Xu Cao contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
14 works
0 followers
12 topics
4 close collaborators

Published work

14 published items

preprint 2026 arXiv

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

preprint 2025 arXiv

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Spurious bias, a tendency to exploit spurious correlations between superficial input attributes and prediction targets, has revealed a severe robustness pitfall in classical machine learning problems. Multimodal Large Language Models (MLLMs), which leverage pretrained vision and language models, have recently demonstrated strong capability in joint vision-language understanding. However, both the presence and severity of spurious biases in MLLMs remain poorly understood. In this work, we address this gap by analyzing the spurious biases in the multimodal setting and uncovering the specific inference-time data patterns that can manifest this problem. To support this analysis, we introduce MM-SpuBench, a comprehensive, human-verified benchmark dataset consisting of image-class pairs annotated with core and spurious attributes, grounded in our taxonomy of nine distinct types of spurious correlations. The benchmark is constructed using human-interpretable attribute information to capture a wide range of spurious patterns reflective of real-world knowledge. Leveraging this benchmark, we conduct a comprehensive evaluation of the state-of-the-art open-source and proprietary MLLMs with both standard accuracy and the proposed Conditional Generation Likelihood Advantage (CGLA). Our findings highlight the persistence of reliance on spurious correlations and the difficulty of mitigation on our benchmark. We hope this work can inspire new technical strides to mitigate these biases. Our benchmark is publicly available at https://huggingface.co/datasets/mmbench/MM-SpuBench.

preprint 2025 arXiv

Three-dimensional imaging of hadrons with hard exclusive reactions: advances in experiment, theory, phenomenology, and lattice QCD

Generalized Parton Distributions (GPDs) have emerged as a powerful framework for exploring the internal structure of hadrons in terms of their partonic constituents. Over the past three decades, the field has witnessed significant theoretical and experimental advancements. The interpretation of GPDs in impact parameter space offers a vivid three-dimensional visualization of hadron structure, correlating longitudinal momentum and transverse spatial distributions, thereby enabling tomographic imaging of hadrons. Furthermore, the link between GPDs and the matrix elements of the QCD energy-momentum tensor provides access to fundamental properties of hadrons, including spin decomposition and internal pressure distributions. Notably, recent analyses of Deeply Virtual Compton Scattering (DVCS) data have enabled the empirical extraction of the quark pressure profile inside the proton. This white paper presents an overview of recent developments in GPD theory and phenomenology, as well as progress in lattice QCD studies. It outlines the prospects for advancing our understanding of hadron structure through the next generation of dedicated experiments, including the extension of the Jefferson Lab 12 GeV program (and its potential 22 GeV upgrade), J-PARC, COMPASS/AMBER, LHC ultra-peripheral collisions, and the future electron-ion colliders EIC and EicC.

preprint 2024 arXiv

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

The prominent large language models (LLMs) of today differ from past language models not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs' training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their application to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of the code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the abilities to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.

preprint 2022 arXiv

A Compacted Structure for Cross-domain learning on Monocular Depth and Flow Estimation

Accurate motion and depth recovery is important for many robot vision tasks including autonomous driving. Most previous studies have achieved cooperative multi-task interaction via either pre-defined loss functions or cross-domain prediction. This paper presents a multi-task scheme that achieves mutual assistance by means of our Flow to Depth (F2D), Depth to Flow (D2F), and Exponential Moving Average (EMA). The F2D and D2F mechanisms enable multi-scale information integration between the optical flow and depth domains based on differentiable shallow nets. A dual-head mechanism is used to predict optical flow for rigid and non-rigid motion in a divide-and-conquer manner, which significantly improves the optical flow estimation performance. Furthermore, to make the prediction more robust and stable, EMA is used for our multi-task training. Experimental results on KITTI datasets show that our multi-task scheme outperforms other multi-task schemes and provides marked improvements in the prediction results.

preprint 2022 arXiv

Comparing fast imaging techniques for individual pulse imaging by Cherenkov in vivo from electron FLASH irradiation

Objective: In this study, a fast imaging technique was developed for the first in vivo Cherenkov emission imaging from an ultra-high dose rate (UHDR) electron beam source at single pulse (360 Hz) submillimeter resolution. Approach: A CMOS camera, gated to the UHDR LINAC, imaged the Cherenkov emission profiles pulse by pulse passively during the irradiation of mice on their limbs and intestinal region. The utility of an intensifier was investigated for its effect on image quality including signal to noise and spatial resolution. Pulse by pulse variability in the Cherenkov emission profile was quantified spatially and temporally. Main results: An intensifier improved the emission profile signal to noise ratio from 15 to 280, with reduced spatial resolution. The profile extended beyond the treatment field due to the lateral scattering of the electrons in tissue and its optical properties. The CMOS camera with an intensifier detected the changes in Cherenkov emission profile during expiration and inspiration of the respiration cycle for the mice to be about 3 mm. Significance: This fast imaging technique can be utilized for in vivo intrafraction monitoring of FLASH patient treatments at single pulse resolution. It can display delivery differences during respiration, and variability in the delivered treatment's surface profile, which may deviate from the intended UHDR treatment, more so for pencil beam scanning systems. The technique may leverage the Cherenkov emission surface profile to gate the treatment delivery via respiratory gating systems under FLASH conditions.

preprint 2022 arXiv

Timelike nucleon electromagnetic form factors: All about interference of isospin amplitudes

A striking feature of the timelike nucleon electromagnetic form factors, investigated in $e^+e^- \to N\bar N$ annihilation reactions, is the modulation by local structures of small magnitude and oscillatory form, showing up above $N\bar N$ threshold. Starting from an isospin decomposition of the proton and neutron form factors it is shown that such structures are the natural consequence of the interference of a large and a small amplitude, resulting in a sinusoidal behavior as a function of the "invariant energy" if the relative phase shift varies with energy. Thus, periodic oscillations superimposed on a smooth background will be observed. In this scenario, the equal size of the modulation for neutron and proton found in recent BESIII data evidently implies the particular isoscalar or isovector nature of these local structures, or their orthogonal interference, and hence specifies their origin as excited vector mesons whose widths are tied to the modulation frequency. We clarify that the phase difference of the modulation between neutron and proton found in the BESIII data, rather than the modulation itself, is the evidence for an imaginary part of the timelike nucleon electromagnetic form factors, which is associated with the rescattering processes.

preprint 2021 arXiv

Electron FLASH Delivery at Treatment Room Isocenter for Efficient Reversible Conversion of a Clinical LINAC

Purpose: In this study, procedures were developed to achieve efficient reversible conversion of a clinical linear accelerator (LINAC) and deliver electron FLASH (eFLASH) or conventional beams to the treatment room isocenter. Material & Methods: The LINAC was converted to deliver eFLASH beam within 20 minutes by retracting the x-ray target from the beam's path, positioning the carousel on an empty port, and selecting 10 MV photon beam energy in the treatment console. Dose per pulse and average dose rate were measured in a solid water phantom at different depths with Gafchromic film and OSLD. A pulse controller counted the pulses via scattered radiation signal and gated the delivery for preset pulse count. A fast photomultiplier tube-based Cherenkov detector measured per pulse beam output at 2 ns sampling rate. After conversion back to clinical mode, conventional beam output, flatness, symmetry, field size and energy were measured for all clinically commissioned energies. Results: A dose per pulse of 0.86 +/- 0.01 Gy (310 +/- 7 Gy/s average dose rate) was achieved at isocenter. The doses from simultaneous irradiation of film and OSLD agreed within 1%. The PMT showed the LINAC required about 5 pulses before the output stabilized and its long-term stability was within 3% for measurements performed at 3-minute intervals. The dose, flatness, symmetry, and photon energy were unchanged from baseline and within tolerance (1%, 3%, 2%, and 0.1% respectively) after reverting to conventional beams. Conclusion: 10 MeV FLASH beams were achieved at the isocenter of the treatment room. The beam output was reproducible but requires further investigation of the ramp up time in the first 5 pulses, equivalent to <100 cGy. The eFLASH beam can irradiate both small and large subjects in minimally modified clinical settings and dose rates can be further increased by reducing the source to surface distance.

preprint 2021 arXiv

Individual Pulse Monitoring and Dose Control System for Pre-Clinical Implementation of FLASH-RT

Ultra-high dose rate electron sources require dose rate independent dosimeters and a calibrated dose control system for accurate delivery. In this study, we developed a single-pulse dose monitoring system and a real-time dose-based control system for a converted clinical linear accelerator (LINAC). A point scintillator detector was coupled to a gated amplifier and a real-time controller for dose monitoring and a feedback control loop. The controller was programmed to integrate dose and measure pulse width of each radiation pulse and gate the LINAC beam when the prescribed dose was delivered. The scintillator was mounted in a solid water phantom and placed underneath mouse skin for in vivo dose monitoring. Additionally, the scintillator was characterized in terms of its radiation stability, mean dose-rate, and dose per pulse dependence. Dose integration was performed for each radiation pulse and displayed in real-time. The scintillator was shown to be linear with mean dose-rate (40-380 Gy/s) and dose per pulse (0.3-1.3 Gy/Pulse) to within +/- 3%. However, the plastic scintillator was subject to significant radiation damage (16%/kGy) and would need to be calibrated frequently. Pulse-counting control was accurately implemented with direct correspondence between the intended and the actual delivered pulses. The dose-based control was sufficient to gate on any pulse of the LINAC. In-vivo dosimetry monitoring with a 1 cm circular cut-out revealed that a ramp-up of 4-5 pulses was present during which the average dose per pulse was ~0.045 +/- 0.004 Gy/Pulse, whereas after the ramp-up it stabilized at 0.65 +/- 0.01 Gy/Pulse. The tools presented in this study can be used to determine the beam parameter space pertinent to the FLASH effect. Additionally, this study is the first instance of real-time dose-based control for a modified LINAC at ultra-high dose rates.

preprint 2021 arXiv

Photoproduction of strange hidden-charm and hidden-bottom states

Recently, the BESIII collaboration discovered a charged strange hidden-charm state $Z_{cs}$(3985) in the $D_s^-D^{*0} + D_s^{*-}D^{0}$ spectrum. A higher $Z'_{cs}$ state coupling to $\bar{D}_s^{*-}D^{*0}$ is expected by SU(3)-flavor symmetry, and their bottom partners are anticipated by heavy quark flavor symmetry. Here we study the photoproduction of these exotic states and carefully investigate the background from Pomeron exchange. Our results indicate that the maximal photoproduction cross section of the strange partners is around 1 $\sim$ 2 orders of magnitude smaller than that of the corresponding non-strange states. The possibility of searching for them at future electron-ion colliders (EIC) is briefly discussed.

preprint 2021 arXiv

Treatment Planning System for Electron FLASH Radiotherapy: Open-source for Clinical Implementation

Purpose: A Monte Carlo (MC) beam model and its implementation in a clinical treatment planning system (TPS, Varian Eclipse) are presented for a modified ultra-high dose-rate electron FLASH radiotherapy (eFLASH-RT) LINAC. Methods: The gantry head without scattering foils or targets, representative of the LINAC modifications, was modelled in Geant4. The energy spectrum (σE) and beam source emittance cone angle (θcone) were varied to match the calculated and Gafchromic film measured central-axis percent depth dose (PDD) and lateral profiles. Its Eclipse configuration was validated with measured profiles of the open field and nominal fields for clinical applicators. eFLASH-RT plans were MC forward calculated in Geant4 for a mouse brain treatment and compared to a conventional (Conv-RT) plan in Eclipse for a human patient with metastatic renal cell carcinoma. Results: The beam model and its Eclipse configuration agreed best with measurements at σE=0.5 MeV and θcone=3.9+/-0.2 degrees to clinically acceptable accuracy (the absolute average error was within 1.5% for in-water lateral, 3% for in-air lateral, and 2% for PDD). The forward dose calculation showed dose was delivered to the entire mouse brain with adequate conformality. The human patient case demonstrated the planning capability with routine accessories in relatively complex geometry to achieve an acceptable plan (90% of the tumor volume receiving 95% and 90% of the prescribed dose for eFLASH and Conv-RT, respectively). Conclusion: To the best of our knowledge, this is the first functional beam model commissioned in a clinical TPS for eFLASH-RT, enabling planning and evaluation with minimal deviation from Conv-RT workflow. It facilitates the clinical translation as eFLASH-RT and Conv-RT plan quality were comparable for a human patient. The methods can be expanded to model other eFLASH irradiators.

preprint 2020 arXiv

Identify the hidden charm pentaquark signal from non-resonant background in electron-proton scattering

We study the electroproduction of the LHCb pentaquark states under the assumption that they are resonant states. The main concern here is to investigate the final state distribution in phase space in order to extract the feeble pentaquark signal from the large non-resonant background. Our results show that the signal-to-background ratio would increase significantly with proper kinematic cuts, which would be very helpful for future experimental analysis.

preprint 2020 arXiv

Photoproduction of hidden-bottom pentaquark and related topics

Following the discovery of the hidden-charm pentaquark $P_c$ states by the LHCb collaboration, interest in candidate hidden-bottom pentaquark $P_b$ states is increasing. They are anticipated to exist as the analogues of the $P_c$ states in the bottom sector and are predicted by many models. We explore the search for a typical $P_b$ in the $γp \to Υp$ reaction, which shows promising potential for observing it at an electron-ion collider. The possibility of searching for $P_b$ in open-bottom channels is also briefly discussed. Meanwhile, the $t$-channel non-resonant contribution, which in fact covers several interesting topics at low energies, is systematically investigated.

preprint 2014 arXiv

Charmonium resonances and Fano line shapes

Anomalous line shapes of quarkonia are explained naturally as an interference effect of a $c\bar c$ confined closed channel with the surrounding continua, well established in other fields of physics as Fano resonances. We discuss a quark model coupled-channel analysis describing quarkonium as a mixing of closed $Q\bar Q$ and molecular-like $D\bar D$ open channels. The asymmetric line shapes observed in $ψ(3770)$ production cross sections in $e^+e^-$ annihilation to $D^0\bar{D}^0$ and $D^+ D^-$, respectively, are described very well. The method allows the amount of $Q\bar Q \leftrightarrow D\bar D$ configuration mixing to be extracted directly from the data.