Researcher profile

Guohui Zhang

Guohui Zhang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

preprint2026arXiv

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses 6 core competencies that focus on perceptivity and interactivity, encompassing 1,000 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

preprint2026arXiv

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.

preprint2026arXiv

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

preprint2020arXiv

Measurement of the neutron beam profile of the Back-n white neutron facility at CSNS with a Micromegas detector

The Back-n white neutron beam line, which uses back-streaming white neutrons from the spallation target of the China Spallation Neutron Source, is used for nuclear data measurements. A Micromegas-based neutron detector with two variants was specially developed to measure the beam spot distribution for this beam line. In this article, the design, fabrication, and characterization of the detector are described. The results of the detector performance tests are presented, which include the relative electron transparency, the gain and the gain uniformity, and the neutron beam profile reconstruction capability. The result of the first measurement of the Back-n neutron beam spot distribution is also presented.

preprint2019arXiv

Measurements of differential and angle-integrated cross sections for the $^{10}$B($n, α$)$^{7}$Li reaction in the neutron energy range from 1.0 eV to 2.5 MeV

Differential and angle-integrated cross sections for the $^{10}$B($n, α$)$^{7}$Li, $^{10}$B($n, α$$_{0}$)$^{7}$Li and $^{10}$B($n, α$$_{1}$)$^{7}$Li$^{*}$ reactions have been measured at CSNS Back-n white neutron source. Two enriched (90%) $^{10}$B samples 5.0 cm in diameter and ~85.0 $μ$g/cm$^{2}$ in thickness each with an aluminum backing were prepared, and back-to-back mounted at the sample holder. The charged particles were detected using the silicon-detector array of the Light-charged Particle Detector Array (LPDA) system. The neutron energy E$_{n}$ was determined by TOF (time-of-flight) method, and the valid $α$ events were extracted from the E$_{n}$-Amplitude two-dimensional spectrum. With 15 silicon detectors, the differential cross sections of $α$-particles were measured from 19.2° to 160.8°. Fitted with the Legendre polynomial series, the ($n, α$) cross sections were obtained through integration. The absolute cross sections were normalized using the standard cross sections of the $^{10}$B($n, α$)$^{7}$Li reaction in the 0.3 - 0.5 MeV neutron energy region. The measurement neutron energy range for the $^{10}$B($n, α$)$^{7}$Li reaction is 1.0 eV $\le$ En < 2.5 MeV (67 energy points), and for the $^{10}$B($n, α$$_{0}$)$^{7}$Li and $^{10}$B($n, α$$_{1}$)$^{7}$Li$^{*}$ reactions is 1.0 eV $\le$ En < 1.0 MeV (59 energy points). The present results have been analyzed by the resonance reaction mechanism and the level structure of the $^{11}$B compound system, and compared with existing measurements and evaluations.