Source author record

Tong Zhu

Tong Zhu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, Artificial Intelligence, cond-mat.mtrl-sci

Catalog footprint

What is connected

3 works

3 topics

4 close collaborators

Preprint, 2026, arXiv

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
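The dilution effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or code: it assumes, purely for illustration, that every token receives the same attention logit, so the share of softmax attention on visual tokens is the visual count divided by the total token count. The function name `visual_attention_mass` and the token counts are hypothetical.

```python
import math


def visual_attention_mass(num_visual: int, num_text: int) -> float:
    """Share of softmax attention on visual tokens.

    Assumption (illustrative only): all logits are equal, so each token
    contributes exp(0) = 1 to the softmax denominator (the partition function).
    """
    logits = [0.0] * (num_visual + num_text)
    partition = sum(math.exp(z) for z in logits)            # grows with text length
    visual = sum(math.exp(z) for z in logits[:num_visual])  # fixed visual contribution
    return visual / partition


# The visual token count stays fixed while the generated text grows.
for num_text in (0, 64, 192, 960):
    print(num_text, visual_attention_mass(64, num_text))
```

Under this uniform-logit assumption the visual share falls as 64 / (64 + num_text), which is the inverse dependence on sequence length that the abstract describes. Real attention logits are not uniform, so this shows only the trend.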

Preprint, 2025, arXiv

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

Preprint, 2013, arXiv

The stability, electronic structure, and optical property of TiO2 polymorphs

A phonon density-of-states calculation shows that a new TiO2 polymorph with the tridymite structure is mechanically stable. Enthalpies of 9 TiO2 polymorphs under different pressures are presented to study the relative stability of the polymorphs. Band structures for the TiO2 polymorphs are calculated by density functional theory with the generalized gradient approximation, and the band energies at high-symmetry k-points are corrected using the GW method to accurately determine the band gap. The differences between direct and indirect band gap energies are very small for rutile, columbite and baddeleyite TiO2, indicating a quasi-direct band gap character. The band gap energies of baddeleyite (quasi-direct) and brookite (direct) TiO2 are close to that of anatase (indirect) TiO2. The band gap of the newly predicted tridymite-structured TiO2 is wider than those of the other 8 polymorphs. For optical response calculations, two-particle effects have been included by solving the Bethe-Salpeter equation for Coulomb-correlated electron-hole pairs. TiO2 polymorphs with cotunnite, pyrite, and fluorite structures have optical transitions in the visible light region.