Source author record

Hao Zhao

Hao Zhao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, quant-ph, cond-mat.mes-hall, Artificial Intelligence, cond-mat.soft, Multimedia, physics.ins-det

Catalog footprint

What is connected

11 works

7 topics

4 close collaborators


preprint · 2026 · arXiv

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
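The Stochastic Condition Masking step and the target-only gating of the per-modality LoRAs can be illustrated with a small sketch. This is not the authors' code: the function name, the independent coin flip with probability `p_target`, and the resampling of draws that contain no target are all assumptions made for illustration.

```python
import random


def stochastic_condition_mask(modalities, rng, p_target=0.5):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumptions (not stated in the abstract): each modality becomes a target
    independently with probability p_target, and a draw with no target is
    resampled because there would be nothing to generate.
    """
    while True:
        targets = [m for m in modalities if rng.random() < p_target]
        if targets:
            break
    conditions = [m for m in modalities if m not in targets]
    # Gated per-modality adapters: switched on only for generation targets.
    adapter_on = {m: (m in targets) for m in modalities}
    return conditions, targets, adapter_on


rng = random.Random(0)
modalities = ["rgb", "albedo", "irradiance", "normal"]
for step in range(3):
    conditions, targets, adapter_on = stochastic_condition_mask(modalities, rng)
    print(step, "conditions:", conditions, "targets:", targets)
```

Because the split is redrawn at every training step, a single model sees many different condition-to-target mappings rather than one fixed direction.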

preprint · 2022 · arXiv

Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing

Multi-task indoor scene understanding is widely considered an intriguing formulation, as the affinity of different tasks may lead to improved performance. In this paper, we tackle the new problem of joint semantic, affordance and attribute parsing. However, successfully resolving it requires a model to capture long-range dependency, learn from weakly aligned data and properly balance sub-tasks during training. To this end, we propose an attention-based architecture named Cerberus and a tailored training framework. Our method effectively addresses the aforementioned challenges and achieves state-of-the-art performance on all three tasks. Moreover, an in-depth analysis shows concept affinity consistent with human cognition, which inspires us to explore the possibility of weakly supervised learning. Surprisingly, Cerberus achieves strong results using only 0.1%-1% annotation. Visualizations further confirm that this success can be attributed to common attention maps across tasks. Code and models can be accessed at https://github.com/OPEN-AIR-SUN/Cerberus.

preprint · 2022 · arXiv

Dissolved gas monitoring probe without liquid-gas separation under strong electromagnetic interference

Rapid and direct monitoring of dissolved gases in liquids under strong electromagnetic interference is very important. Electronic gas sensors placed in liquids struggle to work reliably under strong electromagnetic fields. Existing optical monitoring techniques for dissolved gases all require gas-liquid separation and are conducted in the gas phase, which makes them slow to respond. In this paper, a dissolved gas monitoring probe that works without liquid-gas separation under strong electromagnetic interference is proposed. Taking the monitoring of acetylene dissolved in transformer oil as an example, we propose an oil-core photonic crystal fiber photothermal interferometry probe and demonstrate the feasibility of directly monitoring trace oil-dissolved acetylene without oil-gas separation. The minimum detection limit reaches 1.4 ppm, and the response time is about 70 minutes. Owing to its good insulation performance and compact size, the all-fiber probe can be placed in transformer oil to perform on-line monitoring under strong electromagnetic interference.

preprint · 2022 · arXiv

Distance-Aware Occlusion Detection with Focused Attention

For humans, understanding the relationships between objects using visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress studying semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further, from semantic to geometric. In particular, we predict relative occlusion and relative distance relationships. However, detecting these relationships from a single image is challenging. Enforcing focused attention to task-specific regions plays a critical role in successfully detecting these relationships. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; (3) our model achieves a new state-of-the-art performance on distance-aware relationship detection. Specifically, our model increases the distance F1-score from 33.8% to 38.6% and boosts the occlusion F1-score from 34.4% to 41.2%. Our code is publicly available.

preprint · 2022 · arXiv

Language-guided Semantic Style Transfer of 3D Indoor Scenes

We address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. Firstly, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Secondly, colored 3D meshes are differentiably rendered into 2D images, via a viewpoint sampling strategy tailored for indoor scenes. Thirdly, rendered 2D images are compared to phrases, via pre-trained vision-language models. Lastly, errors are back-propagated to the multi-layer perceptron to update vertex colors corresponding to certain semantic categories. We conducted large-scale qualitative analyses and A/B user tests with the public ScanNet and SceneNN datasets. We demonstrate that: (1) the results are visually pleasing and potentially useful for multimedia applications; (2) rendering 3D indoor scenes from viewpoints consistent with human priors is important; (3) incorporating semantics significantly improves style transfer quality; (4) an HSV regularization term leads to results that are more consistent with inputs and generally rated better. Code and the user study toolbox are available at https://github.com/AIR-DISCOVER/LASST

preprint · 2022 · arXiv

PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds

3D scene understanding from point clouds plays a vital role in various robotic applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks like object detection or room layout estimation. Such a scheme has two limitations: 1) Storing and running several networks for different tasks is expensive for typical robotic platforms. 2) The intrinsic structure of separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that estimate either layout keypoints or edges, we directly parameterize room layout as a set of quads. As such, the proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical constraint loss function that discourages object-layout interference. The quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical constraint loss can improve strong baselines, and the F1-score of the room layout is significantly improved from 37.9% to 57.9%.
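To make the idea of a quad layout and an object-layout interference penalty concrete, here is a minimal sketch. It is an illustration only and not the loss defined in the paper: the `Quad` fields, the axis-aligned object box, and the penalty (summed penetration depth of box corners behind the wall plane) are all hypothetical choices.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Quad:
    """Hypothetical wall quad: center, unit normal pointing into the room, size."""
    center: tuple
    normal: tuple
    width: float
    height: float


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned box given its center and size."""
    return [
        tuple(c + sign * s / 2.0 for c, sign, s in zip(center, signs, size))
        for signs in product((-1.0, 1.0), repeat=3)
    ]


def signed_distance(quad, point):
    """Distance from the wall plane, positive on the room side."""
    return sum(n * (p - c) for n, p, c in zip(quad.normal, point, quad.center))


def interference_penalty(quad, box_center, box_size):
    """Sum of corner penetration depths behind the wall (0 if no interference).

    Simplification: the quad is treated as an infinite plane, so width and
    height are ignored here.
    """
    return sum(
        max(0.0, -signed_distance(quad, corner))
        for corner in box_corners(box_center, box_size)
    )


wall = Quad(center=(0.0, 0.0, 0.0), normal=(1.0, 0.0, 0.0), width=4.0, height=2.5)
print(interference_penalty(wall, (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)))   # box inside the room
print(interference_penalty(wall, (0.25, 0.0, 0.0), (1.0, 1.0, 1.0)))  # box pokes through the wall
```

A penalty of this kind is zero for physically plausible configurations and grows with how far an object crosses a wall, which is the behaviour a constraint that "discourages object-layout interference" needs.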

preprint · 2021 · arXiv

Quantum mechanics of fermion confined to a curved surface in Foldy-Wouthuysen representation

In the Foldy-Wouthuysen representation, we deduce the effective quantum mechanics for a particle confined to a curved surface by using the thin-layer quantization scheme. We find that a spin effect caused by the confining potential appears as a result of the relativistic correction in the non-relativistic limit. Furthermore, the spin connection that appears on the curved surface depends on curvature and contributes a Zeeman-like gap in the relativistic correction term. In addition, the confining potential also induces a curvature-independent energy shift, which originates from the zitterbewegung effect. As an example, we apply the effective Hamiltonian to a torus surface, for which we obtain, as expected, the spin effects related to the confining potential. These results directly demonstrate the scaling of the non-commutativity of the non-relativistic limit and the thin-layer quantization formalism.

preprint · 2020 · arXiv

Constrained R-CNN: A general image manipulation detection model

Recently, deep learning-based models have exhibited remarkable performance for image manipulation detection. However, most of them suffer from poor universality of handcrafted or predetermined features. Meanwhile, they only focus on manipulation localization and overlook manipulation classification. To address these issues, we propose a coarse-to-fine architecture named Constrained R-CNN for complete and accurate image forensics. First, the learnable manipulation feature extractor learns a unified feature representation directly from data. Second, the attention region proposal network effectively discriminates manipulated regions for the subsequent manipulation classification and coarse localization. Then, the skip structure fuses low-level and high-level information to refine the global manipulation features. Finally, the coarse localization information guides the model to further learn the finer local features and segment out the tampered region. Experimental results show that our model achieves state-of-the-art performance. In particular, the F1 score is increased by 28.4%, 73.2%, and 13.3% on the NIST16, COVERAGE, and Columbia datasets, respectively.

preprint · 2020 · arXiv

Effective dynamics for a spin-1/2 particle constrained to a space curve in an electric and magnetic field

We consider the dynamics of a spin-1/2 particle constrained to move along an arbitrary space curve with an external electric and magnetic field applied. With the aid of gauge theory, we successfully decouple the tangential and normal dynamics and derive the effective Hamiltonian. A new type of quantum potential called SU(2) Zeeman interaction appears, which is induced by the electric field and couples spin and intrinsic orbital angular momentum. Based on the Hamiltonian, we discuss, as examples, the spin precession for the case of zero intrinsic orbital angular momentum and the energy splitting caused by the SU(2) Zeeman interaction for a helix, showing the combined effect of geometry and external field. The new interaction may bring new approaches to manipulating quantum states in spintronics.

preprint · 2020 · arXiv

Spin polarization of electrons through corrugated surface in magnetic field

Noninteracting electrons confined to a corrugated surface are investigated in a magnetic field, and the associated effective Pauli equation is given analytically by the thin-layer quantization scheme. Interestingly, the Zeeman splitting gaps can be adjusted by curvature, and there is a geometric potential induced by curvature. Further, we discuss the spin-dependent transport properties of the confined electrons by numerical calculation. More interestingly, we find that the spin polarization induced by curvature becomes substantial when the incident energy is small. The results are worth considering for a spin transistor with a small spin current.
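For orientation, the curvature-induced geometric potential that thin-layer quantization yields in the simplest setting (a spinless particle of mass $m$ on a surface, with no applied field) has the well-known form below. This is background, not the paper's effective Pauli equation, which additionally carries the magnetic-field and spin terms; the symbols $\kappa_1$, $\kappa_2$ (principal curvatures), $M$ (mean curvature) and $K$ (Gaussian curvature) are introduced here and do not appear in the abstract.

$$
V_g = -\frac{\hbar^2}{2m}\left(M^2 - K\right) = -\frac{\hbar^2}{8m}\left(\kappa_1 - \kappa_2\right)^2,
\qquad M = \tfrac{1}{2}\left(\kappa_1 + \kappa_2\right), \quad K = \kappa_1 \kappa_2 .
$$

The potential is never positive and vanishes only where the two principal curvatures coincide, so any corrugation along one direction produces an attractive contribution.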

preprint · 2015 · arXiv

Hard sphere packings within cylinders

The packing of hard spheres (HS) of diameter $\sigma$ in a cylinder has been used to model experimental systems, such as fullerenes in nanotubes and colloidal wire assembly. Finding the densest packings of HS under this type of confinement, however, grows increasingly complex with the cylinder diameter, $D$. Little is thus known about the densest achievable packings for $D>2.873\sigma$. In this work, we extend the identification of the packings up to $D=4.00\sigma$ by adapting Torquato-Jiao's adaptive-shrinking-cell formulation and sequential-linear-programming (SLP) technique. We identify 17 new structures, almost all of them chiral. Beyond $D\approx2.85\sigma$, most of the structures consist of an outer shell and an inner core that compete to be close packed. In some cases, the shell adopts its own maximum density configuration, and the stacking of core spheres within it is quasiperiodic. In other cases, an interplay between the two components is observed, which may result in simple periodic structures. In yet other cases, the very distinction between core and shell vanishes, resulting in more exotic packing geometries, including some that are three-dimensional extensions of structures obtained from packing hard disks in a circle.
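As a small worked example of the quantity being maximized, the sketch below computes the packing fraction of a periodic arrangement of spheres in a cylinder and applies it to two elementary structures at small diameters (a single file and a planar zigzag). It does not reproduce the paper's adaptive-shrinking-cell or SLP search; the function names are hypothetical, and the zigzag geometry is standard geometry assumed here rather than taken from the abstract.

```python
import math


def packing_fraction(n_spheres, period, D, sigma=1.0):
    """Volume fraction of n_spheres per axial period in a cylinder of diameter D."""
    sphere_volume = math.pi * sigma ** 3 / 6.0
    cell_volume = math.pi * (D / 2.0) ** 2 * period
    return n_spheres * sphere_volume / cell_volume


def zigzag_spacing(D, sigma=1.0):
    """Axial spacing between touching neighbours in a planar zigzag.

    Sphere centres alternate between opposite walls, a lateral distance
    D - sigma apart. The structure is valid only while same-side neighbours
    do not overlap, i.e. for sigma <= D <= (1 + sqrt(3)/2) * sigma.
    """
    offset = D - sigma
    if not 0.0 <= offset <= math.sqrt(3.0) / 2.0 * sigma:
        raise ValueError("zigzag structure is not valid for this diameter")
    return math.sqrt(sigma ** 2 - offset ** 2)


# Single file of touching spheres at D = sigma: analytically 2/3.
print(packing_fraction(1, 1.0, 1.0))

# Zigzag at D = 1.5 sigma: one sphere per axial spacing.
print(packing_fraction(1, zigzag_spacing(1.5), 1.5))
```

For larger diameters no such closed-form structure is available, which is why a numerical optimization over unit cells is needed.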
