Researcher profile

Hao Zhao

Hao Zhao contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) | Verification L1 | Unclaimed author
10 works
0 followers
6 topics
4 close collaborators

Published work

10 published items

preprint | 2026 | arXiv

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
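The Stochastic Condition Masking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rule that at least one modality is always a target, and the fixed-sigma Gaussian noise (a stand-in for a diffusion noise schedule) are all assumptions made for illustration.

```python
import random


def split_conditions_and_targets(modalities, rng):
    """Randomly partition modality names into clean conditions and noisy targets.

    Assumption: at least one modality is always a target, so there is
    something to generate. The abstract does not state this rule.
    """
    names = list(modalities)
    num_targets = rng.randint(1, len(names))  # inclusive on both ends
    target_set = set(rng.sample(names, num_targets))
    conditions = [name for name in names if name not in target_set]
    targets = [name for name in names if name in target_set]
    return conditions, targets


def build_training_input(sample, sigma, rng):
    """Keep condition modalities clean and add Gaussian noise to target modalities.

    `sample` maps a modality name to a flat list of floats (a toy stand-in
    for video latents). Fixed-sigma noise is a hypothetical simplification.
    """
    conditions, targets = split_conditions_and_targets(sample, rng)
    model_input = {}
    for name in conditions:
        model_input[name] = list(sample[name])
    for name in targets:
        model_input[name] = [v + rng.gauss(0.0, sigma) for v in sample[name]]
    return model_input, conditions, targets


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {
        "rgb": [0.1, 0.2],
        "albedo": [0.3, 0.4],
        "irradiance": [0.5, 0.6],
        "normal": [0.7, 0.8],
    }
    for step in range(3):
        _, conditions, targets = build_training_input(sample, 0.5, rng)
        print(step, "conditions:", conditions, "targets:", targets)
```

Because the partition is redrawn at every training step, a single model sees many condition-to-target directions rather than one fixed input-output mapping, which is the property the abstract attributes to SCM.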

preprint | 2022 | arXiv

Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing

Multi-task indoor scene understanding is widely considered an intriguing formulation, as the affinity of different tasks may lead to improved performance. In this paper, we tackle the new problem of joint semantic, affordance and attribute parsing. However, successfully resolving it requires a model to capture long-range dependency, learn from weakly aligned data and properly balance sub-tasks during training. To this end, we propose an attention-based architecture named Cerberus and a tailored training framework. Our method effectively addresses the aforementioned challenges and achieves state-of-the-art performance on all three tasks. Moreover, an in-depth analysis shows concept affinity consistent with human cognition, which inspires us to explore the possibility of weakly supervised learning. Surprisingly, Cerberus achieves strong results using only 0.1%-1% of the annotations. Visualizations further confirm that this success can be credited to common attention maps across tasks. Code and models can be accessed at https://github.com/OPEN-AIR-SUN/Cerberus.

preprint | 2022 | arXiv

Dissolved gas monitoring probe without liquid-gas separation under strong electromagnetic interference

Rapid and direct monitoring of dissolved gases in liquids under strong electromagnetic interference is very important. Electronic gas sensors that can be placed into liquids struggle to work reliably under strong electromagnetic fields. Existing optical monitoring techniques for dissolved gases all require gas-liquid separation and are conducted in the gas phase, which results in poor timeliness. In this paper, a dissolved gas monitoring probe without liquid-gas separation under strong electromagnetic interference is proposed. Taking the monitoring of acetylene dissolved in transformer oil as an example, we propose an oil-core photonic crystal fiber photothermal interferometry probe and demonstrate the feasibility of directly monitoring trace oil-dissolved acetylene without oil-gas separation. The minimum detection limit reaches 1.4 ppm, and the response time is about 70 minutes. Owing to its good insulation performance and compact size, the all-fiber probe can be placed into transformer oil to perform on-line monitoring under strong electromagnetic interference.

preprint | 2022 | arXiv

Distance-Aware Occlusion Detection with Focused Attention

For humans, understanding the relationships between objects using visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress studying semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further from semantic to geometric. Specifically, we predict relative occlusion and relative distance relationships. However, detecting these relationships from a single image is challenging. Enforcing focused attention to task-specific regions plays a critical role in successfully detecting these relationships. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; (3) our model achieves a new state-of-the-art performance on distance-aware relationship detection. Specifically, our model increases the distance F1-score from 33.8% to 38.6% and boosts the occlusion F1-score from 34.4% to 41.2%. Our code is publicly available.

preprint | 2022 | arXiv

Language-guided Semantic Style Transfer of 3D Indoor Scenes

We address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. Firstly, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Secondly, colored 3D meshes are differentiably rendered into 2D images, via a viewpoint sampling strategy tailored for indoor scenes. Thirdly, rendered 2D images are compared to phrases, via pre-trained vision-language models. Lastly, errors are back-propagated to the multi-layer perceptron to update vertex colors corresponding to certain semantic categories. We conducted large-scale qualitative analyses and A/B user tests with the public ScanNet and SceneNN datasets. We demonstrate that: (1) the results are visually pleasing and potentially useful for multimedia applications; (2) rendering 3D indoor scenes from viewpoints consistent with human priors is important; (3) incorporating semantics significantly improves style transfer quality; (4) an HSV regularization term leads to results that are more consistent with inputs and generally rated better. Code and the user study toolbox are available at https://github.com/AIR-DISCOVER/LASST

preprint | 2022 | arXiv

PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds

3D scene understanding from point clouds plays a vital role in various robotic applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks like object detection or room layout estimation. Such a scheme has two limitations: 1) Storing and running several networks for different tasks is expensive for typical robotic platforms. 2) The intrinsic structure of separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that estimate either layout keypoints or edges, we directly parameterize room layout as a set of quads. As such, the proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical constraint loss function that discourages object-layout interference. The quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical constraint loss can improve strong baselines, and the F1-score of the room layout is significantly improved from 37.9% to 57.9%.

preprint | 2021 | arXiv

Quantum mechanics of fermion confined to a curved surface in Foldy-Wouthuysen representation

In the Foldy-Wouthuysen representation, we deduce the effective quantum mechanics for a particle confined to a curved surface by using the thin-layer quantization scheme. We find that the spin effect caused by the confining potential arises as a result of the relativistic correction in the non-relativistic limit. Furthermore, the spin connection on the curved surface, which depends on curvature, contributes a Zeeman-like gap in the relativistic correction term. In addition, the confining potential also induces a curvature-independent energy shift, which originates from the zitterbewegung effect. As an example, we apply the effective Hamiltonian to a torus surface, on which we obtain, as expected, the spin effects related to the confining potential. These results directly demonstrate the scaling of the noncommutativity of the non-relativistic limit and the thin-layer quantization formalism.

preprint | 2020 | arXiv

Constrained R-CNN: A general image manipulation detection model

Recently, deep learning-based models have exhibited remarkable performance for image manipulation detection. However, most of them suffer from poor universality of handcrafted or predetermined features. Meanwhile, they only focus on manipulation localization and overlook manipulation classification. To address these issues, we propose a coarse-to-fine architecture named Constrained R-CNN for complete and accurate image forensics. First, the learnable manipulation feature extractor learns a unified feature representation directly from data. Second, the attention region proposal network effectively discriminates manipulated regions for the subsequent manipulation classification and coarse localization. Then, the skip structure fuses low-level and high-level information to refine the global manipulation features. Finally, the coarse localization information guides the model to further learn the finer local features and segment out the tampered region. Experimental results show that our model achieves state-of-the-art performance. In particular, the F1 score increases by 28.4%, 73.2%, and 13.3% on the NIST16, COVERAGE, and Columbia datasets, respectively.

preprint | 2020 | arXiv

Effective dynamics for a spin-1/2 particle constrained to a space curve in an electric and magnetic field

We consider the dynamics of a spin-1/2 particle constrained to move along an arbitrary space curve with external electric and magnetic fields applied. With the aid of gauge theory, we successfully decouple the tangential and normal dynamics and derive the effective Hamiltonian. A new type of quantum potential called SU(2) Zeeman interaction appears, which is induced by the electric field and couples spin and intrinsic orbital angular momentum. Based on the Hamiltonian, we discuss, as examples, the spin precession for the case of zero intrinsic orbital angular momentum and the energy splitting caused by the SU(2) Zeeman interaction for a helix, showing the combined effect of geometry and external field. The new interaction may bring new approaches to manipulating quantum states in spintronics.

preprint | 2020 | arXiv

Spin polarization of electrons through corrugated surface in magnetic field

Noninteracting electrons confined to a corrugated surface are investigated in a magnetic field, and the associated effective Pauli equation is given analytically by the thin-layer quantization scheme. Interestingly, the Zeeman splitting gaps can be adjusted by curvature, and there is a geometric potential induced by curvature. Further, we discuss the spin-dependent transport properties for confined electrons by numerical calculation. More interestingly, we find that the spin polarization induced by curvature becomes substantial when the incident energy is small. The results are relevant to spin transistors with small spin current.