Researcher profile

Xuan Cheng

Xuan Cheng contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
10 works
0 followers
4 topics
4 close collaborators

Published work

10 published items

preprint | 2026 | arXiv

FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer

Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Unlike current style transfer methods, which transfer only high- and middle-level information to the result, our style transfer method integrates differentiable rendering to also transfer low-level (pixel-level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input are well preserved. Extensive experiments on the Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method improves texture quality and face identity preservation compared with state-of-the-art methods.

preprint | 2026 | arXiv

GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting

We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of high-fidelity avatars with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To preserve identity, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity and temporal consistency, while enabling previously unattainable interactive applications.
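One way to picture a compound identity embedding is sketched below. The abstract does not say how the three recognition models' outputs are combined, so the rule used here (L2-normalize each embedding, then concatenate) and the cosine-based loss are assumptions for illustration only; the function names are hypothetical and the embeddings are toy vectors, not outputs of real models.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def compound_embedding(per_model_embeddings):
    """Assumed combination rule: normalize each model's embedding, then concatenate."""
    combined = []
    for embedding in per_model_embeddings:
        combined.extend(l2_normalize(embedding))
    return combined

def identity_loss(source_embeddings, rendered_embeddings):
    """Hypothetical loss: 1 - cosine similarity between the compound embeddings."""
    a = compound_embedding(source_embeddings)
    b = compound_embedding(rendered_embeddings)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy embeddings standing in for three face recognition models.
source = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
same_identity = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
other_identity = [[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]

loss_same = identity_loss(source, same_identity)
loss_other = identity_loss(source, other_identity)
```

Normalizing before concatenation keeps any single model from dominating the similarity because of its embedding scale, which is the motivation for this particular assumed rule.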

preprint | 2026 | arXiv

TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation

Although significant progress has been made in the field of speech-driven 3D facial animation recently, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked by recent research. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, making it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method that can generate diverse 3D eye gaze motions in harmony with the speech. To achieve this, we first construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which the head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of the eyeballs is smaller than that of the head. By mapping the speech embedding into the two latent spaces, the difficulty of modeling the weak correlation between speech and non-verbal motion is reduced. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/

preprint | 2022 | arXiv

Channel Self-Supervision for Online Knowledge Distillation

Recently, researchers have shown an increased interest in online knowledge distillation. Adopting a one-stage, end-to-end training fashion, online knowledge distillation uses aggregated intermediate predictions of multiple peer models for training. However, the absence of a powerful teacher model may result in a homogeneity problem among group peers, adversely affecting the effectiveness of group distillation. In this paper, we propose a novel online knowledge distillation method, Channel Self-Supervision for Online Knowledge Distillation (CSS), which structures diversity in terms of input, target, and network to alleviate the homogenization problem. Specifically, we construct a dual-network multi-branch structure and enhance inter-branch diversity through self-supervised learning, adopting feature-level transformation and augmenting the corresponding labels. Meanwhile, the dual-network structure has a larger space of independent parameters to resist the homogenization problem during distillation. Extensive quantitative experiments on CIFAR-100 illustrate that our method provides greater diversity than OKDDip and also yields a notable performance improvement, even over state-of-the-art methods such as PCL. The results on three fine-grained datasets (StanfordDogs, StanfordCars, CUB-200-2011) also show the significant generalization capability of our approach.
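The general idea of online distillation from aggregated peer predictions can be sketched as follows. This is a generic illustration, not the CSS method itself: averaging the peers' softened predictions, the KL-divergence loss, the temperature value and the function names are all assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def peer_distillation_losses(peer_logits, temperature=3.0):
    """KL(ensemble || peer) for each peer; the ensemble is the mean of the
    peers' softened predictions (an assumed aggregation rule)."""
    peer_probs = [softmax(logits, temperature) for logits in peer_logits]
    num_peers = len(peer_probs)
    num_classes = len(peer_probs[0])
    ensemble = [sum(p[c] for p in peer_probs) / num_peers for c in range(num_classes)]
    return [
        sum(t * math.log(t / q) for t, q in zip(ensemble, probs))
        for probs in peer_probs
    ]

# Two peers that agree exactly, and two peers that disagree.
identical = peer_distillation_losses([[2.0, 0.5, 0.1], [2.0, 0.5, 0.1]])
diverse = peer_distillation_losses([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
```

In this sketch the losses vanish when the peers produce identical predictions, so the peers no longer teach each other anything. That is one way to see why homogenization among peers undermines group distillation.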

preprint | 2022 | arXiv

I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation

In this paper, we present the Intra- and Inter-Human Relation Networks (I^2R-Net) for Multi-Person Pose Estimation. It involves two basic modules. First, the Intra-Human Relation Module operates on a single person and aims to capture Intra-Human dependencies. Second, the Inter-Human Relation Module considers the relation between multiple instances and focuses on capturing Inter-Human interactions. The Inter-Human Relation Module can be made very lightweight by reducing the resolution of the feature map, yet it still learns useful relation information that significantly boosts the performance of the Intra-Human Relation Module. Even without bells and whistles, our method can compete with or outperform current competition winners. We conduct extensive experiments on the COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all the state-of-the-art methods. Concretely, the proposed method achieves 77.4% AP on the CrowdPose dataset and 67.8% AP on the OCHuman dataset, outperforming existing methods by a large margin. Additionally, the ablation study and visualization analysis also demonstrate the effectiveness of our model.

preprint | 2022 | arXiv

Selective Output Smoothing Regularization: Regularize Neural Networks by Softening Output Distributions

In this paper, we propose Selective Output Smoothing Regularization (SOSR), a novel regularization method for training Convolutional Neural Networks (CNNs). Inspired by the diverse effects that different samples have on training, Selective Output Smoothing Regularization improves performance by encouraging the model to produce equal logits on incorrect classes when dealing with samples that the model classifies correctly and over-confidently. This plug-and-play regularization method can be conveniently incorporated into almost any CNN-based project without extra hassle. Extensive experiments have shown that Selective Output Smoothing Regularization consistently achieves significant improvement on image classification benchmarks such as CIFAR-100, Tiny ImageNet, ImageNet, and CUB-200-2011. In particular, our method obtains 77.30% accuracy on ImageNet with ResNet-50, a gain of 1.1% over the baseline (76.2%). We also empirically demonstrate the ability of our method to make further improvements when combined with other widely used regularization techniques. On Pascal detection, using the SOSR-trained ImageNet classifier as the pretrained model leads to better detection performance.
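The selection rule described above (regularize only samples that are classified correctly and over-confidently, and push their incorrect-class logits toward equality) can be sketched as follows. The abstract does not give the loss itself, so the variance penalty, the confidence threshold of 0.99 and the function names are illustrative assumptions rather than the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def selective_smoothing_penalty(logits, label, threshold=0.99):
    """Hypothetical penalty: variance of the incorrect-class logits, applied
    only when the sample is classified correctly with confidence >= threshold."""
    probs = softmax(logits)
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    if predicted != label or probs[label] < threshold:
        return 0.0  # sample not selected, so no regularization is applied
    wrong = [z for i, z in enumerate(logits) if i != label]
    mean = sum(wrong) / len(wrong)
    return sum((z - mean) ** 2 for z in wrong) / len(wrong)

# Over-confident and correct: selected, unequal wrong-class logits are penalized.
confident = selective_smoothing_penalty([12.0, 1.0, -1.0], label=0)
# Correct but not over-confident: not selected.
uncertain = selective_smoothing_penalty([1.2, 1.0, -1.0], label=0)
# Over-confident with already equal wrong-class logits: nothing to penalize.
already_equal = selective_smoothing_penalty([12.0, 0.5, 0.5], label=0)
```

In practice such a term would be added to the usual classification loss with a weighting coefficient; the weighting is left out here to keep the sketch minimal.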

preprint | 2022 | arXiv

Temporally Resolution Decrement: Utilizing the Shape Consistency for Higher Computational Efficiency

Image resolution, which is closely related to accuracy and computational cost, plays a pivotal role in network training. In this paper, we observe that a reduced-resolution image retains relatively complete shape semantics but loses extensive texture information. Inspired by the consistency of the shape semantics and the fragility of the texture information, we propose a novel training strategy named Temporally Resolution Decrement, in which we randomly reduce the training images to a smaller resolution in the time domain. During alternate training with the reduced and original images, the unstable texture information results in a weaker correlation between texture-related patterns and the correct label, naturally pushing the model to rely more on shape properties, which are robust and conform to the human decision rule. Surprisingly, our approach greatly improves both the training and inference efficiency of convolutional neural networks. On ImageNet classification, using only 33% of the computation (randomly reducing the training image to 112×112 in 90% of the epochs) still improves ResNet-50 from 76.32% to 77.71%. Combined with the strong training procedure of ResNet-50 on ImageNet, our method achieves 80.42% top-1 accuracy while saving 37.5% of the computational overhead. To the best of our knowledge, this is the highest ImageNet single-crop accuracy on ResNet-50 under 224×224 without extra data or distillation.
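The 33% figure quoted above is consistent with a back-of-the-envelope estimate. The sketch below assumes that convolutional cost scales linearly with the number of input pixels, and that 90% of training runs at 112×112 with the rest at 224×224. Both are assumptions; the abstract does not spell out how the computation was counted, and the function name is hypothetical.

```python
def relative_training_cost(low_res, full_res, low_res_fraction):
    """Training cost relative to always using full_res.

    Assumes cost is proportional to pixel count, so one low-resolution
    image costs (low_res / full_res) ** 2 of a full-resolution image.
    """
    per_image_ratio = (low_res / full_res) ** 2
    return low_res_fraction * per_image_ratio + (1.0 - low_res_fraction)

# 90% of training at 112x112, 10% at 224x224.
cost = relative_training_cost(low_res=112, full_res=224, low_res_fraction=0.9)
print(f"{cost:.3f}")  # prints 0.325
```

Under these assumptions the estimate is 0.9 × 0.25 + 0.1 = 0.325, or roughly one third of the full-resolution cost.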

preprint | 2022 | arXiv

White Paper Assistance: A Step Forward Beyond the Shortcut Learning

The promising performance of CNNs often overshadows the need to examine whether they behave in the way we actually intend. We show through experiments that even over-parameterized models still solve a dataset by recklessly leveraging spurious correlations, or so-called 'shortcuts'. To combat this unintended propensity, we borrow the idea of a printer test page and propose a novel approach called White Paper Assistance. Our method uses a white paper to detect the extent to which the model prefers certain characteristic patterns, and alleviates that preference by forcing the model to make a random guess on the white paper. We show consistent accuracy improvements across various architectures, datasets and combinations with other techniques. Experiments also demonstrate the versatility of our approach on fine-grained recognition, imbalanced classification and robustness to corruptions.
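"Forcing the model to make a random guess on the white paper" can be read as pulling the prediction on a blank input toward the uniform distribution. The sketch below takes that reading; the KL-divergence form, the function names and the toy logits are assumptions, since the abstract does not state the actual objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def white_paper_loss(logits_on_white):
    """Hypothetical objective: KL(uniform || prediction) on an all-white input.

    It is zero when the model guesses uniformly and grows as the model
    shows a preference for particular classes on the blank image.
    """
    probs = softmax(logits_on_white)
    uniform = 1.0 / len(probs)
    return sum(uniform * math.log(uniform / p) for p in probs)

# Logits a model might emit for an all-white image (toy values).
unbiased = white_paper_loss([0.0, 0.0, 0.0, 0.0])
biased = white_paper_loss([5.0, 0.0, 0.0, 0.0])
```

The same quantity doubles as a diagnostic: a large value on a blank input suggests the model has a built-in preference for some classes independent of image content.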

preprint | 2021 | arXiv

Predicting nanocrystal morphology governed by interfacial strain

The shape dependence for the technologically important nickel oxide (NiO) nanocrystals on (001) strontium titanate substrates is investigated under the generalized Wulff-Kaichew (GWK) theorem framework. It is found that the shape of the NiO nanocrystals is primarily governed by the existence (or absence) of interfacial strain. Nanocrystals that have a fully pseudomorphic interface with the substrate (i.e. the epitaxial strain is not relaxed) form an embedded smooth ball-crown morphology with {001}, {011}, {111} and high-index {113} exposed facets with a negative Wulff point. On the other hand, when the interfacial strain is relaxed by misfit dislocations, the nanocrystals take on a truncated pyramidal shape, bounded by {111} faces and a {001} flat top, with a positive Wulff point. Our quantitative model is able to predict both experimentally observed shapes and sizes with good accuracy. Given the increasing demand for hetero-epitaxial nanocrystals in various physio-chemical and electro-chemical functional devices, these results lay important groundwork for exploiting the GWK theorem as a general analytical approach to explain hetero-epitaxial nanocrystal growth on oxide substrates governed by interface strain.

preprint | 2021 | arXiv

Super-R BiFeO$_3$: Epitaxial stabilization of a low-symmetry phase with giant electromechanical response

Piezoelectrics interconvert mechanical energy and electric charge and are widely used in actuators and sensors. The best performing materials are ferroelectrics at a morphotropic phase boundary (MPB), where several phases can intimately coexist. Switching between these phases by electric field produces a large electromechanical response. In the ferroelectric BiFeO$_3$, strain can be used to create an MPB-like phase mixture and thus to generate large electric field dependent strains. However, this enhanced response occurs at localized, randomly positioned regions of the film, which potentially complicates nanodevice design. Here, we use epitaxial strain and orientation engineering in tandem - anisotropic epitaxy - to craft a hitherto unavailable low-symmetry phase of BiFeO$_3$ which acts as a structural bridge between the rhombohedral-like and tetragonal-like polymorphs. Interferometric displacement sensor measurements and first-principles calculations reveal that under external electric bias, this phase undergoes a transition to the tetragonal-like polymorph, generating a piezoelectric response enhanced by over 200% and an associated giant field-induced reversible strain. These results offer a new route to engineer giant electromechanical properties in thin films, with broader perspectives for other functional oxide systems.