Researcher profile

Jie Guo

Jie Guo contributes to research discovery and scholarly infrastructure.

Researcher, Affiliation not imported, Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging, Verification L1, Unclaimed author
12 works
0 followers
9 topics
4 close collaborators

Published work

12 published items

preprint, 2026, arXiv

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

preprint, 2022, arXiv

Completing Partial Point Clouds with Outliers by Collaborative Completion and Segmentation

Most existing point cloud completion methods are only applicable to partial point clouds without any noise or outliers, an assumption that does not always hold in practice. We propose in this paper an end-to-end network, named CS-Net, to complete point clouds contaminated by noise or containing outliers. In our CS-Net, the completion and segmentation modules work collaboratively to promote each other, benefiting from our specifically designed cascaded structure. With the help of segmentation, a cleaner point cloud is fed into the completion module. We design a novel completion decoder which harnesses the labels obtained by segmentation together with FPS to purify the point cloud and leverages KNN-grouping for better generation. The completion and segmentation modules work alternately, sharing useful information with each other to gradually improve the quality of prediction. To train our network, we build a dataset to simulate the real case where incomplete point clouds contain outliers. Our comprehensive experiments and comparisons against state-of-the-art completion methods demonstrate the superiority of our method. We also compare with the scheme of segmentation followed by completion and with their end-to-end fusion, which further demonstrates the efficacy of our approach.

preprint, 2022, arXiv

Deep Graph Learning for Spatially-Varying Indoor Lighting Prediction

Lighting prediction from a single image is becoming increasingly important in many vision and augmented reality (AR) applications in which shading and shadow consistency between virtual and real objects should be guaranteed. However, this is a notoriously ill-posed problem, especially for indoor scenarios, because of the complexity of indoor luminaires and the limited information involved in 2D images. In this paper, we propose a graph learning-based framework for indoor lighting estimation. At its core is a new lighting model (dubbed DSGLight) based on depth-augmented Spherical Gaussians (SG) and a Graph Convolutional Network (GCN) that infers the new lighting representation from a single LDR image of limited field-of-view. Our lighting model builds 128 evenly distributed SGs over the indoor panorama, with each SG encoding the lighting and the depth around that node. The proposed GCN then learns the mapping from the input image to DSGLight. Compared with existing lighting models, our DSGLight encodes both direct lighting and indirect environmental lighting more faithfully and compactly. It also makes network training and inference more stable. The estimated depth distribution enables temporally stable shading and shadows under spatially-varying lighting. Through thorough experiments, we show that our method clearly outperforms existing methods both qualitatively and quantitatively.

preprint, 2022, arXiv

Deep Point Cloud Simplification for High-quality Surface Reconstruction

The growing size of point clouds increases the storage, transmission, and computation costs of 3D scenes. Raw data is redundant, noisy, and non-uniform. Therefore, simplifying point clouds to obtain compact, clean, and uniform points is becoming increasingly important for 3D vision and graphics tasks. Previous learning-based methods aim to generate fewer points for scene understanding, regardless of the quality of surface reconstruction, leading to results with low reconstruction accuracy and poor point distribution. In this paper, we propose a novel point cloud simplification network (PCS-Net) dedicated to high-quality surface mesh reconstruction while maintaining geometric fidelity. We first learn a sampling matrix in a feature-aware simplification module to reduce the number of points. Then we propose a novel double-scale resampling module to refine the positions of the sampled points, to achieve a uniform distribution. To further retain important shape features, an adaptive sampling strategy with a novel saliency loss is designed. With our PCS-Net, the input non-uniform and noisy point cloud can be simplified in a feature-aware manner, i.e., points near salient features are consolidated but still with uniform distribution locally. Experiments demonstrate the effectiveness of our method and show that we outperform previous simplification or reconstruction-oriented upsampling methods.

preprint, 2022, arXiv

GLPanoDepth: Global-to-Local Panoramic Depth Estimation

In this paper, we propose a learning-based method for predicting dense depth values of a scene from a monocular omnidirectional image. An omnidirectional image has a full field-of-view, providing much more complete descriptions of the scene than perspective images. However, fully-convolutional networks that most current solutions rely on fail to capture rich global contexts from the panorama. To address this issue and also the distortion of equirectangular projection in the panorama, we propose Cubemap Vision Transformers (CViT), a new transformer-based architecture that can model long-range dependencies and extract distortion-free global features from the panorama. We show that cubemap vision transformers have a global receptive field at every stage and can provide globally coherent predictions for spherical signals. To preserve important local features, we further design a convolution-based branch in our pipeline (dubbed GLPanoDepth) and fuse global features from cubemap vision transformers at multiple scales. This global-to-local strategy allows us to fully exploit useful global and local features in the panorama, achieving state-of-the-art performance in panoramic depth estimation.

preprint, 2022, arXiv

TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction

This technical report presents an effective method for motion prediction in autonomous driving. We develop a Transformer-based method for input encoding and trajectory prediction. In addition, we propose the Temporal Flow Header to enhance the trajectory encoding. Finally, an efficient K-means ensemble method is used. Using our Transformer network and ensemble method, we won first place in the Argoverse 2 Motion Forecasting Challenge with a state-of-the-art brier-minFDE score of 1.90.

preprint, 2021, arXiv

Rendering Discrete Participating Media with Geometrical Optics Approximation

We consider the scattering of light in participating media composed of sparsely and randomly distributed discrete particles. The particle size is expected to range from the scale of the wavelength to several orders of magnitude greater than the wavelength, and the appearance shows distinct graininess as opposed to the smooth appearance of continuous media. One fundamental issue in physically based synthesis of this appearance is determining the necessary optical properties in every local region. Since these optical properties vary spatially, we resort to geometrical optics approximation (GOA), a highly efficient alternative to rigorous Lorenz-Mie theory, to quantitatively represent the scattering of a single particle. This enables us to quickly compute bulk optical properties according to any particle size distribution. Then, we propose a practical Monte Carlo rendering solution to solve the transfer of energy in discrete participating media. Results show that, for the first time, our proposed framework can simulate a wide range of discrete participating media with different levels of graininess and converges to continuous media as the particle concentration increases.

preprint, 2021, arXiv

Temporal Alignment Prediction for Few-Shot Video Classification

The goal of few-shot video classification is to learn a classification model with good generalization ability when trained with only a few labeled videos. However, it is difficult to learn discriminative feature representations for videos in such a setting. In this paper, we propose Temporal Alignment Prediction (TAP) based on sequence similarity learning for few-shot video classification. In order to obtain the similarity of a pair of videos, we predict the alignment scores between all pairs of temporal positions in the two videos with the temporal alignment prediction function. In addition, the inputs to this function are equipped with context information in the temporal domain. We evaluate TAP on two video classification benchmarks, Kinetics and Something-Something V2. The experimental results verify the effectiveness of TAP and show its superiority over state-of-the-art methods.

preprint, 2020, arXiv

FA-GANs: Facial Attractiveness Enhancement with Generative Adversarial Networks on Frontal Faces

Facial attractiveness enhancement has been an interesting application in Computer Vision and Graphics in recent years. It aims to generate a more attractive face via manipulations on image and geometry structure while preserving face identity. In this paper, we propose the first Generative Adversarial Networks (GANs) for enhancing facial attractiveness in both geometry and appearance aspects, which we call "FA-GANs". FA-GANs contain two branches and enhance facial attractiveness from two perspectives: facial geometry and facial appearance. Each branch consists of individual GANs, with the appearance branch adjusting the facial image and the geometry branch adjusting the facial landmarks in appearance and geometry aspects, respectively. Unlike traditional facial manipulations that learn from paired faces, which are infeasible to collect before and after enhancement of the same individual, we achieve this by learning the features of attractive faces through unsupervised adversarial learning. The proposed FA-GANs are able to extract attractiveness features and impose them on the enhancement results. To better enhance faces, both the geometry and appearance networks are considered to refine the facial attractiveness by adjusting the geometry layout of faces and the appearance of faces independently. To the best of our knowledge, we are the first to enhance facial attractiveness with GANs in both geometry and appearance aspects. The experimental results suggest that our FA-GANs can generate compelling perceptual results in both geometry structure and facial appearance and outperform current state-of-the-art methods.

preprint, 2020, arXiv

Hybrid Models for Open Set Recognition

Open set recognition requires a classifier to detect samples not belonging to any of the classes in its training set. Existing methods fit a probability distribution to the training samples on their embedding space and detect outliers according to this distribution. The embedding space is often obtained from a discriminative classifier. However, such discriminative representation focuses only on known classes, which may not be critical for distinguishing the unknown classes. We argue that the representation space should be jointly learned from the inlier classifier and the density estimator (serving as an outlier detector). We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category. A typical problem of existing flow-based models is that they may assign a higher likelihood to outliers. However, we empirically observe that such an issue does not occur in our experiments when learning a joint representation for discriminative and generative components. Experiments on standard open set benchmarks also reveal that an end-to-end trained OpenHybrid model significantly outperforms state-of-the-art methods and flow-based baselines.

preprint, 2020, arXiv

Partially Observable Online Change Detection via Smooth-Sparse Decomposition

We consider online change detection of high dimensional data streams with sparse changes, where only a subset of data streams can be observed at each sensing time point due to limited sensing capacities. On the one hand, the detection scheme should be able to deal with partially observable data and meanwhile have efficient detection power for sparse changes. On the other hand, the scheme should be able to adaptively and actively select the most important variables to observe to maximize the detection power. To address these two points, in this paper, we propose a novel detection scheme called CDSSD. In particular, it describes the structure of high dimensional data with sparse changes by smooth-sparse decomposition, whose parameters can be learned via spike-slab variational Bayesian inference. Then the posterior Bayes factor, which incorporates the learned parameters and sparse change information, is formulated as a detection statistic. Finally, by formulating the statistic as the reward of a combinatorial multi-armed bandit problem, an adaptive sampling strategy based on Thompson sampling is proposed. The efficacy and applicability of our method in practice are demonstrated with numerical studies and a real case study.

preprint, 2018, arXiv

Dimensional crossover of heat conduction in amorphous Polyimide nanofibers

The mechanism of thermal conductivity in amorphous polymers, especially polymer fibers, is unclear in comparison with that in inorganic materials. Here, we report the observation of a crossover in heat conduction behavior from three dimensions (3D) to quasi-one dimension (1D) in polyimide (PI) nanofibers at a given temperature. A theoretical model based on random walk theory is proposed to quantitatively describe the interplay between inter-chain hopping and intra-chain hopping in nanofibers. This model explains well the diameter dependence of thermal conductivity and also suggests an upper limit on the thermal conductivity of amorphous polymers in the quasi-1D limit.