Researcher profile

Yongjun Xu

Yongjun Xu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2026arXiv

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making them difficult to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released in https://github.com/mingqiangWu/Echo-Forcing

preprint2026arXiv

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization.

preprint2022arXiv

Cross-Image Relational Knowledge Distillation for Semantic Segmentation

Current Knowledge Distillation (KD) methods for semantic segmentation often guide the student to mimic the teacher's structured information generated from individual data samples. However, they ignore the global semantic relations among pixels across various images that are valuable for KD. This paper proposes a novel Cross-Image Relational KD (CIRKD), which focuses on transferring structured pixel-to-pixel and pixel-to-region relations among the whole images. The motivation is that a good teacher network could construct a well-structured feature space in terms of global pixel dependencies. CIRKD makes the student mimic better structured semantic relations from the teacher, thus improving the segmentation performance. Experimental results over Cityscapes, CamVid and Pascal VOC datasets demonstrate the effectiveness of our proposed approach against state-of-the-art distillation methods. The code is available at https://github.com/winycg/CIRKD.

preprint2022arXiv

Lifelong Generative Learning via Knowledge Reconstruction

Generative models often incur the catastrophic forgetting problem when they are used to sequentially learning multiple tasks, i.e., lifelong generative learning. Although there are some endeavors to tackle this problem, they suffer from high time-consumptions or error accumulation. In this work, we develop an efficient and effective lifelong generative model based on variational autoencoder (VAE). Unlike the generative adversarial network, VAE enjoys high efficiency in the training process, providing natural benefits with few resources. We deduce a lifelong generative model by expending the intrinsic reconstruction character of VAE to the historical knowledge retention. Further, we devise a feedback strategy about the reconstructed data to alleviate the error accumulation. Experiments on the lifelong generating tasks of MNIST, FashionMNIST, and SVHN verified the efficacy of our approach, where the results were comparable to SOTA.

preprint2022arXiv

Localizing Semantic Patches for Accelerating Image Classification

Existing works often focus on reducing the architecture redundancy for accelerating image classification but ignore the spatial redundancy of the input image. This paper proposes an efficient image classification pipeline to solve this problem. We first pinpoint task-aware regions over the input image by a lightweight patch proposal network called AnchorNet. We then feed these localized semantic patches with much smaller spatial redundancy into a general classification network. Unlike the popular design of deep CNN, we aim to carefully design the Receptive Field of AnchorNet without intermediate convolutional paddings. This ensures the exact mapping from a high-level spatial location to the specific input image patch. The contribution of each patch is interpretable. Moreover, AnchorNet is compatible with any downstream architecture. Experimental results on ImageNet show that our method outperforms SOTA dynamic inference methods with fewer inference costs. Our code is available at https://github.com/winycg/AnchorNet.

preprint2022arXiv

MixSKD: Self-Knowledge Distillation from Mixup for Image Recognition

Unlike the conventional Knowledge Distillation (KD), Self-KD allows a network to learn knowledge from itself without any guidance from extra networks. This paper proposes to perform Self-KD from image Mixture (MixSKD), which integrates these two techniques into a unified framework. MixSKD mutually distills feature maps and probability distributions between the random pair of original images and their mixup images in a meaningful way. Therefore, it guides the network to learn cross-image knowledge by modelling supervisory signals from mixup images. Moreover, we construct a self-teacher network by aggregating multi-stage feature maps for providing soft labels to supervise the backbone classifier, further improving the efficacy of self-boosting. Experiments on image classification and transfer learning to object detection and semantic segmentation demonstrate that MixSKD outperforms other state-of-the-art Self-KD and data augmentation methods. The code is available at https://github.com/winycg/Self-KD-Lib.

preprint2022arXiv

Mutual Contrastive Learning for Visual Representation Learning

We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of networks. A crucial component of MCL is Interactive Contrastive Learning (ICL). Compared with vanilla contrastive learning, ICL can aggregate cross-network embedding information and maximize the lower bound to the mutual information between two networks. This enables each network to learn extra contrastive knowledge from others, leading to better feature representations for visual recognition tasks. We emphasize that the resulting MCL is conceptually simple yet empirically powerful. It is a generic framework that can be applied to both supervised and self-supervised representation learning. Experimental results on image classification and transfer learning to object detection show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to generate better feature representations. Code is available at https://github.com/winycg/MCL.

preprint2022arXiv

PFGDF: Pruning Filter via Gaussian Distribution Feature for Deep Neural Networks Acceleration

Deep learning has achieved impressive results in many areas, but the deployment of edge intelligent devices is still very slow. To solve this problem, we propose a novel compression and acceleration method based on data distribution characteristics for deep neural networks, namely Pruning Filter via Gaussian Distribution Feature (PFGDF). Compared with previous advanced pruning methods, PFGDF compresses the model by filters with insignificance in distribution, regardless of the contribution and sensitivity information of the convolution filter. PFGDF is significantly different from weight sparsification pruning because it does not require the special accelerated library to process the sparse weight matrix and introduces no more extra parameters. The pruning process of PFGDF is automated. Furthermore, the model compressed by PFGDF can restore the same performance as the uncompressed model. We evaluate PFGDF through extensive experiments, on CIFAR-10, PFGDF compresses the convolution filter on VGG-16 by 66.62% with more than 90% parameter reduced, while the inference time is accelerated by 83.73% on Huawei MATE 10.

preprint2022arXiv

Pre-training Enhanced Spatial-temporal Graph Neural Network for Multivariate Time Series Forecasting

Multivariate Time Series (MTS) forecasting plays a vital role in a wide range of applications. Recently, Spatial-Temporal Graph Neural Networks (STGNNs) have become increasingly popular MTS forecasting methods. STGNNs jointly model the spatial and temporal patterns of MTS through graph neural networks and sequential models, significantly improving the prediction accuracy. But limited by model complexity, most STGNNs only consider short-term historical MTS data, such as data over the past one hour. However, the patterns of time series and the dependencies between them (i.e., the temporal and spatial patterns) need to be analyzed based on long-term historical MTS data. To address this issue, we propose a novel framework, in which STGNN is Enhanced by a scalable time series Pre-training model (STEP). Specifically, we design a pre-training model to efficiently learn temporal patterns from very long-term history time series (e.g., the past two weeks) and generate segment-level representations. These representations provide contextual information for short-term time series input to STGNNs and facilitate modeling dependencies between time series. Experiments on three public real-world datasets demonstrate that our framework is capable of significantly enhancing downstream STGNNs, and our pre-training model aptly captures temporal patterns.

preprint2022arXiv

Spatial-Temporal Identity: A Simple yet Effective Baseline for Multivariate Time Series Forecasting

Multivariate Time Series (MTS) forecasting plays a vital role in a wide range of applications. Recently, Spatial-Temporal Graph Neural Networks (STGNNs) have become increasingly popular MTS forecasting methods due to their state-of-the-art performance. However, recent works are becoming more sophisticated with limited performance improvements. This phenomenon motivates us to explore the critical factors of MTS forecasting and design a model that is as powerful as STGNNs, but more concise and efficient. In this paper, we identify the indistinguishability of samples in both spatial and temporal dimensions as a key bottleneck, and propose a simple yet effective baseline for MTS forecasting by attaching Spatial and Temporal IDentity information (STID), which achieves the best performance and efficiency simultaneously based on simple Multi-Layer Perceptrons (MLPs). These results suggest that we can design efficient and effective models as long as they solve the indistinguishability of samples, without being limited to STGNNs.

preprint2020arXiv

DRNet: Dissect and Reconstruct the Convolutional Neural Network via Interpretable Manners

Convolutional neural networks (ConvNets) are widely used in real life. People usually use ConvNets which pre-trained on a fixed number of classes. However, for different application scenarios, we usually do not need all of the classes, which means ConvNets are redundant when dealing with these tasks. This paper focuses on the redundancy of ConvNet channels. We proposed a novel idea: using an interpretable manner to find the most important channels for every single class (dissect), and dynamically run channels according to classes in need (reconstruct). For VGG16 pre-trained on CIFAR-10, we only run 11\% parameters for two-classes sub-tasks on average with negligible accuracy loss. For VGG16 pre-trained on ImageNet, our method averagely gains 14.29\% accuracy promotion for two-classes sub-tasks. In addition, analysis show that our method captures some semantic meanings of channels, and uses the context information more targeted for sub-tasks of ConvNets.

preprint2020arXiv

Localizing Interpretable Multi-scale informative Patches Derived from Media Classification Task

Deep convolutional neural networks (CNN) always depend on wider receptive field (RF) and more complex non-linearity to achieve state-of-the-art performance, while suffering the increased difficult to interpret how relevant patches contribute the final prediction. In this paper, we construct an interpretable AnchorNet equipped with our carefully designed RFs and linearly spatial aggregation to provide patch-wise interpretability of the input media meanwhile localizing multi-scale informative patches only supervised on media-level labels without any extra bounding box annotations. Visualization of localized informative image and text patches show the superior multi-scale localization capability of AnchorNet. We further use localized patches for downstream classification tasks across widely applied networks. Experimental results demonstrate that replacing the original inputs with their patches for classification can get a clear inference acceleration with only tiny performance degradation, which proves that localized patches can indeed retain the most semantics and evidences of the original inputs.

preprint2020arXiv

Towards Green Mobile Edge Computing Offloading Systems with Security Enhancement

Mobile edge computing (MEC) is an emerging communication scheme that aims at reducing latency. In this paper, we investigate a green MEC system under the existence of an eavesdropper. We use computation efficiency, which is defined as the total number of bits computed divided by the total energy consumption, as our main metric. To alleviate stringent latency requirement, a joint secure offloading and computation model is proposed. Additionally, we formulate an optimization problem for maximizing computation efficiency, under several practical constraints. The non-convex problem is tackled by successive convex approximation and an iterative algorithm. Simulation results have verified the superiority of our proposed scheme, as well as the effectiveness of our problem solution.

preprint2020arXiv

Towards More Efficient and Effective Inference: The Joint Decision of Multi-Participants

Existing approaches to improve the performances of convolutional neural networks by optimizing the local architectures or deepening the networks tend to increase the size of models significantly. In order to deploy and apply the neural networks to edge devices which are in great demand, reducing the scale of networks are quite crucial. However, It is easy to degrade the performance of image processing by compressing the networks. In this paper, we propose a method which is suitable for edge devices while improving the efficiency and effectiveness of inference. The joint decision of multi-participants, mainly contain multi-layers and multi-networks, can achieve higher classification accuracy (0.26% on CIFAR-10 and 4.49% on CIFAR-100 at most) with similar total number of parameters for classical convolutional neural networks.