Source author record

Zhichao Zhang

Zhichao Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Computer Vision; Distributed, Parallel, and Cluster Computing; eess.AS; eess.IV; eess.SP; Graphics; Machine Learning; Multimedia; Sound

Catalog footprint

What is connected

4 works

9 topics

4 close collaborators

Preprint, 2026, arXiv

Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Blind video quality assessment (BVQA) is a highly challenging task due to the intrinsic complexity of video content and visual distortions, especially given the high popularity of social media videos, which originate from a wide range of sources and are often processed by various compression and enhancement algorithms. While recent BVQA and blind image quality assessment (BIQA) studies have made remarkable progress, their models typically perform well on the datasets they were trained on but generalize poorly to unseen videos, making them less effective for accurately evaluating the perceptual quality of diverse social media videos. In this paper, we propose Rich Quality-aware features enabled Video Quality Assessment (RQ-VQA), a simple yet effective method to enhance BVQA by leveraging rich quality-aware features extracted from off-the-shelf BIQA and BVQA models. Our approach exploits the expertise of existing quality assessment models within their trained domains to improve generalization. Specifically, we design a multi-source feature framework that integrates: (1) Learnable spatial features from a base model fine-tuned on the target VQA dataset to capture domain-specific quality cues; (2) Temporal motion features from the fast pathway of SlowFast pre-trained on action recognition datasets to model motion-related distortions; (3) Spatial quality-aware features from BIQA models trained on diverse IQA datasets to enhance frame-level distortion representation; and (4) Spatiotemporal quality-aware features from a BVQA model trained on large-scale VQA datasets to jointly encode spatial structure and temporal dynamics. These features are concatenated and fed into a multi-layer perceptron (MLP) that regresses them into quality scores. Experimental results demonstrate that our model achieves state-of-the-art performance on three public social media VQA datasets.
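The fusion step described in the abstract (concatenate the four feature groups, then regress a score with an MLP) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, feature sizes, hidden-layer width, ReLU activation, and random weights are all assumptions, and real features would come from the pretrained BIQA, BVQA, and SlowFast models rather than a random generator.

```python
import random


def concat_features(*feature_groups):
    """Concatenate per-video feature vectors from several sources into one vector."""
    fused = []
    for group in feature_groups:
        fused.extend(group)
    return fused


def mlp_regress(x, w1, b1, w2, b2):
    """Two-layer perceptron: ReLU hidden layer followed by a scalar linear output."""
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


rng = random.Random(0)

# Hypothetical stand-ins for the four feature groups (sizes are arbitrary).
spatial = [rng.random() for _ in range(4)]  # learnable spatial features
motion = [rng.random() for _ in range(3)]   # temporal motion features
biqa = [rng.random() for _ in range(2)]     # spatial quality-aware features
bvqa = [rng.random() for _ in range(2)]     # spatiotemporal quality-aware features

fused = concat_features(spatial, motion, biqa, bvqa)

# Untrained, randomly initialised weights (assumption: one hidden layer of 5 units).
hidden_size = 5
w1 = [[rng.uniform(-0.5, 0.5) for _ in fused] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
b2 = 0.0

score = mlp_regress(fused, w1, b1, w2, b2)
print(len(fused), score)
```

In practice the MLP weights would be trained against subjective quality labels; the sketch only shows how the concatenated vector flows into a scalar prediction.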

Preprint, 2025, arXiv

PartMotionEdit: Fine-Grained Text-Driven 3D Human Motion Editing via Part-Level Modulation

Existing text-driven 3D human motion editing methods have demonstrated significant progress, but still struggle to precisely control detailed, part-specific motions due to their global modeling nature. In this paper, we propose PartMotionEdit, a novel fine-grained motion editing framework that operates via part-level semantic modulation. The core of PartMotionEdit is a Part-aware Motion Modulation (PMM) module, which builds upon a predefined five-part body decomposition. PMM dynamically predicts time-varying modulation weights for each body part, enabling precise and interpretable editing of local motions. To guide the training of PMM, we also introduce a part-level similarity curve supervision mechanism enhanced with dual-layer normalization. This mechanism assists PMM in learning semantically consistent and editable distributions across all body parts. Furthermore, we design a Bidirectional Motion Interaction (BMI) module. It leverages bidirectional cross-modal attention to achieve more accurate semantic alignment between textual instructions and motion semantics. Extensive quantitative and qualitative evaluations on a well-known benchmark demonstrate that PartMotionEdit outperforms the state-of-the-art methods.

Preprint, 2022, arXiv

Revisiting Communication-Efficient Federated Learning with Balanced Global and Local Updates

In federated learning (FL), a number of devices train their local models and upload the corresponding parameters or gradients to the base station (BS) to update the global model while protecting their data privacy. However, due to the limited computation and communication resources, the number of local trainings (a.k.a. local update) and that of aggregations (a.k.a. global update) need to be carefully chosen. In this paper, we investigate and analyze the optimal trade-off between the number of local trainings and that of global aggregations to speed up the convergence and enhance the prediction accuracy over the existing works. Our goal is to minimize the global loss function under both the delay and the energy consumption constraints. In order to make the optimization problem tractable, we derive a new and tight upper bound on the loss function, which allows us to obtain closed-form expressions for the number of local trainings and that of global aggregations. Simulation results show that our proposed scheme can achieve a better performance in terms of the prediction accuracy, and converge much faster than the baseline schemes.
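The two quantities the abstract trades off, the number of local training steps per round and the number of global aggregations, can be illustrated with a toy federated-averaging loop. This is only a sketch under strong assumptions: each device holds a one-dimensional quadratic loss 0.5 * (w - center)^2, the function names and constants are hypothetical, and neither the delay and energy constraints nor the paper's closed-form solution is modeled.

```python
def local_update(w, center, lr, local_steps):
    """Run gradient descent on one device's loss 0.5 * (w - center) ** 2."""
    for _ in range(local_steps):
        grad = w - center
        w -= lr * grad
    return w


def federated_train(centers, lr, local_steps, global_rounds, w0=0.0):
    """Alternate local training on every device with averaging at the base station."""
    w = w0
    for _ in range(global_rounds):
        local_models = [local_update(w, c, lr, local_steps) for c in centers]
        w = sum(local_models) / len(local_models)  # global aggregation
    return w


def global_loss(w, centers):
    """Average of the per-device quadratic losses."""
    return sum(0.5 * (w - c) ** 2 for c in centers) / len(centers)


centers = [1.0, 2.0, 6.0]  # hypothetical per-device optima

few_local = federated_train(centers, lr=0.1, local_steps=1, global_rounds=5)
more_local = federated_train(centers, lr=0.1, local_steps=5, global_rounds=5)

print(global_loss(few_local, centers), global_loss(more_local, centers))
```

In this toy problem every device has the same curvature, so extra local steps only help for a fixed number of rounds. With heterogeneous data and limited delay or energy budgets the balance is less obvious, which is the regime the paper analyzes.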

Preprint, 2020, arXiv

Learning Frame Level Attention for Environmental Sound Classification

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. We investigated the classification performance when using different attention scaling functions and applying attention at different layers. Experiments were conducted on the ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method, which achieved state-of-the-art or competitive classification accuracy with lower computational complexity. We also visualized our attention results and observed that the proposed attention mechanism was able to lead the network to focus on the semantically relevant parts of environmental sounds.
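Frame-level attention of the kind described above can be sketched as a weighted pooling over per-frame feature vectors. The code below is a generic illustration, not the paper's model: the dot-product scoring, the softmax scaling function, and all names and values are assumptions, and in the paper the frame features would come from the convolutional recurrent network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, score_vector):
    """Score each frame, normalise the scores, and return the weighted sum of frames."""
    scores = [sum(v * f for v, f in zip(score_vector, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return weights, pooled


# Hypothetical features for three frames: silent, salient, near-silent.
frames = [[0.0, 0.0], [2.0, 1.0], [0.1, 0.0]]
score_vector = [1.0, 1.0]  # stands in for learned attention parameters

weights, pooled = attention_pool(frames, score_vector)
print(weights, pooled)
```

The salient middle frame receives the largest weight, so it dominates the pooled representation while the silent frames contribute little.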