Researcher profile

Song Guo

Song Guo contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
20works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

20 published item(s)

preprint2026arXiv

3D Generation for Embodied AI and Robotic Simulation: A Survey

Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This survey reviews 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. In Data Generator, 3D generation produces simulation-ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; in Simulation Environments, it constructs interactive and task-oriented worlds, spanning structure-aware, controllable, and agentic scene generation; and in Sim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real-world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks, including limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim-to-real divide, that must be addressed for 3D generation to become a dependable foundation for embodied intelligence. Our project page is at https://3dgen4robot.github.io.

preprint2026arXiv

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.

preprint2026arXiv

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the ``known unknown'' required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning ``what the agent thinks'' with ``what the agent sees'' is key to solving complex or sparse agentic tasks.

preprint2023arXiv

Balanced Multi-modal Federated Learning via Cross-Modal Infiltration

Federated learning (FL) underpins advancements in privacy-preserving distributed computing by collaboratively training neural networks without exposing clients' raw data. Current FL paradigms primarily focus on uni-modal data, while exploiting the knowledge from distributed multimodal data remains largely unexplored. Existing multimodal FL (MFL) solutions are mainly designed for statistical or modality heterogeneity from the input side, however, have yet to solve the fundamental issue,"modality imbalance", in distributed conditions, which can lead to inadequate information exploitation and heterogeneous knowledge aggregation on different modalities.In this paper, we propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework that effectively alleviates modality imbalance and knowledge heterogeneity via knowledge transfer from the global dominant modality. To avoid the loss of information in the weak modality due to merely imitating the behavior of dominant modality, we design the two-projector module to integrate the knowledge from dominant modality while still promoting the local feature exploitation of weak modality. In addition, we introduce a class-wise temperature adaptation scheme to achieve fair performance across different classes. Extensive experiments over popular datasets are conducted and give us a gratifying confirmation of the proposed framework for fully exploring the information of each modality in MFL.

preprint2022arXiv

A Coalition Formation Game Approach for Personalized Federated Learning

Facing the challenge of statistical diversity in client local data distribution, personalized federated learning (PFL) has become a growing research hotspot. Although the state-of-the-art methods with model similarity-based pairwise collaboration have achieved promising performance, they neglect the fact that model aggregation is essentially a collaboration process within the coalition, where the complex multiwise influences take place among clients. In this paper, we first apply Shapley value (SV) from coalition game theory into the PFL scenario. To measure the multiwise collaboration among a group of clients on the personalized learning performance, SV takes their marginal contribution to the final result as a metric. We propose a novel personalized algorithm: pFedSV, which can 1. identify each client's optimal collaborator coalition and 2. perform personalized model aggregation based on SV. Extensive experiments on various datasets (MNIST, Fashion-MNIST, and CIFAR-10) are conducted with different Non-IID data settings (Pathological and Dirichlet). The results show that pFedSV can achieve superior personalized accuracy for each client, compared to the state-of-the-art benchmarks.

preprint2022arXiv

A Survey on Gradient Inversion: Attacks, Defenses and Future Directions

Recent studies have shown that the training samples can be recovered from gradients, which are called Gradient Inversion (GradInv) attacks. However, there remains a lack of extensive surveys covering recent advances and thorough analysis of this issue. In this paper, we present a comprehensive survey on GradInv, aiming to summarize the cutting-edge research and broaden the horizons for different domains. Firstly, we propose a taxonomy of GradInv attacks by characterizing existing attacks into two paradigms: iteration- and recursion-based attacks. In particular, we dig out some critical ingredients from the iteration-based attacks, including data initialization, model training and gradient matching. Second, we summarize emerging defense strategies against GradInv attacks. We find these approaches focus on three perspectives covering data obscuration, model improvement and gradient protection. Finally, we discuss some promising directions and open problems for further research.

preprint2022arXiv

CE-based white-box adversarial attacks will not work using super-fitting

Deep neural networks are widely used in various fields because of their powerful performance. However, recent studies have shown that deep learning models are vulnerable to adversarial attacks, i.e., adding a slight perturbation to the input will make the model obtain wrong results. This is especially dangerous for some systems with high-security requirements, so this paper proposes a new defense method by using the model super-fitting state to improve the model's adversarial robustness (i.e., the accuracy under adversarial attacks). This paper mathematically proves the effectiveness of super-fitting and enables the model to reach this state quickly by minimizing unrelated category scores (MUCS). Theoretically, super-fitting can resist any existing (even future) CE-based white-box adversarial attacks. In addition, this paper uses a variety of powerful attack algorithms to evaluate the adversarial robustness of super-fitting, and the proposed method is compared with nearly 50 defense models from recent conferences. The experimental results show that the super-fitting method in this paper can make the trained model obtain the highest adversarial robustness.

preprint2022arXiv

Efficient Attribute Unlearning: Towards Selective Removal of Input Attributes from Feature Representations

Recently, the enactment of privacy regulations has promoted the rise of the machine unlearning paradigm. Existing studies of machine unlearning mainly focus on sample-wise unlearning, such that a learnt model will not expose user's privacy at the sample level. Yet we argue that such ability of selective removal should also be presented at the attribute level, especially for the attributes irrelevant to the main task, e.g., whether a person recognized in a face recognition system wears glasses or the age range of that person. Through a comprehensive literature review, it is found that existing studies on attribute-related problems like fairness and de-biasing learning cannot address the above concerns properly. To bridge this gap, we propose a paradigm of selectively removing input attributes from feature representations which we name `attribute unlearning'. In this paradigm, certain attributes will be accurately captured and detached from the learned feature representations at the stage of training, according to their mutual information. The particular attributes will be progressively eliminated along with the training procedure towards convergence, while the rest of attributes related to the main task are preserved for achieving competitive model performance. Considering the computational complexity during the training process, we not only give a theoretically approximate training method, but also propose an acceleration scheme to speed up the training process. We validate our method by spanning several datasets and models and demonstrate that our design can preserve model fidelity and reach prevailing unlearning efficacy with high efficiency. The proposed unlearning paradigm builds a foundation for future machine unlearning system and will become an essential component of the latest privacy-related legislation.

preprint2022arXiv

Federated Unlearning via Class-Discriminative Pruning

We explore the problem of selectively forgetting categories from trained CNN classification models in the federated learning (FL). Given that the data used for training cannot be accessed globally in FL, our insights probe deep into the internal influence of each channel. Through the visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification. Inspired by this, we propose a method for scrubbing the model clean of information about particular categories. The method does not require retraining from scratch, nor global access to the data used for training. Instead, we introduce the concept of Term Frequency Inverse Document Frequency (TF-IDF) to quantize the class discrimination of channels. Channels with high TF-IDF scores have more discrimination on the target categories and thus need to be pruned to unlearn. The channel pruning is followed by a fine-tuning process to recover the performance of the pruned model. Evaluated on CIFAR10 dataset, our method accelerates the speed of unlearning by 8.9x for the ResNet model, and 7.9x for the VGG model under no degradation in accuracy, compared to retraining from scratch. For CIFAR100 dataset, the speedups are 9.9x and 8.4x, respectively. We envision this work as a complementary block for FL towards compliance with legal and ethical criteria.

preprint2022arXiv

Layer-wised Model Aggregation for Personalized Federated Learning

Personalized Federated Learning (pFL) not only can capture the common priors from broad range of distributed data, but also support customized models for heterogeneous clients. Researches over the past few years have applied the weighted aggregation manner to produce personalized models, where the weights are determined by calibrating the distance of the entire model parameters or loss values, and have yet to consider the layer-level impacts to the aggregation process, leading to lagged model convergence and inadequate personalization over non-IID datasets. In this paper, we propose a novel pFL training framework dubbed Layer-wised Personalized Federated learning (pFedLA) that can discern the importance of each layer from different clients, and thus is able to optimize the personalized model aggregation for clients with heterogeneous data. Specifically, we employ a dedicated hypernetwork per client on the server side, which is trained to identify the mutual contribution factors at layer granularity. Meanwhile, a parameterized mechanism is introduced to update the layer-wised aggregation weights to progressively exploit the inter-user similarity and realize accurate model personalization. Extensive experiments are conducted over different models and learning tasks, and we show that the proposed methods achieve significantly higher performance than state-of-the-art pFL methods.

preprint2022arXiv

Personalized Federated Learning with Contextualized Generalization

The prevalent personalized federated learning (PFL) usually pursues a trade-off between personalization and generalization by maintaining a shared global model to guide the training process of local models. However, the sole global model may easily transfer deviated context knowledge to some local models when multiple latent contexts exist across the local datasets. In this paper, we propose a novel concept called contextualized generalization (CG) to provide each client with fine-grained context knowledge that can better fit the local data distributions and facilitate faster model convergence, based on which we properly design a framework of PFL, dubbed CGPFL. We conduct detailed theoretical analysis, in which the convergence guarantee is presented and $\mathcal{O}(\sqrt{K})$ speedup over most existing methods is granted. To quantitatively study the generalization-personalization trade-off, we introduce the 'generalization error' measure and prove that the proposed CGPFL can achieve a better trade-off than existing solutions. Moreover, our theoretical analysis further inspires a heuristic algorithm to find a near-optimal trade-off in CGPFL. Experimental results on multiple real-world datasets show that our approach surpasses the state-of-the-art methods on test accuracy by a significant margin.

preprint2022arXiv

PromptFL: Let Federated Participants Cooperatively Learn Prompts Instead of Models -- Federated Learning in Age of Foundation Model

Quick global aggregation of effective distributed parameters is crucial to federated learning (FL), which requires adequate bandwidth for parameters communication and sufficient user data for local training. Otherwise, FL may cost excessive training time for convergence and produce inaccurate models. In this paper, we propose a brand-new FL framework, PromptFL, that replaces the federated model training with the federated prompt training, i.e., let federated participants train prompts instead of a shared model, to simultaneously achieve the efficient global aggregation and local training on insufficient data by exploiting the power of foundation models (FM) in a distributed way. PromptFL ships an off-the-shelf FM, i.e., CLIP, to distributed clients who would cooperatively train shared soft prompts based on very few local data. Since PromptFL only needs to update the prompts instead of the whole model, both the local training and the global aggregation can be significantly accelerated. And FM trained over large scale data can provide strong adaptation capability to distributed users tasks with the trained soft prompts. We empirically analyze the PromptFL via extensive experiments, and show its superiority in terms of system feasibility, user privacy, and performance.

preprint2022arXiv

Rethinking Classifier and Adversarial Attack

Various defense models have been proposed to resist adversarial attack algorithms, but existing adversarial robustness evaluation methods always overestimate the adversarial robustness of these models (i.e., not approaching the lower bound of robustness). To solve this problem, this paper uses the proposed decouple space method to divide the classifier into two parts: non-linear and linear. Then, this paper defines the representation vector of the original example (and its space, i.e., the representation space) and uses the iterative optimization of Absolute Classification Boundaries Initialization (ACBI) to obtain a better attack starting point. Particularly, this paper applies ACBI to nearly 50 widely-used defense models (including 8 architectures). Experimental results show that ACBI achieves lower robust accuracy in all cases.

preprint2022arXiv

Towards Unbiased Multi-label Zero-Shot Learning with Pyramid and Semantic Attention

Multi-label zero-shot learning extends conventional single-label zero-shot learning to a more realistic scenario that aims at recognizing multiple unseen labels of classes for each input sample. Existing works usually exploit attention mechanism to generate the correlation among different labels. However, most of them are usually biased on several major classes while neglect most of the minor classes with the same importance in input samples, and may thus result in overly diffused attention maps that cannot sufficiently cover minor classes. We argue that disregarding the connection between major and minor classes, i.e., correspond to the global and local information, respectively, is the cause of the problem. In this paper, we propose a novel framework of unbiased multi-label zero-shot learning, by considering various class-specific regions to calibrate the training process of the classifier. Specifically, Pyramid Feature Attention (PFA) is proposed to build the correlation between global and local information of samples to balance the presence of each class. Meanwhile, for the generated semantic representations of input samples, we propose Semantic Attention (SA) to strengthen the element-wise correlation among these vectors, which can encourage the coordinated representation of them. Extensive experiments on the large-scale multi-label zero-shot benchmarks NUS-WIDE and Open-Image demonstrate that the proposed method surpasses other representative methods by significant margins.

preprint2021arXiv

Gain without Pain: Offsetting DP-injected Nosies Stealthily in Cross-device Federated Learning

Federated Learning (FL) is an emerging paradigm through which decentralized devices can collaboratively train a common model. However, a serious concern is the leakage of privacy from exchanged gradient information between clients and the parameter server (PS) in FL. To protect gradient information, clients can adopt differential privacy (DP) to add additional noises and distort original gradients before they are uploaded to the PS. Nevertheless, the model accuracy will be significantly impaired by DP noises, making DP impracticable in real systems. In this work, we propose a novel Noise Information Secretly Sharing (NISS) algorithm to alleviate the disturbance of DP noises by sharing negated noises among clients. We theoretically prove that: 1) If clients are trustworthy, DP noises can be perfectly offset on the PS; 2) Clients can easily distort negated DP noises to protect themselves in case that other clients are not totally trustworthy, though the cost lowers model accuracy. NISS is particularly applicable for FL across multiple IoT (Internet of Things) systems, in which all IoT devices need to collaboratively train a model. To verify the effectiveness and the superiority of the NISS algorithm, we conduct experiments with the MNIST and CIFAR-10 datasets. The experiment results verify our analysis and demonstrate that NISS can improve model accuracy by 21% on average and obtain better privacy protection if clients are trustworthy.

preprint2021arXiv

Optimizing Video Caching at the Edge: A Hybrid Multi-Point Process Approach

It is always a challenging problem to deliver a huge volume of videos over the Internet. To meet the high bandwidth and stringent playback demand, one feasible solution is to cache video contents on edge servers based on predicted video popularity. Traditional caching algorithms (e.g., LRU, LFU) are too simple to capture the dynamics of video popularity, especially long-tailed videos. Recent learning-driven caching algorithms (e.g., DeepCache) show promising performance, however, such black-box approaches are lack of explainability and interpretability. Moreover, the parameter tuning requires a large number of historical records, which are difficult to obtain for videos with low popularity. In this paper, we optimize video caching at the edge using a white-box approach, which is highly efficient and also completely explainable. To accurately capture the evolution of video popularity, we develop a mathematical model called \emph{HRS} model, which is the combination of multiple point processes, including Hawkes' self-exciting, reactive and self-correcting processes. The key advantage of the HRS model is its explainability, and much less number of model parameters. In addition, all its model parameters can be learned automatically through maximizing the Log-likelihood function constructed by past video request events. Next, we further design an online HRS-based video caching algorithm. To verify its effectiveness, we conduct a series of experiments using real video traces collected from Tencent Video, one of the largest online video providers in China. Experiment results demonstrate that our proposed algorithm outperforms the state-of-the-art algorithms, with 12.3\% improvement on average in terms of cache hit rate under realistic settings.

preprint2020arXiv

A Novel Perspective to Zero-shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion

Zero-shot learning aims at recognizing unseen classes (no training example) with knowledge transferred from seen classes. This is typically achieved by exploiting a semantic feature space shared by both seen and unseen classes, i.e., attribute or word vector, as the bridge. One common practice in zero-shot learning is to train a projection between the visual and semantic feature spaces with labeled seen classes examples. When inferring, this learned projection is applied to unseen classes and recognizes the class labels by some metrics. However, the visual and semantic feature spaces are mutually independent and have quite different manifold structures. Under such a paradigm, most existing methods easily suffer from the domain shift problem and weaken the performance of zero-shot recognition. To address this issue, we propose a novel model called AMS-SFE. It considers the alignment of manifold structures by semantic feature expansion. Specifically, we build upon an autoencoder-based model to expand the semantic features from the visual inputs. Additionally, the expansion is jointly guided by an embedded manifold extracted from the visual feature space of the data. Our model is the first attempt to align both feature spaces by expanding semantic features and derives two benefits: first, we expand some auxiliary features that enhance the semantic feature space; second and more importantly, we implicitly align the manifold structures between the visual and semantic feature spaces; thus, the projection can be better trained and mitigate the domain shift problem. Extensive experiments show significant performance improvement, which verifies the effectiveness of our model.

preprint2020arXiv

Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning

Federated Learning is a powerful machine learning paradigm to cooperatively train a global model with highly distributed data. A major bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) algorithm for large-scale Federated Learning is the communication overhead on pushing local gradients and pulling global model. In this paper, to reduce the communication complexity of Federated Learning, a novel approach named Pulling Reduction with Local Compensation (PRLC) is proposed. Specifically, each training node intermittently pulls the global model from the server in SGD iterations, resulting in that it is sometimes unsynchronized with the server. In such a case, it will use its local update to compensate the gap between the local model and the global model. Our rigorous theoretical analysis of PRLC achieves two important findings. First, we prove that the convergence rate of PRLC preserves the same order as the classical synchronous SGD for both strongly-convex and non-convex cases with good scalability due to the linear speedup with respect to the number of training nodes. Second, we show that PRLC admits lower pulling frequency than the existing pulling reduction method without local compensation. We also conduct extensive experiments on various machine learning models to validate our theoretical results. Experimental results show that our approach achieves a significant pulling reduction over the state-of-the-art methods, e.g., PRLC requiring only half of the pulling operations of LAG.

preprint2020arXiv

Prophet: Proactive Candidate-Selection for Federated Learning by Predicting the Qualities of Training and Reporting Phases

Although the challenge of the device connection is much relieved in 5G networks, the training latency is still an obstacle preventing Federated Learning (FL) from being largely adopted. One of the most fundamental problems that lead to large latency is the bad candidate-selection for FL. In the dynamic environment, the mobile devices selected by the existing reactive candidate-selection algorithms very possibly fail to complete the training and reporting phases of FL, because the FL parameter server only knows the currently-observed resources of all candidates. To this end, we study the proactive candidate-selection for FL in this paper. We first let each candidate device predict the qualities of both its training and reporting phases locally using LSTM. Then, the proposed candidateselection algorithm is implemented by the Deep Reinforcement Learning (DRL) framework. Finally, the real-world trace-driven experiments prove that the proposed approach outperforms the existing reactive algorithms

preprint2020arXiv

Scalable and Communication-efficient Decentralized Federated Edge Learning with Multi-blockchain Framework

The emerging Federated Edge Learning (FEL) technique has drawn considerable attention, which not only ensures good machine learning performance but also solves "data island" problems caused by data privacy concerns. However, large-scale FEL still faces following crucial challenges: (i) there lacks a secure and communication-efficient model training scheme for FEL; (2) there is no scalable and flexible FEL framework for updating local models and global model sharing (trading) management. To bridge the gaps, we first propose a blockchain-empowered secure FEL system with a hierarchical blockchain framework consisting of a main chain and subchains. This framework can achieve scalable and flexible decentralized FEL by individually manage local model updates or model sharing records for performance isolation. A Proof-of-Verifying consensus scheme is then designed to remove low-quality model updates and manage qualified model updates in a decentralized and secure manner, thereby achieving secure FEL. To improve communication efficiency of the blockchain-empowered FEL, a gradient compression scheme is designed to generate sparse but important gradients to reduce communication overhead without compromising accuracy, and also further strengthen privacy preservation of training data. The security analysis and numerical results indicate that the proposed schemes can achieve secure, scalable, and communication-efficient decentralized FEL.