Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
50works
0followers
19topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

50 published item(s)

preprint2026arXiv

A Survey on 3D Skeleton Based Person Re-Identification: Taxonomy, Advances, Challenges, and Interdisciplinary Prospects

Person re-identification via 3D skeletons is an important emerging research area that attracts increasing attention within the pattern recognition community. With distinctive advantages across various application scenarios, numerous 3D skeleton based person re-identification (SRID) methods with diverse skeleton modeling and learning paradigms have been proposed in recent years. In this paper, we provide a comprehensive review and analysis of recent SRID advances. First of all, we define the SRID task and provide an overview of its origin and major advancements. Secondly, we formulate a systematic taxonomy that organizes existing methods into three categories centered on hand-crafted, sequence-based, and graph-based modeling. Then, we elaborate on the representative models along these three types with an illustration of foundational mechanisms. Meanwhile, we provide an overview of mainstream supervised, self-supervised, and unsupervised SRID learning paradigms and corresponding common methods. A thorough evaluation of state-of-the-art SRID methods is further conducted over various types of benchmarks and protocols to compare their effectiveness, efficiency, and key properties. Finally, we present the key challenges and prospects to advance future research, and highlight interdisciplinary applications of SRID with a case study.

preprint2026arXiv

Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.

preprint2026arXiv

EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample

Test-time model evolution offers a promising way for deployed models to improve from unlabeled test-time experience, yet most existing methods depend on backpropagation (BP), which incurs substantial memory overhead and makes them difficult to deploy on edge devices, quantized models, specialized accelerators, or black-box models. In this work, we study test-time model evolution under a strict two-forward budget, a setting that pushes adaptation toward highly efficient real-world deployment. We reveal three key obstacles in zeroth-order test-time optimization: susceptibility to shortcut solutions, uncontrolled weight drift, and ineffective update direction estimation. To overcome them, we propose EVA-0, a minimal zeroth-order adaptation framework that: 1) keeps the loss scale-invariant to prevent shortcut solutions; 2) devises an anchor-guided optimization strategy to alleviate weight drift; 3) uses sample-wise symmetric two-sided perturbation for update direction estimation and inference. EVA-0 requires no BP and performs both inference and adaptation within only two forward passes per sample. Results on ImageNet-C & ViT-Base show that EVA-0 outperforms both BP-based DeYO and BP-free FOA, while achieving a 14x speed-up over FOA. Code will be released.

preprint2026arXiv

Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation

Federated Knowledge Graph Embedding (FKGE) aims to facilitate collaborative learning of entity and relation embeddings from distributed Knowledge Graphs (KGs) across multiple clients, while preserving data privacy. Training FKGE models with higher dimensions is typically favored due to their potential for achieving superior performance. However, high-dimensional embeddings present significant challenges in terms of storage resource and inference speed. Unlike traditional KG embedding methods, FKGE involves multiple client-server communication rounds, where communication efficiency is critical. Existing embedding compression methods for traditional KGs may not be directly applicable to FKGE as they often require multiple model trainings which potentially incur substantial communication costs. In this paper, we propose a light-weight component based on Knowledge Distillation (KD) which is titled FedKD and tailored specifically for FKGE methods. During client-side local training, FedKD facilitates the low-dimensional student model to mimic the score distribution of triples from the high-dimensional teacher model using KL divergence loss. Unlike traditional KD way, FedKD adaptively learns a temperature to scale the score of positive triples and separately adjusts the scores of corresponding negative triples using a predefined temperature, thereby mitigating teacher over-confidence issue. Furthermore, we dynamically adjust the weight of KD loss to optimize the training process. Extensive experiments on three datasets support the effectiveness of FedKD.

preprint2026arXiv

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.

preprint2026arXiv

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

preprint2023arXiv

Converse Attention Knowledge Transfer for Low-Resource Named Entity Recognition

In recent years, great success has been achieved in many tasks of natural language processing (NLP), e.g., named entity recognition (NER), especially in the high-resource language, i.e., English, thanks in part to the considerable amount of labeled resources. However, most low-resource languages do not have such an abundance of labeled data as high-resource English, leading to poor performance of NER in these low-resource languages. Inspired by knowledge transfer, we propose Converse Attention Network, or CAN in short, to improve the performance of NER in low-resource languages by leveraging the knowledge learned in pretrained high-resource English models. CAN first translates low-resource languages into high-resource English using an attention based translation module. In the process of translation, CAN obtain the attention matrices that align the two languages. Furthermore, CAN use the attention matrices to align the high-resource semantic features from a pretrained high-resource English model with the low-resource semantic features. As a result, CAN obtains aligned high-resource semantic features to enrich the representations of low-resource languages. Experiments on four low-resource NER datasets show that CAN achieves consistent and significant performance improvements, which indicates the effectiveness of CAN.

preprint2022arXiv

A Dynamic Hierarchical Framework for IoT-assisted Metaverse Synchronization

Metaverse has recently attracted much attention from both academia and industry. Virtual services, ranging from virtual driver training to online route optimization for smart goods delivery, are emerging in the Metaverse. To make the human experience of virtual life more real, digital twins (DTs), namely digital replicas of physical objects, are key enablers. However, DT status may not always accurately reflect that of its real-world twin because the latter may be subject to changes with time. As such, it is necessary to synchronize a DT with its physical counterpart to ensure that its status is accurate for virtual businesses in the Metaverse. In this paper, we propose a dynamic hierarchical framework in which a group of IoT devices is incentivized to sense and collect physical objects' status information collectively so as to assists virtual service providers (VSPs) in synchronizing DTs. Based on the collected sensing data and the value decay rate of the DTs, the VSPs can determine synchronization intensities to maximize their payoffs. In our proposed dynamic hierarchical framework, the lower-level evolutionary game captures the VSPs selection by the IoT device population, and the upper-level differential game captures the VSPs payoffs, which are affected by the synchronization strategy, IoT devices selections, and the DTs value status, given VSPs are simultaneous decision makers. We further consider the case in which some VSPs are first movers and extend it as a Stackelberg differential game. We theoretically and experimentally show that the equilibrium to the lower-level game exists and is evolutionarily robust, and provide a sensitivity analysis with respect to various system parameters. Experiments show that the proposed dynamic hierarchical game outperform the baseline.

preprint2022arXiv

A Full Dive into Realizing the Edge-enabled Metaverse: Visions, Enabling Technologies,and Challenges

Dubbed "the successor to the mobile Internet", the concept of the Metaverse has grown in popularity. While there exist lite versions of the Metaverse today, they are still far from realizing the full vision of an immersive, embodied, and interoperable Metaverse. Without addressing the issues of implementation from the communication and networking, as well as computation perspectives, the Metaverse is difficult to succeed the Internet, especially in terms of its accessibility to billions of users today. In this survey, we focus on the edge-enabled Metaverse to realize its ultimate vision. We first provide readers with a succinct tutorial of the Metaverse, an introduction to the architecture, as well as current developments. To enable ubiquitous, seamless, and embodied access to the Metaverse, we discuss the communication and networking challenges and survey cutting-edge solutions and concepts that leverage next-generation communication systems for users to immerse as and interact with embodied avatars in the Metaverse. Moreover, given the high computation costs required, e.g., to render 3D virtual worlds and run data-hungry artificial intelligence-driven avatars, we discuss the computation challenges and cloud-edge-end computation framework-driven solutions to realize the Metaverse on resource-constrained edge devices. Next, we explore how blockchain technologies can aid in the interoperable development of the Metaverse, not just in terms of empowering the economic circulation of virtual user-generated content but also to manage physical edge resources in a decentralized, transparent, and immutable manner. Finally, we discuss the future research directions towards realizing the true vision of the edge-enabled Metaverse.

preprint2022arXiv

Bi-level Feature Alignment for Versatile Image Translation and Manipulation

Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building the dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-$k$ operation to rank block-wise features followed by dense attention between block features which reduces memory cost substantially. As the top-$k$ operation involves index swapping which precludes the gradient propagation, we approximate the non-differentiable top-$k$ operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds up coordinate for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates mismatch problem by fusing features adaptively according to the reliability of built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art.

preprint2022arXiv

Cross-Modal Graph with Meta Concepts for Video Captioning

Video captioning targets interpreting the complex visual contents as text descriptions, which requires the model to fully understand video scenes including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks to give object proposals and use the attention mechanism to model the relations between objects. They often miss some undefined semantic concepts of the pretrained model and fail to identify exact predicate relationships between objects. In this paper, we investigate an open research task of generating text descriptions for the given videos, and propose Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts. We further build meta concept graphs dynamically with the learned cross-modal meta concepts. We also construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures. We validate the efficacy of our proposed techniques with extensive experiments and achieve state-of-the-art results on two public datasets.

preprint2022arXiv

Decomposing Generation Networks with Structure Prediction for Recipe Generation

Recipe generation from food images and ingredients is a challenging task, which requires the interpretation of the information from another modality. Different from the image captioning task, where the captions usually have one sentence, cooking instructions contain multiple sentences and have obvious structures. To help the model capture the recipe structure and avoid missing some cooking details, we propose a novel framework: Decomposing Generation Networks (DGN) with structure prediction, to get more structured and complete recipe generation outputs. Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase. Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure. Extensive experiments on the challenging large-scale Recipe1M dataset validate the effectiveness of our proposed model, which improves the performance over the state-of-the-art results.

preprint2022arXiv

DualCF: Efficient Model Extraction Attack from Counterfactual Explanations

Cloud service providers have launched Machine-Learning-as-a-Service (MLaaS) platforms to allow users to access large-scale cloudbased models via APIs. In addition to prediction outputs, these APIs can also provide other information in a more human-understandable way, such as counterfactual explanations (CF). However, such extra information inevitably causes the cloud models to be more vulnerable to extraction attacks which aim to steal the internal functionality of models in the cloud. Due to the black-box nature of cloud models, however, a vast number of queries are inevitably required by existing attack strategies before the substitute model achieves high fidelity. In this paper, we propose a novel simple yet efficient querying strategy to greatly enhance the querying efficiency to steal a classification model. This is motivated by our observation that current querying strategies suffer from decision boundary shift issue induced by taking far-distant queries and close-to-boundary CFs into substitute model training. We then propose DualCF strategy to circumvent the above issues, which is achieved by taking not only CF but also counterfactual explanation of CF (CCF) as pairs of training samples for the substitute model. Extensive and comprehensive experimental evaluations are conducted on both synthetic and real-world datasets. The experimental results favorably illustrate that DualCF can produce a high-fidelity model with fewer queries efficiently and effectively.

preprint2022arXiv

Economics of Semantic Communication System: An Auction Approach

Semantic communication technologies enable wireless edge devices to communicate effectively by transmitting semantic meaning of data. Edge components, such as vehicles in next-generation intelligent transport systems, use well-trained semantic models to encode and decode semantic information extracted from raw and sensor data. However, the limitation in computing resources makes it difficult to support the training process of accurate semantic models on edge devices. As such, edge devices can buy the pretrained semantic models from semantic model providers, which is called "semantic model trading". Upon collecting semantic information with the semantic models, the edge devices can then sell the extracted semantic information, e.g., information about urban road conditions or traffic signs, to the interested buyers for profit, which is called "semantic information trading". To facilitate both types of the trades, effective incentive mechanisms should be designed. Thus, in this paper, we propose a hierarchical trading system to support both semantic model trading and semantic information trading jointly. The proposed incentive mechanism helps to maximize the revenue of semantic model providers in the semantic model trading, and effectively incentivizes model providers to participate in the development of semantic communication systems. For semantic information trading, our designed auction approach can support the trading between multiple semantic information sellers and buyers, while ensuring individual rationality, incentive compatibility, and budget balance, and moreover, allowing them achieve higher utilities than the baseline method.

preprint2022arXiv

Enhancing Sequential Recommendation with Graph Contrastive Learning

The sequential recommendation systems capture users' dynamic behavior patterns to predict their next interaction behaviors. Most existing sequential recommendation methods only exploit the local context information of an individual interaction sequence and learn model parameters solely based on the item prediction loss. Thus, they usually fail to learn appropriate sequence representations. This paper proposes a novel recommendation framework, namely Graph Contrastive Learning for Sequential Recommendation (GCL4SR). Specifically, GCL4SR employs a Weighted Item Transition Graph (WITG), built based on interaction sequences of all users, to provide global context information for each interaction and weaken the noise information in the sequence data. Moreover, GCL4SR uses subgraphs of WITG to augment the representation of each interaction sequence. Two auxiliary learning objectives have also been proposed to maximize the consistency between augmented representations induced by the same interaction sequence on WITG, and minimize the difference between the representations augmented by the global context on WITG and the local representation of the original sequence. Extensive experiments on real-world datasets demonstrate that GCL4SR consistently outperforms state-of-the-art sequential recommendation methods.

preprint2022arXiv

Federated Learning for Personalized Humor Recognition

Computational understanding of humor is an important topic under creative language understanding and modeling. It can play a key role in complex human-AI interactions. The challenge here is that human perception of humorous content is highly subjective. The same joke may receive different funniness ratings from different readers. This makes it highly challenging for humor recognition models to achieve personalization in practical scenarios. Existing approaches are generally designed based on the assumption that users have a consensus on whether a given text is humorous or not. Thus, they cannot handle diverse humor preferences well. In this paper, we propose the FedHumor approach for the recognition of humorous content in a personalized manner through Federated Learning (FL). Extending a pre-trained language model, FedHumor guides the fine-tuning process by considering diverse distributions of humor preferences from individuals. It incorporates a diversity adaptation strategy into the FL paradigm to train a personalized humor recognition model. To the best of our knowledge, FedHumor is the first text-based personalized humor recognition model through federated learning. Extensive experiments demonstrate the advantage of FedHumor in recognizing humorous texts compared to nine state-of-the-art humor recognition approaches with superior capability for handling the diversity in humor labels produced by users with diverse preferences.

preprint2022arXiv

From Psychological Curiosity to Artificial Curiosity: Curiosity-Driven Learning in Artificial Intelligence Tasks

Psychological curiosity plays a significant role in human intelligence to enhance learning through exploration and information acquisition. In the Artificial Intelligence (AI) community, artificial curiosity provides a natural intrinsic motivation for efficient learning as inspired by human cognitive development; meanwhile, it can bridge the existing gap between AI research and practical application scenarios, such as overfitting, poor generalization, limited training samples, high computational cost, etc. As a result, curiosity-driven learning (CDL) has become increasingly popular, where agents are self-motivated to learn novel knowledge. In this paper, we first present a comprehensive review on the psychological study of curiosity and summarize a unified framework for quantifying curiosity as well as its arousal mechanism. Based on the psychological principle, we further survey the literature of existing CDL methods in the fields of Reinforcement Learning, Recommendation, and Classification, where both advantages and disadvantages as well as future work are discussed. As a result, this work provides fruitful insights for future CDL research and yield possible directions for further improvement.

preprint2022arXiv

GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems

Much recent progress in task-oriented dialogue (ToD) systems has been driven by available annotation data across multiple domains for training. Over the last few years, there has been a move towards data curation for multilingual ToD systems that are applicable to serve people speaking different languages. However, existing multilingual ToD datasets either have a limited coverage of languages due to the high cost of data curation, or ignore the fact that dialogue entities barely exist in countries speaking these languages. To tackle these limitations, we introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset globalized from an English ToD dataset for three unexplored use cases. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.

preprint2022arXiv

Learning Structural Representations for Recipe Generation and Food Retrieval

Food is significant to human daily life. In this paper, we are interested in learning structural representations for lengthy recipes, that can benefit the recipe generation and food cross-modal retrieval tasks. Different from the common vision-language data, here the food images contain mixed ingredients and target recipes are lengthy paragraphs, where we do not have annotations on structure information. To address the above limitations, we propose a novel method to unsupervisedly learn the sentence-level tree structures for the cooking recipes. Our approach brings together several novel ideas in a systematic framework: (1) exploiting an unsupervised learning approach to obtain the sentence-level tree structure labels before training; (2) generating trees of target recipes from images with the supervision of tree structure labels learned from (1); and (3) integrating the learned tree structures into the recipe generation and food cross-modal retrieval procedure. Our proposed model can produce good-quality sentence-level tree structures and coherent recipes. We achieve the state-of-the-art recipe generation and food cross-modal retrieval performance on the benchmark Recipe1M dataset.

preprint2022arXiv

MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER

Data augmentation is an effective solution to data scarcity in low-resource scenarios. However, when applied to token-level tasks such as NER, data augmentation methods often suffer from token-label misalignment, which leads to unsatsifactory performance. In this work, we propose Masked Entity Language Modeling (MELM) as a novel data augmentation framework for low-resource NER. To alleviate the token-label misalignment issue, we explicitly inject NER labels into sentence context, and thus the fine-tuned MELM is able to predict masked entity tokens by explicitly conditioning on their labels. Thereby, MELM generates high-quality augmented data with novel entities, which provides rich entity regularity knowledge and boosts NER performance. When training data from multiple languages are available, we also integrate MELM with code-mixing for further improvement. We demonstrate the effectiveness of MELM on monolingual, cross-lingual and multilingual NER across various low-resource levels. Experimental results show that our MELM presents substantial improvement over the baseline methods.

preprint2022arXiv

Minimalist and High-performance Conversational Recommendation with Uncertainty Estimation for User Preference

Conversational recommendation system (CRS) is emerging as a user-friendly way to capture users' dynamic preferences over candidate items and attributes. Multi-shot CRS is designed to make recommendations multiple times until the user either accepts the recommendation or leaves at the end of their patience. Existing works are trained with reinforcement learning (RL), which may suffer from unstable learning and prohibitively high demands for computing. In this work, we propose a simple and efficient CRS, MInimalist Non-reinforced Interactive COnversational Recommender Network (MINICORN). MINICORN models the epistemic uncertainty of the estimated user preference and queries the user for the attribute with the highest uncertainty. The system employs a simple network architecture and makes the query-vs-recommendation decision using a single rule. Somewhat surprisingly, this minimalist approach outperforms state-of-the-art RL methods on three real-world datasets by large margins. We hope that MINICORN will serve as a valuable baseline for future research.

preprint2022arXiv

On Non-Random Missing Labels in Semi-Supervised Learning

Semi-Supervised Learning (SSL) is fundamentally a missing label problem, in which the label Missing Not At Random (MNAR) problem is more realistic and challenging, compared to the widely-adopted yet naive Missing Completely At Random assumption where both labeled and unlabeled data share the same class distribution. Different from existing SSL solutions that overlook the role of "class" in causing the non-randomness, e.g., users are more likely to label popular classes, we explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP) that exploits the unlabeled data to train an improved classifier using the biased labeled data. 2) To encourage rare class training, whose model is low-recall but high-precision that discards too many pseudo-labeled data, we propose Class-Aware Imputation (CAI) that dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) Overall, we integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other label bias removal SSL methods. Please check our code at: https://github.com/JoyHuYY1412/CADR-FixMatch.

preprint2022arXiv

Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval

This paper investigates an open research problem of generating text-image pairs to improve the training of fine-grained image-to-text cross-modal retrieval task, and proposes a novel framework for paired data augmentation by uncovering the hidden semantic information of StyleGAN2 model. Specifically, we first train a StyleGAN2 model on the given dataset. We then project the real images back to the latent space of StyleGAN2 to obtain the latent codes. To make the generated images manipulatable, we further introduce a latent space alignment module to learn the alignment between StyleGAN2 latent codes and the corresponding textual caption features. When we do online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to output the latent codes, which are finally fed to StyleGAN2 to generate the augmented images. We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets, in which the promising experimental results demonstrate the augmented text-image pair data can be trained together with the original data to boost the image-to-text cross-modal retrieval performance.

preprint2022arXiv

Quantum-Secured Space-Air-Ground Integrated Networks: Concept, Framework, and Case Study

In the upcoming 6G era, existing terrestrial networks have evolved toward space-air-ground integrated networks (SAGIN), providing ultra-high data rates, seamless network coverage, and ubiquitous intelligence for communications of applications and services. However, conventional communications in SAGIN still face data confidentiality issues. Fortunately, the concept of Quantum Key Distribution (QKD) over SAGIN is able to provide information-theoretic security for secure communications in SAGIN with quantum cryptography. Therefore, in this paper, we propose the quantum-secured SAGIN which is feasible to achieve proven secure communications using quantum mechanics to protect data channels between space, air, and ground nodes. Moreover, we propose a universal QKD service provisioning framework to minimize the cost of QKD services under the uncertainty and dynamics of communications in quantum-secured SAGIN. In this framework, fiber-based QKD services are deployed in passive optical networks with the advantages of low loss and high stability. Moreover, the widely covered and flexible satellite- and UAV-based QKD services are provisioned as a supplement during the real-time data transmission phase. Finally, to examine the effectiveness of the proposed concept and framework, a case study of quantum-secured SAGIN in the Metaverse is conducted where uncertain and dynamic factors of the secure communications in Metaverse applications are effectively resolved in the proposed framework.

preprint2022arXiv

Realizing the Metaverse with Edge Intelligence: A Match Made in Heaven

Dubbed "the successor to the mobile Internet", the concept of the Metaverse has recently exploded in popularity. While there exists lite versions of the Metaverse today, we are still far from realizing the vision of a seamless, shardless, and interoperable Metaverse given the stringent sensing, communication, and computation requirements. Moreover, the birth of the Metaverse comes amid growing privacy concerns among users. In this article, we begin by providing a preliminary definition of the Metaverse. We discuss the architecture of the Metaverse and mainly focus on motivating the convergence of edge intelligence and the infrastructure layer of the Metaverse. We present major edge-based technological developments and their integration to support the Metaverse engine. Then, we present our research attempts through a case study of virtual city development in the Metaverse. Finally, we discuss the open research issues.

preprint2022arXiv

Reliable Distributed Computing for Metaverse: A Hierarchical Game-Theoretic Approach

The metaverse is regarded as a new wave of technological transformation that provides a virtual space for people to interact through digital avatars. To achieve immersive user experiences in the metaverse, real-time rendering is the key technology. However, computing-intensive tasks of real-time rendering from metaverse service providers cannot be processed efficiently on a single resource-limited mobile device. Alternatively, such mobile devices can offload the metaverse rendering tasks to other mobile devices by adopting the collaborative computing paradigm based on Coded Distributed Computing (CDC). Therefore, this paper introduces a hierarchical game-theoretic CDC framework for the metaverse services, especially for the vehicular metaverse. In the framework, idle resources from vehicles, acting as CDC workers, are aggregated to handle intensive computation tasks in the vehicular metaverse. Specifically, in the upper layer, a miner coalition formation game is formulated based on a reputation metric to select reliable workers. To guarantee the reliable management of reputation values, the reputation values calculated based on the subjective logical model are maintained in a blockchain database. In the lower layer, a Stackelberg game-based incentive mechanism is considered to attract reliable workers selected in the upper layer to participate in rendering tasks. The simulation results illustrate that the proposed framework is resistant to malicious workers. Compared with the best-effort worker selection scheme, the proposed scheme can improve the utility of metaverse service provider and the average profit of CDC workers.

preprint2022arXiv

Semantic Information Market For The Metaverse: An Auction Based Approach

In this paper, we address the networking and communications problems of creating a digital copy in the Metaverse digital twin. Specifically, a virtual service provider (VSP) which is responsible for creating and rendering the Metaverse, is required to use the data collected by IoT devices to create the virtual copy of the physical world. However, due to the huge volume of the collected data by IoT devices (e.g., images and videos) and the limited bandwidth, the VSP might become unable to retrieve all the required data from the physical world. Furthermore, the Metaverse needs fast replication (e.g., rendering) of the digital copy adding more restrictions on the data transmission delay. To solve the aforementioned challenges, we propose to equip the IoT devices with semantic information extraction algorithms to minimize the size of the transmitted data over the wireless channels. Since many IoT devices will be interested to sell their semantic information to the VSP, we propose a truthful reverse auction mechanism that helps the VSP select only IoT devices that can improve the quality of its virtual copy of objects through the semantic information. We conduct extensive simulations on a dataset that contains synchronized camera and radar images, and show that our novel design enables a fast replication of the digital copy with high accuracy.

preprint2022arXiv

SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification

Recent advances in skeleton-based person re-identification (re-ID) obtain impressive performance via either hand-crafted skeleton descriptors or skeleton representation learning with deep learning paradigms. However, they typically require skeletal pre-modeling and label information for training, which leads to limited applicability of these methods. In this paper, we focus on unsupervised skeleton-based person re-ID, and present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID. Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme to cluster the most typical skeleton features (skeleton prototypes) from different subsequences randomly masked from raw sequences, and contrast the inherent similarity between skeleton features and different prototypes to learn discriminative skeleton representations without using any label. Then, considering that different subsequences within the same sequence usually enjoy strong correlations due to the nature of motion continuity, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences, so as to encourage learning more effective skeleton representations for person re-ID. Extensive experiments validate that the proposed SimMC outperforms most state-of-the-art skeleton-based methods. We further show its scalability and efficiency in enhancing the performance of existing models. Our codes are available at https://github.com/Kali-Hac/SimMC.

preprint2022arXiv

Skeleton Prototype Contrastive Learning with Multi-Level Graph Relation Modeling for Unsupervised Person Re-Identification

Person re-identification (re-ID) via 3D skeletons is an important emerging topic with many merits. Existing solutions rarely explore valuable body-component relations in skeletal structure or motion, and they typically lack the ability to learn general representations with unlabeled skeleton data for person re-ID. This paper proposes a generic unsupervised Skeleton Prototype Contrastive learning paradigm with Multi-level Graph Relation learning (SPC-MGR) to learn effective representations from unlabeled skeletons to perform person re-ID. Specifically, we first construct unified multi-level skeleton graphs to fully model body structure within skeletons. Then we propose a multi-head structural relation layer to comprehensively capture relations of physically-connected body-component nodes in graphs. A full-level collaborative relation layer is exploited to infer collaboration between motion-related body parts at various levels, so as to capture rich body features and recognizable walking patterns. Lastly, we propose a skeleton prototype contrastive learning scheme that clusters feature-correlative instances of unlabeled graph representations and contrasts their inherent similarity with representative skeleton features ("skeleton prototypes") to learn discriminative skeleton representations for person re-ID. Empirical evaluations show that SPC-MGR significantly outperforms several state-of-the-art skeleton-based methods, and it also achieves highly competitive person re-ID performance for more general scenarios.

preprint2022arXiv

SSD-KD: A Self-supervised Diverse Knowledge Distillation Method for Lightweight Skin Lesion Classification Using Dermoscopic Images

Skin cancer is one of the most common types of malignancy, affecting a large population and causing a heavy economic burden worldwide. Over the last few years, computer-aided diagnosis has been rapidly developed and make great progress in healthcare and medical practices due to the advances in artificial intelligence. However, most studies in skin cancer detection keep pursuing high prediction accuracies without considering the limitation of computing resources on portable devices. In this case, knowledge distillation (KD) has been proven as an efficient tool to help improve the adaptability of lightweight models under limited resources, meanwhile keeping a high-level representation capability. To bridge the gap, this study specifically proposes a novel method, termed SSD-KD, that unifies diverse knowledge into a generic KD framework for skin diseases classification. Our method models an intra-instance relational feature representation and integrates it with existing KD research. A dual relational knowledge distillation architecture is self-supervisedly trained while the weighted softened outputs are also exploited to enable the student model to capture richer knowledge from the teacher model. To demonstrate the effectiveness of our method, we conduct experiments on ISIC 2019, a large-scale open-accessed benchmark of skin diseases dermoscopic images. Experiments show that our distilled lightweight model can achieve an accuracy as high as 85% for the classification tasks of 8 different skin diseases with minimal parameters and computing requirements. Ablation studies confirm the effectiveness of our intra- and inter-instance relational knowledge integration strategy. Compared with state-of-the-art knowledge distillation techniques, the proposed method demonstrates improved performances for multi-diseases classification on the large-scale dermoscopy database.

preprint2022arXiv

Stochastic Coded Offloading Scheme for Unmanned Aerial Vehicle-Assisted Edge Computing

Unmanned aerial vehicles (UAVs) have gained wide research interests due to their technological advancement and high mobility. The UAVs are equipped with increasingly advanced capabilities to run computationally intensive applications enabled by machine learning techniques. However, because of both energy and computation constraints, the UAVs face issues hovering in the sky while performing computation due to weather uncertainty. To overcome the computation constraints, the UAVs can partially or fully offload their computation tasks to the edge servers. In ordinary computation offloading operations, the UAVs can retrieve the result from the returned output. Nevertheless, if the UAVs are unable to retrieve the entire result from the edge servers, i.e., straggling edge servers, this operation will fail. In this paper, we propose a coded distributed computing approach for computation offloading to mitigate straggling edge servers. The UAVs can retrieve the returned result when the number of returned copies is greater than or equal to the recovery threshold. There is a shortfall if the returned copies are less than the recovery threshold. To minimize the cost of the network, energy consumption by the UAVs, and prevent over and under subscription of the resources, we devise a two-phase Stochastic Coded Offloading Scheme (SCOS). In the first phase, the appropriate UAVs are allocated to the charging stations amid weather uncertainty. In the second phase, we use the $z$-stage Stochastic Integer Programming (SIP) to optimize the number of computation subtasks offloaded and computed locally, while taking into account the computation shortfall and demand uncertainty. By using a real dataset, the simulation results show that our proposed scheme is fully dynamic, and minimizes the cost of the network and UAV energy consumption amid stochastic uncertainties.

preprint2022arXiv

Stochastic Resource Allocation for Semantic Communication-aided Virtual Transportation Networks in the Metaverse

The physical-virtual world synchronization to develop the Metaverse will require a massive transmission and exchange of data. In this paper, we introduce semantic communication for the development of virtual transportation networks in the Metaverse. Leveraging the perception capabilities of edge devices, virtual service providers (VSPs) can subscribe to their preferred edge devices to receive the semantic data of interest. However, the demands of the VSPs are highly dependent on the users that they are serving. To address the resource allocation problem amid stochastic user demand, we propose a stochastic semantic transmission scheme (SSTS) based on two-stage stochastic integer programming. Using real data captured by edge devices we deploy in Singapore, the simulation results show that SSTS can minimize the transmission cost of the VSPs while accounting for the users' demand uncertainties.

preprint2022arXiv

Towards Counterfactual Image Manipulation via CLIP

Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. An intriguing yet challenging problem arises: Can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts. Different from in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of semantic knowledge encapsulated in CLIP as well as more delicate handling of editing directions for avoiding being stuck in local minimum or undesired editing. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps CLIP embeddings (of target text) to the latent space and fuses them with latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing while driving by target texts with various counterfactual concepts.

preprint2022arXiv

What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks?

Self-supervised learning establishes a new paradigm of learning representations with much fewer or even no label annotations. Recently there has been remarkable progress on large-scale contrastive learning models which require substantial computing resources, yet such models are not practically optimal for small-scale tasks. To fill the gap, we aim to study contrastive learning on the wearable-based activity recognition task. Specifically, we conduct an in-depth study of contrastive learning from both algorithmic-level and task-level perspectives. For algorithmic-level analysis, we decompose contrastive models into several key components and conduct rigorous experimental evaluations to better understand the efficacy and rationale behind contrastive learning. More importantly, for task-level analysis, we show that the wearable-based signals bring unique challenges and opportunities to existing contrastive models, which cannot be readily solved by existing algorithms. Our thorough empirical studies suggest important practices and shed light on future research challenges. In the meantime, this paper presents an open-source PyTorch library \texttt{CL-HAR}, which can serve as a practical tool for researchers. The library is highly modularized and easy to use, which opens up avenues for exploring novel contrastive models quickly in the future.

preprint2021arXiv

CARE: Commonsense-Aware Emotional Response Generation with Latent Concepts

Rationality and emotion are two fundamental elements of humans. Endowing agents with rationality and emotion has been one of the major milestones in AI. However, in the field of conversational AI, most existing models only specialize in one aspect and neglect the other, which often leads to dull or unrelated responses. In this paper, we hypothesize that combining rationality and emotion into conversational agents can improve response quality. To test the hypothesis, we focus on one fundamental aspect of rationality, i.e., commonsense, and propose CARE, a novel model for commonsense-aware emotional response generation. Specifically, we first propose a framework to learn and construct commonsense-aware emotional latent concepts of the response given an input message and a desired emotion. We then propose three methods to collaboratively incorporate the latent concepts into response generation. Experimental results on two large-scale datasets support our hypothesis and show that our model can produce more accurate and commonsense-aware emotional responses and achieve better human ratings than state-of-the-art models that only specialize in one aspect.

preprint2021arXiv

Distilling Causal Effect of Data in Class-Incremental Learning

We propose a causal framework to explain the catastrophic forgetting in Class-Incremental Learning (CIL) and then derive a novel distillation method that is orthogonal to the existing anti-forgetting techniques, such as data replay and feature/label distillation. We first 1) place CIL into the framework, 2) answer why the forgetting happens: the causal effect of the old data is lost in new training, and then 3) explain how the existing techniques mitigate it: they bring the causal effect back. Based on the framework, we find that although the feature/label distillation is storage-efficient, its causal effect is not coherent with the end-to-end feature learning merit, which is however preserved by data replay. To this end, we propose to distill the Colliding Effect between the old and the new data, which is fundamentally equivalent to the causal effect of data replay, but without any cost of replay storage. Thanks to the causal effect analysis, we can further capture the Incremental Momentum Effect of the data stream, removing which can help to retain the old effect overwhelmed by the new data effect, and thus alleviate the forgetting of the old class in testing. Extensive experiments on three CIL benchmarks: CIFAR-100, ImageNet-Sub&Full, show that the proposed causal effect distillation can improve various state-of-the-art CIL methods by a large margin (0.72%--9.06%).

preprint2021arXiv

Keyword-Guided Neural Conversational Model

We study the problem of imposing conversational goals/keywords on open-domain conversational agents, where the agent is required to lead the conversation to a target keyword smoothly and fast. Solving this problem enables the application of conversational agents in many real-world scenarios, e.g., recommendation and psychotherapy. The dominant paradigm for tackling this problem is to 1) train a next-turn keyword classifier, and 2) train a keyword-augmented response retrieval model. However, existing approaches in this paradigm have two limitations: 1) the training and evaluation datasets for next-turn keyword classification are directly extracted from conversations without human annotations, thus, they are noisy and have low correlation with human judgements, and 2) during keyword transition, the agents solely rely on the similarities between word embeddings to move closer to the target keyword, which may not reflect how humans converse. In this paper, we assume that human conversations are grounded on commonsense and propose a keyword-guided neural conversational model that can leverage external commonsense knowledge graphs (CKG) for both keyword transition and response retrieval. Automatic evaluations suggest that commonsense improves the performance of both next-turn keyword prediction and keyword-augmented response retrieval. In addition, both self-play and human evaluations show that our model produces responses with smoother keyword transition and reaches the target keyword faster than competitive baselines.

preprint2021arXiv

Learning Hierarchical Review Graph Representations for Recommendation

The user review data have been demonstrated to be effective in solving different recommendation problems. Previous review-based recommendation methods usually employ sophisticated compositional models, such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN), to learn semantic representations from the review data for recommendation. However, these methods mainly capture the local dependency between neighbouring words in a word window, and they treat each review equally. Therefore, they may not be effective in capturing the global dependency between words, and tend to be easily biased by noise review information. In this paper, we propose a novel review-based recommendation model, named Review Graph Neural Network (RGNN). Specifically, RGNN builds a specific review graph for each individual user/item, which provides a global view about the user/item properties to help weaken the biases caused by noise review information. A type-aware graph attention mechanism is developed to learn semantic embeddings of words. Moreover, a personalized graph pooling operator is proposed to learn hierarchical representations of the review graph to form the semantic representation for each user/item. We compared RGNN with state-of-the-art review-based recommendation approaches on two real-world datasets. The experimental results indicate that RGNN consistently outperforms baseline methods, in terms of Mean Square Error (MSE).

preprint2021arXiv

Pre-training Graph Transformer with Multimodal Side Information for Recommendation

Side information of items, e.g., images and text description, has shown to be effective in contributing to accurate recommendations. Inspired by the recent success of pre-training models on natural language and images, we propose a pre-training strategy to learn item representations by considering both item side information and their relationships. We relate items by common user activities, e.g., co-purchase, and construct a homogeneous item graph. This graph provides a unified view of item relations and their associated side information in multimodality. We develop a novel sampling algorithm named MCNSampling to select contextual neighbors for each item. The proposed Pre-trained Multimodal Graph Transformer (PMGT) learns item representations with two objectives: 1) graph structure reconstruction, and 2) masked node feature reconstruction. Experimental results on real datasets demonstrate that the proposed PMGT model effectively exploits the multimodality side information to achieve better accuracies in downstream tasks including item recommendation, item classification, and click-through ratio prediction. We also report a case study of testing the proposed PMGT model in an online setting with 600 thousand users.

preprint2020arXiv

A Survey of Coded Distributed Computing

Distributed computing has become a common approach for large-scale computation of tasks due to benefits such as high reliability, scalability, computation speed, and costeffectiveness. However, distributed computing faces critical issues related to communication load and straggler effects. In particular, computing nodes need to exchange intermediate results with each other in order to calculate the final result, and this significantly increases communication overheads. Furthermore, a distributed computing network may include straggling nodes that run intermittently slower. This results in a longer overall time needed to execute the computation tasks, thereby limiting the performance of distributed computing. To address these issues, coded distributed computing (CDC), i.e., a combination of coding theoretic techniques and distributed computing, has been recently proposed as a promising solution. Coding theoretic techniques have proved effective in WiFi and cellular systems to deal with channel noise. Therefore, CDC may significantly reduce communication load, alleviate the effects of stragglers, provide fault-tolerance, privacy and security. In this survey, we first introduce the fundamentals of CDC, followed by basic CDC schemes. Then, we review and analyze a number of CDC approaches proposed to reduce the communication costs, mitigate the straggler effects, and guarantee privacy and security. Furthermore, we present and discuss applications of CDC in modern computer networks. Finally, we highlight important challenges and promising research directions related to CDC

preprint2020arXiv

Contextualized Graph Attention Network for Recommendation with Item Knowledge Graph

Graph neural networks (GNN) have recently been applied to exploit knowledge graph (KG) for recommendation. Existing GNN-based methods explicitly model the dependency between an entity and its local graph context in KG (i.e., the set of its first-order neighbors), but may not be effective in capturing its non-local graph context (i.e., the set of most related high-order neighbors). In this paper, we propose a novel recommendation framework, named Contextualized Graph Attention Network (CGAT), which can explicitly exploit both local and non-local graph context information of an entity in KG. Specifically, CGAT captures the local context information by a user-specific graph attention mechanism, considering a user's personalized preferences on entities. Moreover, CGAT employs a biased random walk sampling process to extract the non-local context of an entity, and utilizes a Recurrent Neural Network (RNN) to model the dependency between the entity and its non-local contextual entities. To capture the user's personalized preferences on items, an item-specific attention mechanism is also developed to model the dependency between a target item and the contextual items extracted from the user's historical behaviors. Experimental results on real datasets demonstrate the effectiveness of CGAT, compared with state-of-the-art KG-based recommendation methods.

preprint2020arXiv

EEG-Based Emotion Recognition Using Regularized Graph Neural Networks

Electroencephalography (EEG) measures the neuronal activities in different brain regions via electrodes. Many existing studies on EEG-based emotion recognition do not fully exploit the topology of EEG channels. In this paper, we propose a regularized graph neural network (RGNN) for EEG-based emotion recognition. RGNN considers the biological topology among different brain regions to capture both local and global relations among different EEG channels. Specifically, we model the inter-channel relations in EEG signals via an adjacency matrix in a graph neural network where the connection and sparseness of the adjacency matrix are inspired by neuroscience theories of human brain organization. In addition, we propose two regularizers, namely node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL), to better handle cross-subject EEG variations and noisy labels, respectively. Extensive experiments on two public datasets, SEED and SEED-IV, demonstrate the superior performance of our model than state-of-the-art models in most experimental settings. Moreover, ablation studies show that the proposed adjacency matrix and two regularizers contribute consistent and significant gain to the performance of our RGNN model. Finally, investigations on the neuronal activities reveal important brain regions and inter-channel relations for EEG-based emotion recognition.

preprint2020arXiv

Federated Learning in Mobile Edge Networks: A Comprehensive Survey

In recent years, mobile devices are equipped with increasingly advanced sensing and computing capabilities. Coupled with advancements in Deep Learning (DL), this opens up countless possibilities for meaningful applications. Traditional cloudbased Machine Learning (ML) approaches require the data to be centralized in a cloud server or data center. However, this results in critical issues related to unacceptable latency and communication inefficiency. To this end, Mobile Edge Computing (MEC) has been proposed to bring intelligence closer to the edge, where data is produced. However, conventional enabling technologies for ML at mobile edge networks still require personal data to be shared with external parties, e.g., edge servers. Recently, in light of increasingly stringent data privacy legislations and growing privacy concerns, the concept of Federated Learning (FL) has been introduced. In FL, end devices use their local data to train an ML model required by the server. The end devices then send the model updates rather than raw data to the server for aggregation. FL can serve as an enabling technology in mobile edge networks since it enables the collaborative training of an ML model and also enables DL for mobile edge network optimization. However, in a large-scale and complex mobile edge network, heterogeneous devices with varying constraints are involved. This raises challenges of communication costs, resource allocation, and privacy and security in the implementation of FL at scale. In this survey, we begin with an introduction to the background and fundamentals of FL. Then, we highlight the aforementioned challenges of FL implementation and review existing solutions. Furthermore, we present the applications of FL for mobile edge network optimization. Finally, we discuss the important challenges and future research directions in FL

preprint2020arXiv

Federated Learning in the Sky: Aerial-Ground Air Quality Sensing Framework with UAV Swarms

Due to air quality significantly affects human health, it is becoming increasingly important to accurately and timely predict the Air Quality Index (AQI). To this end, this paper proposes a new federated learning-based aerial-ground air quality sensing framework for fine-grained 3D air quality monitoring and forecasting. Specifically, in the air, this framework leverages a light-weight Dense-MobileNet model to achieve energy-efficient end-to-end learning from haze features of haze images taken by Unmanned Aerial Vehicles (UAVs) for predicting AQI scale distribution. Furthermore, the Federated Learning Framework not only allows various organizations or institutions to collaboratively learn a well-trained global model to monitor AQI without compromising privacy, but also expands the scope of UAV swarms monitoring. For ground sensing systems, we propose a Graph Convolutional neural network-based Long Short-Term Memory (GC-LSTM) model to achieve accurate, real-time and future AQI inference. The GC-LSTM model utilizes the topological structure of the ground monitoring station to capture the spatio-temporal correlation of historical observation data, which helps the aerial-ground sensing system to achieve accurate AQI inference. Through extensive case studies on a real-world dataset, numerical results show that the proposed framework can achieve accurate and energy-efficient AQI sensing without compromising the privacy of raw data.

preprint2020arXiv

Incentive Mechanism Design for Resource Sharing in Collaborative Edge Learning

In 5G and Beyond networks, Artificial Intelligence applications are expected to be increasingly ubiquitous. This necessitates a paradigm shift from the current cloud-centric model training approach to the Edge Computing based collaborative learning scheme known as edge learning, in which model training is executed at the edge of the network. In this article, we first introduce the principles and technologies of collaborative edge learning. Then, we establish that a successful, scalable implementation of edge learning requires the communication, caching, computation, and learning resources (3C-L) of end devices and edge servers to be leveraged jointly in an efficient manner. However, users may not consent to contribute their resources without receiving adequate compensation. In consideration of the heterogeneity of edge nodes, e.g., in terms of available computation resources, we discuss the challenges of incentive mechanism design to facilitate resource sharing for edge learning. Furthermore, we present a case study involving optimal auction design using Deep Learning to price fresh data contributed for edge learning. The performance evaluation shows the revenue maximizing properties of our proposed auction over the benchmark schemes.

preprint2020arXiv

Joint Auction-Coalition Formation Framework for Communication-Efficient Federated Learning in UAV-Enabled Internet of Vehicles

Due to the advanced capabilities of the Internet of Vehicles (IoV) components such as vehicles, Roadside Units (RSUs) and smart devices as well as the increasing amount of data generated, Federated Learning (FL) becomes a promising tool given that it enables privacy-preserving machine learning that can be implemented in the IoV. However, the performance of the FL suffers from the failure of communication links and missing nodes, especially when continuous exchanges of model parameters are required. Therefore, we propose the use of Unmanned Aerial Vehicles (UAVs) as wireless relays to facilitate the communications between the IoV components and the FL server and thus improving the accuracy of the FL. However, a single UAV may not have sufficient resources to provide services for all iterations of the FL process. In this paper, we present a joint auction-coalition formation framework to solve the allocation of UAV coalitions to groups of IoV components. Specifically, the coalition formation game is formulated to maximize the sum of individual profits of the UAVs. The joint auction-coalition formation algorithm is proposed to achieve a stable partition of UAV coalitions in which an auction scheme is applied to solve the allocation of UAV coalitions. The auction scheme is designed to take into account the preferences of IoV components over heterogeneous UAVs. The simulation results show that the grand coalition, where all UAVs join a single coalition, is not always stable due to the profit-maximizing behavior of the UAVs. In addition, we show that as the cooperation cost of the UAVs increases, the UAVs prefer to support the IoV components independently and not to form any coalition.

preprint2020arXiv

Learning to Segment the Tail

Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data. We propose a "divide&conquer" strategy for the challenging LVIS task: divide the whole data into balanced parts and then apply incremental learning to conquer each one. This derives a novel learning paradigm: class-incremental few-shot learning, which is especially effective for the challenge evolving over time: 1) the class imbalance among the old-class knowledge review and 2) the few-shot data in new-class learning. We call our approach Learning to Segment the Tail (LST). In particular, we design an instance-level balanced replay scheme, which is a memory-efficient approximation to balance the instance-level samples from the old-class images. We also propose to use a meta-module for new-class learning, where the module parameters are shared across incremental phases, gaining the learning-to-learn knowledge incrementally, from the data-rich head to the data-poor tail. We empirically show that: at the expense of a little sacrifice of head-class forgetting, we can gain a significant 8.3% AP improvement for the tail classes with less than 10 instances, achieving an overall 2.0% AP boost for the whole 1,230 classes.

preprint2020arXiv

Scalable and Communication-efficient Decentralized Federated Edge Learning with Multi-blockchain Framework

The emerging Federated Edge Learning (FEL) technique has drawn considerable attention, which not only ensures good machine learning performance but also solves "data island" problems caused by data privacy concerns. However, large-scale FEL still faces following crucial challenges: (i) there lacks a secure and communication-efficient model training scheme for FEL; (2) there is no scalable and flexible FEL framework for updating local models and global model sharing (trading) management. To bridge the gaps, we first propose a blockchain-empowered secure FEL system with a hierarchical blockchain framework consisting of a main chain and subchains. This framework can achieve scalable and flexible decentralized FEL by individually manage local model updates or model sharing records for performance isolation. A Proof-of-Verifying consensus scheme is then designed to remove low-quality model updates and manage qualified model updates in a decentralized and secure manner, thereby achieving secure FEL. To improve communication efficiency of the blockchain-empowered FEL, a gradient compression scheme is designed to generate sparse but important gradients to reduce communication overhead without compromising accuracy, and also further strengthen privacy preservation of training data. The security analysis and numerical results indicate that the proposed schemes can achieve secure, scalable, and communication-efficient decentralized FEL.

preprint2020arXiv

Structure-Aware Generation Network for Recipe Generation from Images

Sharing food has become very popular with the development of social media. For many real-world applications, people are keen to know the underlying recipes of a food item. In this paper, we are interested in automatically generating cooking instructions for food. We investigate an open research task of generating cooking instructions based on only food images and ingredients, which is similar to the image captioning task. However, compared with image captioning datasets, the target recipes are long-length paragraphs and do not have annotations on structure information. To address the above limitations, we propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task. Our approach brings together several novel ideas in a systematic framework: (1) exploiting an unsupervised learning approach to obtain the sentence-level tree structure labels before training; (2) generating trees of target recipes from images with the supervision of tree structure labels learned from (1); and (3) integrating the inferred tree structures with the recipe generation procedure. Our proposed model can produce high-quality and coherent recipes, and achieve the state-of-the-art performance on the benchmark Recipe1M dataset.

preprint2020arXiv

Towards Federated Learning in UAV-Enabled Internet of Vehicles: A Multi-Dimensional Contract-Matching Approach

Coupled with the rise of Deep Learning, the wealth of data and enhanced computation capabilities of Internet of Vehicles (IoV) components enable effective Artificial Intelligence (AI) based models to be built. Beyond ground data sources, Unmanned Aerial Vehicles (UAVs) based service providers for data collection and AI model training, i.e., Drones-as-a-Service, is increasingly popular in recent years. However, the stringent regulations governing data privacy potentially impedes data sharing across independently owned UAVs. To this end, we propose the adoption of a Federated Learning (FL) based approach to enable privacy-preserving collaborative Machine Learning across a federation of independent DaaS providers for the development of IoV applications, e.g., for traffic prediction and car park occupancy management. Given the information asymmetry and incentive mismatches between the UAVs and model owners, we leverage on the self-revealing properties of a multi-dimensional contract to ensure truthful reporting of the UAV types, while accounting for the multiple sources of heterogeneity, e.g., in sensing, computation, and transmission costs. Then, we adopt the Gale-Shapley algorithm to match the lowest cost UAV to each subregion. The simulation results validate the incentive compatibility of our contract design, and shows the efficiency of our matching, thus guaranteeing profit maximization for the model owner amid information asymmetry.