Researcher profile

Huawei Shen

Huawei Shen contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
29works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

29 published item(s)

preprint2026arXiv

Beyond Entangled Planning: Task-Decoupled Planning for Long-Horizon Agents

Recent advances in large language models (LLMs) have enabled agents to autonomously execute complex, long-horizon tasks, yet planning remains a primary bottleneck for reliable task execution. Existing methods typically fall into two paradigms: step-wise planning, which is reactive but often short-sighted; and one-shot planning, which generates a complete plan upfront yet is brittle to execution errors. Crucially, both paradigms suffer from entangled contexts, where the agent must reason over a monolithic history spanning multiple sub-tasks. This entanglement increases cognitive load and lets local errors propagate across otherwise independent decisions, making recovery computationally expensive. To address this, we propose Task-Decoupled Planning (TDP), a training-free framework that replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph (DAG) of sub-goals via a Supervisor. Using a Planner and Executor with scoped contexts, TDP confines reasoning and replanning to the active sub-task. This isolation prevents error propagation and corrects deviations locally without disrupting the workflow. Results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines while reducing token consumption by up to 82%, demonstrating that sub-task decoupling improves both robustness and efficiency for long-horizon agents.

preprint2026arXiv

Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generates workflows either at task level or query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the idea of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework \textbf{SCALE}, which means \underline{\textbf{S}}elf prediction of the optimizer with few shot \underline{\textbf{CAL}}ibration for \underline{\textbf{E}}valuation instead of full validation execution. Extensive experiments demonstrate that \textbf{SCALE} maintains competitive performance, with an average degradation of just 0.61\% compared to existing approach across multiple datasets, while cutting overall token usage by up to 83\%.

preprint2026arXiv

GIFT: Games as Informal Training for Generalizable LLMs

While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the "practical wisdom" and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit "OR" objective, our framework employs sequential task composition to enforce an explicit "AND" objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who's the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model's generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.

preprint2026arXiv

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where jointly reinforcing multiple correct latent paths can produce an invalid averaged state. To address them, we propose \textbf{Latent-GRPO}, which combines invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Across four low-difficulty benchmarks (e.g., GSM8K-Aug) and four high-difficulty benchmarks (e.g., AIME), Latent-GRPO improves over its latent initialization by 7.86 Pass@1 points on low-difficulty tasks and surpasses explicit GRPO by 4.27 points on high-difficulty tasks while using 3--4$\times$ shorter reasoning chains. It also achieves stronger pass@$k$ performance under Gumbel sampling. These results establish Latent-GRPO as an effective approach for stable and efficient latent reasoning.

preprint2026arXiv

Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectories demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.

preprint2026arXiv

Multi-Personality Generation of LLMs at Decoding-time

Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a "free lunch" to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG .

preprint2026arXiv

Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification

Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content, restricting their safe deployment. While traditional methods (e.g., alignment) adjust output preferences, they fail to eliminate underlying toxic regions in parameters, leaving models vulnerable to adversarial attacks. Prior mechanistic studies characterize toxic regions as "toxic vectors" or "layer-wise subspaces", yet our analysis identifies critical limitations: i) Removed toxic vectors can be reconstructed via linear combinations of non-toxic vectors, demanding targeting of entire toxic subspace; ii) Contrastive objective over limited samples inject noise into layer-wise subspaces, hindering stable extraction. These highlight the challenge of identifying robust toxic subspace and removing them. Therefore, we propose GLOSS (GLobal tOxic Subspace Suppression), a lightweight method that mitigates toxicity by identifying and eliminating this global subspace from FFN parameters. Experiments on LLMs (e.g., Qwen3) show GLOSS achieves SOTA detoxification while preserving general capabilities without requiring large-scale retraining. WARNING: This paper contains context which is toxic in nature.

preprint2026arXiv

ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state-of-the-art performance on proactive tasks while competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

preprint2026arXiv

The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis

Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.

preprint2025arXiv

Large Language Model Sourcing: A Survey

Due to the black-box nature of large language models (LLMs) and the realism of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement have become significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Moreover, a unified dual-paradigm taxonomy is proposed that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLMs deployment in real-world applications.

preprint2022arXiv

A Roadmap for Big Model

With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the BM application in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts, they are Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. In each topic, we summarize clearly the current studies and propose some future research directions. At the end of this paper, we conclude the further development of BMs in a more general view.

preprint2022arXiv

Adversarial for Social Privacy: A Poisoning Strategy to Degrade User Identity Linkage

Privacy issues on social networks have been extensively discussed in recent years. The user identity linkage (UIL) task, aiming at finding corresponding users across different social networks, would be a threat to privacy if unethically applied. The sensitive user information might be detected through connected identities. A promising and novel solution to this issue is to design an adversarial strategy to degrade the matching performance of UIL models. However, most existing adversarial attacks on graphs are designed for models working in a single network, while UIL is a cross-network learning task. Meanwhile, privacy protection against UIL works unilaterally in real-world scenarios, i.e., the service provider can only add perturbations to its own network to protect its users from being linked. To tackle these challenges, this paper proposes a novel adversarial attack strategy that poisons one target network to prevent its nodes from being linked to other networks by UIL algorithms. Specifically, we reformalize the UIL problem in the perspective of kernelized topology consistency and convert the attack objective to maximizing the structural changes within the target network before and after attacks. A novel graph kernel is then defined with Earth mover&#39;s distance (EMD) on the edge-embedding space. In terms of efficiency, a fast attack strategy is proposed by greedy searching and replacing EMD with its lower bound. Results on three real-world datasets indicate that the proposed attacks can best fool a wide range of UIL models and reach a balance between attack effectiveness and imperceptibility.

preprint2022arXiv

Conditional GANs with Auxiliary Discriminative Classifier

Conditional generative models aim to learn the underlying joint distribution of data and labels to achieve conditional data generation. Among them, the auxiliary classifier generative adversarial network (AC-GAN) has been widely used, but suffers from the problem of low intra-class diversity of the generated samples. The fundamental reason pointed out in this paper is that the classifier of AC-GAN is generator-agnostic, which therefore cannot provide informative guidance for the generator to approach the joint distribution, resulting in a minimization of the conditional entropy that decreases the intra-class diversity. Motivated by this understanding, we propose a novel conditional GAN with an auxiliary discriminative classifier (ADC-GAN) to resolve the above problem. Specifically, the proposed auxiliary discriminative classifier becomes generator-aware by recognizing the class-labels of the real data and the generated data discriminatively. Our theoretical analysis reveals that the generator can faithfully learn the joint distribution even without the original discriminator, making the proposed ADC-GAN robust to the value of the coefficient hyperparameter and the selection of the GAN loss, and stable during training. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of ADC-GAN in conditional generative modeling compared to state-of-the-art classifier-based and projection-based conditional GANs.

preprint2022arXiv

Few-Shot Stance Detection via Target-Aware Prompt Distillation

Stance detection aims to identify whether the author of a text is in favor of, against, or neutral to a given target. The main challenge of this task comes two-fold: few-shot learning resulting from the varying targets and the lack of contextual information of the targets. Existing works mainly focus on solving the second issue by designing attention-based models or introducing noisy external knowledge, while the first issue remains under-explored. In this paper, inspired by the potential capability of pre-trained language models (PLMs) serving as knowledge bases and few-shot learners, we propose to introduce prompt-based fine-tuning for stance detection. PLMs can provide essential contextual information for the targets and enable few-shot learning via prompts. Considering the crucial role of the target in stance detection task, we design target-aware prompts and propose a novel verbalizer. Instead of mapping each label to a concrete word, our verbalizer maps each label to a vector and picks the label that best captures the correlation between the stance and the target. Moreover, to alleviate the possible defect of dealing with varying targets with a single hand-crafted prompt, we propose to distill the information learned from multiple prompts. Experimental results show the superior performance of our proposed model in both full-data and few-shot scenarios.

preprint2022arXiv

INMO: A Model-Agnostic and Scalable Module for Inductive Collaborative Filtering

Collaborative filtering is one of the most common scenarios and popular research topics in recommender systems. Among existing methods, latent factor models, i.e., learning a specific embedding for each user/item by reconstructing the observed interaction matrix, have shown excellent performances. However, such user-specific and item-specific embeddings are intrinsically transductive, making it difficult to deal with new users and new items unseen during training. Besides, the number of model parameters heavily depends on the number of all users and items, restricting its scalability to real-world applications. To solve the above challenges, in this paper, we propose a novel model-agnostic and scalable Inductive Embedding Module for collaborative filtering, namely INMO. INMO generates the inductive embeddings for users (items) by characterizing their interactions with some template items (template users), instead of employing an embedding lookup table. Under the theoretical analysis, we further propose an effective indicator for the selection of template users/items. Our proposed INMO can be attached to existing latent factor models as a pre-module, inheriting the expressiveness of backbone models, while bringing the inductive ability and reducing model parameters. We validate the generality of INMO by attaching it to both Matrix Factorization (MF) and LightGCN, which are two representative latent factor models for collaborative filtering. Extensive experiments on three public benchmarks demonstrate the effectiveness and efficiency of INMO in both transductive and inductive recommendation scenarios.

preprint2022arXiv

Learning node embeddings via summary graphs: a brief theoretical analysis

Graph representation learning plays an important role in many graph mining applications, but learning embeddings of large-scale graphs remains a problem. Recent works try to improve scalability via graph summarization -- i.e., they learn embeddings on a smaller summary graph, and then restore the node embeddings of the original graph. However, all existing works depend on heuristic designs and lack theoretical analysis. Different from existing works, we contribute an in-depth theoretical analysis of three specific embedding learning methods based on introduced kernel matrix, and reveal that learning embeddings via graph summarization is actually learning embeddings on a approximate graph constructed by the configuration model. We also give analysis about approximation error. To the best of our knowledge, this is the first work to give theoretical analysis of this approach. Furthermore, our analysis framework gives interpretation of some existing methods and provides great insights for future work on this problem.

preprint2022arXiv

LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback

Pseudo-relevance feedback (PRF) has proven to be an effective query reformulation technique to improve retrieval accuracy. It aims to alleviate the mismatch of linguistic expressions between a query and its potential relevant documents. Existing PRF methods independently treat revised queries originating from the same query but using different numbers of feedback documents, resulting in severe query drift. Without comparing the effects of two different revisions from the same query, a PRF model may incorrectly focus on the additional irrelevant information increased in the more feedback, and thus reformulate a query that is less effective than the revision using the less feedback. Ideally, if a PRF model can distinguish between irrelevant and relevant information in the feedback, the more feedback documents there are, the better the revised query will be. To bridge this gap, we propose the Loss-over-Loss (LoL) framework to compare the reformulation losses between different revisions of the same query during training. Concretely, we revise an original query multiple times in parallel using different amounts of feedback and compute their reformulation losses. Then, we introduce an additional regularization loss on these reformulation losses to penalize revisions that use more feedback but gain larger losses. With such comparative regularization, the PRF model is expected to learn to suppress the extra increased irrelevant information by comparing the effects of different revised queries. Further, we present a differentiable query reformulation method to implement this framework. This method revises queries in the vector space and directly optimizes the retrieval performance of query vectors, applicable for both sparse and dense retrieval models. Empirical evaluation demonstrates the effectiveness and robustness of our method for two typical sparse and dense retrieval models.

preprint2022arXiv

Match-Prompt: Improving Multi-task Generalization Ability for Neural Text Matching via Prompt Learning

Text matching is a fundamental technique in both information retrieval and natural language processing. Text matching tasks share the same paradigm that determines the relationship between two given texts. The relationships vary from task to task, e.g.~relevance in document retrieval, semantic alignment in paraphrase identification and answerable judgment in question answering. However, the essential signals for text matching remain in a finite scope, i.e.~exact matching, semantic matching, and inference matching. Ideally, a good text matching model can learn to capture and aggregate these signals for different matching tasks to achieve competitive performance, while recent state-of-the-art text matching models, e.g.~Pre-trained Language Models (PLMs), are hard to generalize. It is because the end-to-end supervised learning on task-specific dataset makes model overemphasize the data sample bias and task-specific signals instead of the essential matching signals. To overcome this problem, we adopt a specialization-generalization training strategy and refer to it as Match-Prompt. In specialization stage, descriptions of different matching tasks are mapped to a few prompt tokens. In generalization stage, matching model explores the essential matching signals by being trained on diverse matching tasks. High diverse matching tasks avoid model fitting the data bias on a specific task, so that model can focus on learning the essential matching signals. Meanwhile, the prompt tokens obtained in the first step help the model distinguish different task-specific matching signals. Experimental results on public datasets show that Match-Prompt can improve multi-task generalization capability of PLMs in text matching and yield better in-domain multi-task, out-of-domain multi-task and new task adaptation performance than multi-task and task-specific models trained by previous fine-tuning paradigm.

preprint2022arXiv

Modeling and Predicting Popularity Dynamics via Deep Learning Attention Mechanism

An ability to predict the popularity dynamics of individual items within a complex evolving system has important implications in a wide range of domains. Here we propose a deep learning attention mechanism to model the process through which individual items gain their popularity. We analyze the interpretability of the model with the four key phenomena confirmed independently in the previous studies of long-term popularity dynamics quantification, including the intrinsic quality, the aging effect, the recency effect and the Matthew effect. We analyze the effectiveness of introducing attention model in popularity dynamics prediction. Extensive experiments on a real-large citation data set demonstrate that the designed deep learning attention mechanism possesses remarkable power at predicting the long-term popularity dynamics. It consistently outperforms the existing methods, and achieves a significant performance improvement.

preprint2022arXiv

Multi-scale Anomaly Detection for Big Time Series of Industrial Sensors

Given a multivariate big time series, can we detect anomalies as soon as they occur? Many existing works detect anomalies by learning how much a time series deviates away from what it should be in the reconstruction framework. However, most models have to cut the big time series into small pieces empirically since optimization algorithms cannot afford such a long series. The question is raised: do such cuts pollute the inherent semantic segments, like incorrect punctuation in sentences? Therefore, we propose a reconstruction-based anomaly detection method, MissGAN, iteratively learning to decode and encode naturally smooth time series in coarse segments, and finding out a finer segment from low-dimensional representations based on HMM. As a result, learning from multi-scale segments, MissGAN can reconstruct a meaningful and robust time series, with the help of adversarial regularization and extra conditional states. MissGAN does not need labels or only needs labels of normal instances, making it widely applicable. Experiments on industrial datasets of real water network sensors show our MissGAN outperforms the baselines with scalability. Besides, we use a case study on the CMU Motion dataset to demonstrate that our model can well distinguish unexpected gestures from a given conditional motion.

preprint2022arXiv

PREP: Pre-training with Temporal Elapse Inference for Popularity Prediction

Predicting the popularity of online content is a fundamental problem in various applications. One practical challenge takes roots in the varying length of observation time or prediction horizon, i.e., a good model for popularity prediction is desired to handle various prediction settings. However, most existing methods adopt a separate training paradigm for each prediction setting and the obtained model for one setting is difficult to be generalized to others, causing a great waste of computational resources and a large demand for downstream labels. To solve the above issues, we propose a novel pre-training framework for popularity prediction, namely PREP, aiming to pre-train a general representation model from the readily available unlabeled diffusion data, which can be effectively transferred into various prediction settings. We design a novel pretext task for pre-training, i.e., temporal elapse inference for two randomly sampled time slices of popularity dynamics, impelling the representation model to learn intrinsic knowledge about popularity dynamics. Experimental results conducted on two real datasets demonstrate the generalization and efficiency of the pre-training framework for different popularity prediction task settings.

preprint2022arXiv

Self-Supervised GANs with Label Augmentation

Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling due to the fact that their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels to be aware of both the generator distribution and the data distribution under every transformation, and then provide the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator could converge to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data augmentation GANs on both generative modeling and representation learning across benchmark datasets.

preprint2021arXiv

Beyond Low-frequency Information in Graph Convolutional Networks

Graph neural networks (GNNs) have been proven to be effective in various network-related tasks. Most existing GNNs usually exploit the low-frequency signals of node features, which gives rise to one fundamental question: is the low-frequency information all we need in the real world applications? In this paper, we first present an experimental investigation assessing the roles of low-frequency and high-frequency signals, where the results clearly show that exploring low-frequency signal only is distant from learning an effective node representation in different scenarios. How can we adaptively learn more information beyond low-frequency information in GNNs? A well-informed answer can help GNNs enhance the adaptability. We tackle this challenge and propose a novel Frequency Adaptation Graph Convolutional Networks (FAGCN) with a self-gating mechanism, which can adaptively integrate different signals in the process of message passing. For a deeper understanding, we theoretically analyze the roles of low-frequency signals and high-frequency signals on learning node representations, which further explains why FAGCN can perform well on different types of networks. Extensive experiments on six real-world networks validate that FAGCN not only alleviates the over-smoothing problem, but also has advantages over the state-of-the-arts.

preprint2021arXiv

How Medical Crowdfunding Helps People? A Large-scale Case Study on Waterdrop Fundraising

While online medical crowdfunding achieved tremendous success, quantitative study about whether and how medical crowdfunding helps people remains little explored. In this paper, we empirically study how online medical crowdfunding helps people using more than 27, 000 fundraising cases in Waterdrop Fundraising, one of the most popular online medical crowdfunding platforms in China. We find that the amount of money obtained by fundraisers is broadly distributed, i.e., a majority of lowly donated cases coexist with a handful of very successful cases. We further investigate the factors that potentially correlate with the success of medical fundraising cases. Profile information of fundraising cases, e.g., geographic information of fundraisers, affects the donated amounts, since detailed description may increase the credibility of a fundraising case. One prominent finding lies in the effect of social network on the success of fundraising cases: the spread of fundraising information along social network is a key factor of fundraising success, and the social capital of fundraisers play an important role in fundraising. Finally, we conduct prediction of donations using machine learning models, verifying the effect of potential factors to the success of medical crowdfunding. Altogether, this work presents a data-driven view of medical fundraising on the web and opens a door to understanding medical crowdfunding.

preprint2021arXiv

Modelling Universal Order Book Dynamics in Bitcoin Market

Understanding the emergence of universal features such as the stylized facts in markets is a long-standing challenge that has drawn much attention from economists and physicists. Most existing models, such as stochastic volatility models, focus mainly on price changes, neglecting the complex trading dynamics. Recently, there are increasing studies on order books, thanks to the availability of large-scale trading datasets, aiming to understand the underlying mechanisms governing the market dynamics. In this paper, we collect order-book datasets of Bitcoin platforms across three countries over millions of users and billions of daily turnovers. We find a 1+1D field theory, govern by a set of KPZ-like stochastic equations, predicts precisely the order book dynamics observed in empirical data. Despite the microscopic difference of markets, we argue the proposed effective field theory captures the correct universality class of market dynamics. We also show that the model agrees with the existing stochastic volatility models at the long-wavelength limit.

preprint2020arXiv

ANAE: Learning Node Context Representation for Attributed Network Embedding

Attributed network embedding aims to learn low-dimensional node representations from both network structure and node attributes. Existing methods can be categorized into two groups: (1) the first group learns two separated node representations from network structure and node attribute respectively and concatenates them together; (2) the other group obtains node representations by translating node attributes into network structure or vice versa. However, both groups have their drawbacks. The first group neglects the correlation between network structure and node attributes, while the second group assumes strong dependence between these two types of information. In this paper, we address attributed network embedding from a novel perspective, i.e., learning node context representation for each node via modeling its attributed local subgraph. To achieve this goal, we propose a novel attributed network auto-encoder framework, namely ANAE. For a target node, ANAE first aggregates the attribute information from its attributed local subgraph, obtaining its low-dimensional representation. Next, ANAE diffuses the representation of the target node to nodes in its local subgraph to reconstruct their attributes. Such an encoder-decoder framework allows the learned representations to better preserve the context information manifested in both network structure and node attributes, thus having high capacity to learn good node representations for attributed network. Extensive experimental results on real-world datasets demonstrate that the proposed framework outperforms the state-of-the-art approaches at the tasks of link prediction and node classification.

preprint2020arXiv

Graph Convolutional Networks using Heat Kernel for Semi-supervised Learning

Graph convolutional networks gain remarkable success in semi-supervised learning on graph structured data. The key to graph-based semisupervised learning is capturing the smoothness of labels or features over nodes exerted by graph structure. Previous methods, spectral methods and spatial methods, devote to defining graph convolution as a weighted average over neighboring nodes, and then learn graph convolution kernels to leverage the smoothness to improve the performance of graph-based semi-supervised learning. One open challenge is how to determine appropriate neighborhood that reflects relevant information of smoothness manifested in graph structure. In this paper, we propose GraphHeat, leveraging heat kernel to enhance low-frequency filters and enforce smoothness in the signal variation on the graph. GraphHeat leverages the local structure of target node under heat diffusion to determine its neighboring nodes flexibly, without the constraint of order suffered by previous methods. GraphHeat achieves state-of-the-art results in the task of graph-based semi-supervised classification across three benchmark datasets: Cora, Citeseer and Pubmed.

preprint2020arXiv

Label-Consistency based Graph Neural Networks for Semi-supervised Node Classification

Graph neural networks (GNNs) achieve remarkable success in graph-based semi-supervised node classification, leveraging the information from neighboring nodes to improve the representation learning of target node. The success of GNNs at node classification depends on the assumption that connected nodes tend to have the same label. However, such an assumption does not always work, limiting the performance of GNNs at node classification. In this paper, we propose label-consistency based graph neural network(LC-GNN), leveraging node pairs unconnected but with the same labels to enlarge the receptive field of nodes in GNNs. Experiments on benchmark datasets demonstrate the proposed LC-GNN outperforms traditional GNNs in graph-based semi-supervised node classification.We further show the superiority of LC-GNN in sparse scenarios with only a handful of labeled nodes.

preprint2020arXiv

Summarizing graphs using the configuration model

Given a large graph, how can we summarize it with fewer nodes and edges while maintaining its key properties, such as spectral property? Although graphs play more and more important roles in many real-world applications, the growth of their size presents great challenges to graph analysis. As a solution, graph summarization, which aims to find a compact representation that preserves the important properties of a given graph, has received much attention, and numerous algorithms have been developed for it. However, most of the algorithms adopt the uniform reconstruction scheme, which is based on an unrealistic assumption that edges are uniformly distributed. In this work, we propose a novel and realistic reconstruction scheme, which preserves the degree of nodes, and we develop an efficient graph summarization algorithm called DPGS based on the Minimum Description Length principle. We theoretically analyze the difference between the original and summary graphs from a spectral perspective, and we perform extensive experiments on multiple real-world datasets. The results show that DPGS yields compact representation that preserves the essential properties of the original graph.