Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
91 works
0 followers
17 topics
4 close collaborators

Published work

91 published item(s)

preprint · 2026 · arXiv

Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks

Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poisoning-based threats, leaving evasion-based threats underexplored. In this work, we first identify a cross-modal gradient mismatch under the multi-user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst-case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT-MC). UAT-MC tackles the challenge of unknown targeted items in evasion-based attacks (as opposed to poisoning-based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT-MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense-accuracy trade-off. Code is available at https://github.com/gmXian/UAT-MC.

preprint · 2026 · arXiv

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and by writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

preprint · 2026 · arXiv

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

Dynamic graphs are ubiquitous in real-world systems, and building generalizable dynamic Graph Foundation Models has become a frontier in graph learning. However, dynamic graphs from different domains pose fundamental challenges to unified modeling, as their semantic and temporal patterns are inherently inconsistent, making the multi-domain pre-training difficult. Consequently, the widely used "pretrain-then-finetune" paradigm often suffers from severe negative knowledge transfer. To the best of our knowledge, there exists no multi-domain dynamic GFM. In this work, we propose DyGFM, a Dynamic Graph Foundation Model over multiple domains based on decoupled and divergence-conditioned prompting. To disentangle transferable semantics from the domain-specific dynamics, we introduce a dual-branch pre-training strategy with semantic-temporal decoupling. To alleviate negative transfer during domain adaptation, we further develop a cross-domain routing mechanism with divergence-aware expert selection. To enable efficient downstream fine-tuning, we design a divergence-conditioned prompt generator that injects lightweight, learnable graph prompts tailored to semantic and temporal traits. Extensive experiments on continuous dynamic graph benchmarks demonstrate that DyGFM consistently outperforms 12 state-of-the-art baselines on both node classification and link prediction tasks, achieving superior effectiveness and efficiency.

preprint · 2026 · arXiv

Effective and Unsupervised Social Event Detection and Evolution via RAG and Structural Entropy

With the growing scale of social media, social event detection and evolution modeling have attracted increasing attention. Graph neural networks (GNNs) and transformer-based pre-trained language models (PLMs) have become mainstream approaches in this area. However, existing methods still face three major challenges. First, the sheer volume of social media messages makes learning resource-intensive. Second, the fragmentation of social media messages often impedes the model's ability to capture a comprehensive view of the events. Third, the lack of structured temporal context has hindered the development of effective models for event evolution, limiting users' access to event information. To address these challenges, we propose a foundation model for unsupervised Social Event Detection and Evolution, namely RagSEDE. Specifically, RagSEDE introduces a representativeness- and diversity-driven sampling strategy to extract key messages from massive social streams, significantly reducing noise and computational overhead. It further establishes a novel paradigm based on Retrieval Augmented Generation (RAG) that enhances PLMs in detecting events while simultaneously constructing and maintaining an evolving event knowledge base. Finally, RagSEDE leverages structural information theory to dynamically model event evolution keywords for the first time. Extensive experiments on two public datasets demonstrate the superiority of RagSEDE in open-world social event detection and evolution.

preprint · 2026 · arXiv

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation–evaluation–feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

preprint · 2026 · arXiv

Manifold-Constrained Adversarial Training for Long-Tailed Robustness via Geometric Alignment

Adversarial training is effective on balanced datasets, but its robustness degrades under long-tailed class distributions, where tail classes suffer high robust error and unstable decision boundaries. We propose Manifold-Constrained Adversarial Training (MCAT), a unified framework that enforces the semantic validity of adversarial examples by penalizing deviations from class-conditional manifolds in feature space, while promoting balanced geometric separation across classes via an ETF-inspired regularization. We provide theoretical results that link geometric separation to lower bounds on adversarially robust margins, and show that manifold-constrained adversarial risk upper-bounds robust risk on high-density semantic regions. Extensive experiments on standard long-tailed benchmarks demonstrate consistent improvements in overall, balanced, and tail-class adversarial robustness.

preprint · 2026 · arXiv

Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Best-of-N sampling is a powerful method for improving Large Language Model (LLM) performance, but it is often limited by its dependence on massive, text-based reward models. These models are not only computationally expensive but also data-hungry, requiring extensive labeled datasets for training. This creates a significant data challenge, as they overlook a rich, readily available data source: the LLM's own internal hidden states. To address this data and efficiency gap, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel and lightweight method that learns a reward function directly from the rich information embedded in LLM hidden states. Operating at the token embedding level, SWIFT employs simple linear layers to effectively distinguish between preferred and dispreferred generations, eliminating the need for computationally intensive text-based modeling. Extensive experiments on standard benchmarks show that SWIFT outperforms existing baselines (12.7% higher accuracy than EurusRM-7B on the MATH dataset) while using less than 0.005% of their parameters. Its robust scalability, compatibility with certain closed-source models via logit access, and ability to combine with traditional reward models for additional performance highlight SWIFT's practical value and contribution to more efficient data-driven LLM post-training. Our code is available at https://github.com/aster2024/SWIFT.
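The selection step described in this abstract, a linear scorer over hidden states used to pick the best of N candidates, can be sketched roughly as follows. This is a minimal illustration only: the function names, the plain mean over tokens, and the toy weights are assumptions made here and are not taken from the SWIFT code base, which may pool and weight tokens differently.

```python
from typing import List, Sequence

Vector = Sequence[float]


def linear_reward(hidden_states: List[Vector], weights: Vector, bias: float = 0.0) -> float:
    """Score one generation as the mean of a linear function of each token's hidden state.

    Mean pooling is an assumption of this sketch, not a claim about SWIFT itself.
    """
    if not hidden_states:
        raise ValueError("a generation needs at least one token")
    token_scores = [
        sum(w * h for w, h in zip(weights, state)) + bias
        for state in hidden_states
    ]
    return sum(token_scores) / len(token_scores)


def best_of_n(candidates: List[List[Vector]], weights: Vector, bias: float = 0.0) -> int:
    """Return the index of the candidate generation with the highest linear reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [linear_reward(c, weights, bias) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: three candidates with 2-dimensional "hidden states" (hypothetical values).
toy_candidates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0]],
    [[0.0, 3.0]],
]
toy_weights = [1.0, -1.0]
chosen = best_of_n(toy_candidates, toy_weights)
```

In practice the weights would be learned from preferred and dispreferred generations; here they are fixed by hand so that the selection logic is easy to follow.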

preprint · 2026 · arXiv

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
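To make the token reweighting idea in this abstract concrete, here is a small sketch that downweights the loss contribution of reasoning tokens relative to action tokens. The function names, the default weight of 0.1, and the normalization by the sum of weights are all assumptions for illustration; the energy-based redistribution step mentioned in the abstract is not modeled, and this is not the ActFocus implementation.

```python
from typing import List


def token_weights(is_action: List[bool], reasoning_weight: float = 0.1) -> List[float]:
    """Give action tokens weight 1.0 and reasoning tokens a smaller, fixed weight."""
    if reasoning_weight < 0.0:
        raise ValueError("reasoning_weight must be non-negative")
    return [1.0 if action else reasoning_weight for action in is_action]


def reweighted_loss(
    token_losses: List[float],
    is_action: List[bool],
    reasoning_weight: float = 0.1,
) -> float:
    """Weighted average of per-token losses, emphasizing action tokens."""
    if len(token_losses) != len(is_action):
        raise ValueError("token_losses and is_action must have the same length")
    weights = token_weights(is_action, reasoning_weight)
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all token weights are zero")
    return sum(w * l for w, l in zip(weights, token_losses)) / total_weight


# Toy usage: two reasoning tokens followed by one action token (hypothetical losses).
loss = reweighted_loss([1.0, 1.0, 4.0], [False, False, True], reasoning_weight=0.5)
```

With uniform weights the same losses would average to 2.0; the reweighted value is pulled toward the action token's loss.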

preprint · 2026 · arXiv

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.

preprint · 2026 · arXiv

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

Reinforcement fine-tuning (RFT) has become a core paradigm for post-training large language models, yet its training process remains highly fragile. Existing efforts mainly improve reliability at the system level or address specific issues in individual subproblems by modifying RFT algorithms. Despite their effectiveness, they largely overlook the problem of failure management at the training-process level. When training goes wrong, practitioners still rely heavily on expert-driven manual inspection and correction, and automatic failure management for RFT remains largely unexplored. In this paper, we take a first step toward systematic failure management for reinforcement fine-tuning. To understand the empirical structure of RFT failures, we first construct RFT-FaultBench, the first benchmark for fine-grained failures in reinforcement fine-tuning, covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records. Based on this benchmark, we conduct a comprehensive empirical study showing that RFT failures are both observable from training dynamics and distinguishable through their empirical fault fingerprints. Building on these findings, we propose RFT-FM, an automatic failure management framework for reinforcement fine-tuning that unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop. Experimental results show that RFT-FaultBench is neither trivial nor saturated: it exhibits clear anomaly structure while still posing substantial challenges, especially under subtle fault settings. Moreover, RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures.

preprint · 2025 · arXiv

Enhancing Temporal Awareness in LLMs for Temporal Point Processes

Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at https://github.com/chenlilil/TPP-TAL

preprint · 2024 · arXiv

Contrastive Sequential Interaction Network Learning on Co-Evolving Riemannian Spaces

Sequential interaction networks appear in a variety of applications, e.g., recommender systems. Here, inferring future interactions is of fundamental importance, and previous efforts have mainly focused on the dynamics in the classic zero-curvature Euclidean space. Despite the promising results achieved by previous methods, a range of significant issues remain largely open: Regarding the bipartite nature, is it appropriate to place user and item nodes in one identical space regardless of their inherent difference? Regarding the network dynamics, instead of a fixed-curvature space, will the representation spaces evolve when new interactions arrive continuously? Regarding the learning paradigm, can we get rid of label information that is costly to acquire? To address the aforementioned issues, we propose a novel Contrastive model for Sequential Interaction Network learning on Co-Evolving RiEmannian spaces, CSINCERE. To the best of our knowledge, we are the first to introduce a pair of co-evolving representation spaces, rather than a single or static space, and to propose co-contrastive learning for the sequential interaction network. In CSINCERE, we formulate a Cross-Space Aggregation for message passing across representation spaces of different Riemannian geometries, and design a Neural Curvature Estimator based on Ricci curvatures for modeling the evolution of the spaces over time. Thereafter, we present a Reweighed Co-Contrast between the temporal views of the sequential network, so that the pair of Riemannian spaces interact with each other for interaction prediction without labels. Empirical results on 5 public datasets show the superiority of CSINCERE over the state-of-the-art methods.

preprint · 2023 · arXiv

A Survey on Evaluation of Large Language Models

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

preprint · 2023 · arXiv

Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools to improve the quality and productivity of computer programming. Currently, there is already a thriving research community focusing on code intelligence, with efforts spanning software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, from the aspects of code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we inspect the existing code intelligence models on the basis of code representation learning, and provide a comprehensive overview to enhance comprehension of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models (https://xcodemind.github.io). Finally, we also point out several challenging and promising directions for future research.

preprint · 2023 · arXiv

TI-CNN: Convolutional Neural Networks for Fake News Detection

With the development of social networks, fake news serving various commercial and political purposes has appeared in large numbers and spread widely in the online world. Deceptive wording makes it easy for people to be taken in by fake news and to share it without any fact-checking. For instance, during the 2016 US presidential election, various kinds of fake news about the candidates spread widely through both official news media and online social networks. Such fake news is usually released either to smear opponents or to support the candidate on their side. The erroneous information in fake news is usually written to stir up voters' irrational emotions and enthusiasm. Fake news of this kind can sometimes have devastating effects, and an important goal in improving the credibility of online social networks is to identify fake news in a timely manner. In this paper, we study the fake news detection problem. Automatic fake news identification is extremely hard, since purely model-based fact-checking for news is still an open problem, and few existing models can be applied to solve it. Through a thorough investigation of a fake news dataset, many useful explicit features are identified from both the text and the images used in fake news. Besides the explicit features, there also exist hidden patterns in the words and images used in fake news, which can be captured with a set of latent features extracted via the multiple convolutional layers in our model. A model named TI-CNN (Text and Image information based Convolutional Neural Network) is proposed in this paper. By projecting the explicit and latent features into a unified feature space, TI-CNN is trained with both the text and image information simultaneously. Extensive experiments carried out on real-world fake news datasets have demonstrated the effectiveness of TI-CNN.

preprint · 2023 · arXiv

Time-aware Hyperbolic Graph Attention Network for Session-based Recommendation

Session-based Recommendation (SBR) aims to predict the items users will be interested in next based on their previous browsing sessions. Existing methods model sessions as graphs or sequences to estimate user interests based on their interacted items to make recommendations. In recent years, graph-based methods have achieved outstanding performance on SBR. However, none of these methods consider temporal information, which is a crucial feature in SBR as it indicates timeliness or currency. Besides, session graphs exhibit a hierarchical structure and have been shown to be well suited to hyperbolic geometry. However, few papers design models in hyperbolic spaces, and this direction is still under exploration. In this paper, we propose Time-aware Hyperbolic Graph Attention Network (TA-HGAT) - a novel hyperbolic graph neural network framework to build a session-based recommendation model considering temporal information. More specifically, there are three components in TA-HGAT. First, a hyperbolic projection module transforms the item features into hyperbolic space. Second, the time-aware graph attention module models time intervals between items and the users' current interests. Third, an evolutionary loss at the end of the model provides an accurate prediction of the recommended item based on the given timestamp. TA-HGAT is built in a hyperbolic space to learn the hierarchical structure of session graphs. Experimental results show that the proposed TA-HGAT has the best performance compared to ten baseline models on two real-world datasets.

preprint · 2022 · arXiv

A Generic Algorithm for Top-K On-Shelf Utility Mining

On-shelf utility mining (OSUM) is an emerging research direction in data mining. It aims to discover itemsets that have high relative utility in their selling time period. Compared with traditional utility mining, OSUM can find more practical and meaningful patterns in real-life applications. However, there is a major drawback to traditional OSUM. For normal users, it is hard to define a minimum threshold minutil for mining the right amount of on-shelf high utility itemsets. On one hand, if the threshold is set too high, the number of patterns would not be enough. On the other hand, if the threshold is set too low, too many patterns will be discovered and cause an unnecessary waste of time and memory consumption. To address this issue, the user usually directly specifies a parameter k, where only the top-k high relative utility itemsets would be considered. Therefore, in this paper, we propose a generic algorithm named TOIT for mining Top-k On-shelf hIgh-utility paTterns to solve this problem. TOIT applies a novel strategy to raise the minutil based on the on-shelf datasets. Besides, two novel upper-bound strategies named subtree utility and local utility are applied to prune the search space. By adopting the strategies mentioned above, the TOIT algorithm can narrow the search space as early as possible, improve the mining efficiency, and reduce the memory consumption, so it can obtain better performance than other algorithms. A series of experiments have been conducted on real datasets with different styles to compare the effects with the state-of-the-art KOSHU algorithm. The experimental results showed that TOIT outperforms KOSHU in both running time and memory consumption.
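The general idea of replacing a hand-picked minutil with a top-k parameter can be illustrated with a threshold that rises as better patterns are found. The sketch below uses a size-k min-heap, which is a common generic technique; it is not the TOIT algorithm, it omits the subtree utility and local utility upper bounds named in the abstract, and the function name and toy utilities are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k_with_rising_threshold(
    candidates: Iterable[Tuple[str, float]], k: int
) -> Tuple[List[Tuple[str, float]], float]:
    """Keep the k highest-utility itemsets and report the final internal threshold.

    The threshold (minutil) starts at 0 and is raised to the k-th best utility
    seen so far, so weaker candidates can be skipped without a user-set cutoff.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    heap: List[Tuple[float, str]] = []  # min-heap ordered by utility
    minutil = 0.0
    for itemset, utility in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (utility, itemset))
        elif utility > heap[0][0]:
            # Replace the current weakest pattern with the better one.
            heapq.heapreplace(heap, (utility, itemset))
        if len(heap) == k:
            minutil = heap[0][0]  # threshold rises to the k-th best utility
    ranked = sorted(heap, reverse=True)
    return [(itemset, utility) for utility, itemset in ranked], minutil


# Toy usage with made-up relative utilities.
patterns, final_minutil = top_k_with_rising_threshold(
    [("a", 5.0), ("b", 9.0), ("ab", 7.0), ("c", 2.0), ("bc", 8.0)], k=2
)
```

In a real miner the rising threshold would also be used to prune the search space before candidates are fully evaluated; here it only filters an already enumerated list.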

preprint · 2022 · arXiv

A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference

Natural Language Inference (NLI) is an increasingly essential task in natural language understanding, which requires inferring the relationship between the two sentences of a pair (premise and hypothesis). Recently, low-resource natural language inference has gained increasing attention, due to significant savings in manual annotation costs and a better fit with real-world scenarios. Existing works fail to characterize discriminative representations between different classes with limited training data, which may cause errors in label prediction. Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference. MultiSCL leverages a sentence-level and pair-level contrastive learning objective to discriminate between different classes of sentence pairs by bringing those in one class together and pushing away those in different classes. MultiSCL adopts a data augmentation module that generates different views for input samples to better learn the latent representation. The pair-level representation is obtained from a cross attention module. We conduct extensive experiments on two public NLI datasets in low-resource settings, and the accuracy of MultiSCL exceeds other models by 3.1% on average. Moreover, our method outperforms the previous state-of-the-art method on cross-domain tasks of text classification.

preprint · 2022 · arXiv

A Self-supervised Riemannian GNN with Time Varying Curvature for Temporal Graph Learning

Representation learning on temporal graphs has drawn considerable research attention owing to its fundamental importance in a wide spectrum of real-world applications. Though a number of studies succeed in obtaining time-dependent representations, it still faces significant challenges. On the one hand, most of the existing methods restrict the embedding space with a certain curvature. However, the underlying geometry in fact shifts among the positive curvature hyperspherical, zero curvature Euclidean and negative curvature hyperbolic spaces in the evolvement over time. On the other hand, these methods usually require abundant labels to learn temporal representations, and thereby notably limit their wide use in the unlabeled graphs of the real applications. To bridge this gap, we make the first attempt to study the problem of self-supervised temporal graph representation learning in the general Riemannian space, supporting the time-varying curvature to shift among hyperspherical, Euclidean and hyperbolic spaces. In this paper, we present a novel self-supervised Riemannian graph neural network (SelfRGNN). Specifically, we design a curvature-varying Riemannian GNN with a theoretically grounded time encoding, and formulate a functional curvature over time to model the evolvement shifting among the positive, zero and negative curvature spaces. To enable the self-supervised learning, we propose a novel reweighting self-contrastive approach, exploring the Riemannian space itself without augmentation, and propose an edge-based self-supervised curvature learning with the Ricci curvature. Extensive experiments show the superiority of SelfRGNN, and moreover, the case study shows the time-varying curvature of temporal graph in reality.

preprint · 2022 · arXiv

Are Pretrained Transformers Robust in Intent Classification? A Missing Ingredient in Evaluation of Out-of-Scope Intent Detection

Pre-trained Transformer-based models were reported to be robust in intent classification. In this work, we first point out the importance of in-domain out-of-scope detection in few-shot intent recognition tasks and then illustrate the vulnerability of pre-trained Transformer-based models against samples that are in-domain but out-of-scope (ID-OOS). We construct two new datasets, and empirically show that pre-trained models do not perform well on either ID-OOS examples or general out-of-scope examples, especially on fine-grained few-shot intent detection tasks. To figure out how the models mistakenly classify ID-OOS intents as in-scope intents, we further conduct analysis on confidence scores and the overlapping keywords, as well as point out several prospective directions for future work. Resources are available at https://github.com/jianguoz/Few-Shot-Intent-Detection.

preprint2022arXiv

Attend, Memorize and Generate: Towards Faithful Table-to-Text Generation in Few Shots

Few-shot table-to-text generation is a task of composing fluent and faithful sentences to convey table content using limited data. Despite many efforts having been made towards generating impressive fluent sentences by fine-tuning powerful pre-trained language models, the faithfulness of generated content still needs to be improved. To this end, this paper proposes a novel approach Attend, Memorize and Generate (called AMG), inspired by the text generation process of humans. In particular, AMG (1) attends over the multi-granularity of context using a novel strategy based on table slot level and traditional token-by-token level attention to exploit both the table structure and natural linguistic information; (2) dynamically memorizes the table slot allocation states; and (3) generates faithful sentences according to both the context and memory allocation states. Comprehensive experiments with human evaluation on three domains (i.e., humans, songs, and books) of the Wiki dataset show that our model can generate higher qualified texts when compared with several state-of-the-art baselines, in both fluency and faithfulness.

preprint2022arXiv

Automating DBSCAN via Deep Reinforcement Learning

DBSCAN is widely used in many scientific and engineering fields because of its simplicity and practicality. However, because its parameters are highly sensitive, the accuracy of the clustering result depends heavily on practical experience. In this paper, we first propose a novel Deep Reinforcement Learning guided automatic DBSCAN parameters search framework, namely DRL-DBSCAN. The framework models the process of adjusting the parameter search direction by perceiving the clustering environment as a Markov decision process, which aims to find the best clustering parameters without manual assistance. DRL-DBSCAN learns the optimal clustering parameter search policy for different feature distributions via interacting with the clusters, using a weakly-supervised reward training policy network. In addition, we also present a recursive search mechanism driven by the scale of the data to efficiently and controllably process large parameter spaces. Extensive experiments are conducted on five artificial and real-world datasets based on the proposed four working modes. The results of offline and online tasks show that DRL-DBSCAN not only consistently improves DBSCAN clustering accuracy by up to 26% and 25% respectively, but also can stably find the dominant parameters with high computational efficiency. The code is available at https://github.com/RingBDStack/DRL-DBSCAN.

preprint2022arXiv

CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking

The explosion of misinformation spreading in the media ecosystem urges for automated fact-checking. While misinformation spans both geographic and linguistic boundaries, most work in the field has focused on English. Datasets and tools available in other languages, such as Chinese, are limited. In order to bridge this gap, we construct CHEF, the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims. The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet. Further, we develop established baselines and a novel approach that is able to model the evidence retrieval as a latent variable, allowing jointly training with the veracity prediction model in an end-to-end fashion. Extensive experiments show that CHEF will provide a challenging testbed for the development of fact-checking systems designed to retrieve and reason over non-English claims.

preprint2022arXiv

Collaborative Knowledge Graph Fusion by Exploiting the Open Corpus

To alleviate the challenges of building Knowledge Graphs (KG) from scratch, a more general task is to enrich a KG using triples from an open corpus, where the obtained triples contain noisy entities and relations. It is challenging to enrich a KG with newly harvested triples while maintaining the quality of the knowledge representation. This paper proposes a system to refine a KG using information harvested from an additional corpus. To this end, we formulate our task as two coupled sub-tasks, namely join event extraction (JEE) and knowledge graph fusion (KGF). We then propose a Collaborative Knowledge Graph Fusion Framework to allow our sub-tasks to mutually assist one another in an alternating manner. More concretely, the explorer carries out the JEE supervised by both the ground-truth annotation and an existing KG provided by the supervisor. The supervisor then evaluates the triples extracted by the explorer and enriches the KG with those that are highly ranked. To implement this evaluation, we further propose a Translated Relation Alignment Scoring Mechanism to align and translate the extracted triples to the prior KG. Experiments verify that this collaboration can both improve the performance of the JEE and the KGF.

preprint2022arXiv

Deep learning for drug repurposing: methods, databases, and applications

Drug development is time-consuming and expensive. Repurposing existing drugs for new therapies is an attractive solution that accelerates drug development at reduced experimental costs, specifically for Coronavirus Disease 2019 (COVID-19), an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, comprehensively obtaining and productively integrating available knowledge and big biomedical data to effectively advance deep learning models is still challenging for drug repurposing in other complex diseases. In this review, we introduce guidelines on how to utilize deep learning methodologies and tools for drug repurposing. We first summarize the commonly used bioinformatics and pharmacogenomics databases for drug repurposing. Next, we discuss recently developed sequence-based and graph-based representation approaches as well as state-of-the-art deep learning-based methods. Finally, we present applications of drug repurposing to fight the COVID-19 pandemic, and outline its future challenges.

preprint2022arXiv

Dual Space Graph Contrastive Learning

Unsupervised graph representation learning has emerged as a powerful tool to address real-world problems and has achieved huge success in the graph learning domain. Graph contrastive learning is one of the unsupervised graph representation learning methods; it has recently attracted attention from researchers and has achieved state-of-the-art performance on various tasks. The key to the success of graph contrastive learning is to construct proper contrasting pairs to acquire the underlying structural semantics of the graph. However, this key part is not yet fully explored: most ways of generating contrasting pairs focus on augmenting or perturbing graph structures to obtain different views of the input graph. Such strategies can degrade performance by adding noise to the graph, which may narrow the range of applications of graph contrastive learning. In this paper, we propose a novel graph contrastive learning method, namely Dual Space Graph Contrastive (DSGC) Learning, to conduct graph contrastive learning among views generated in different spaces, including the hyperbolic space and the Euclidean space. Since both spaces have their own advantages for representing graph data in the embedding spaces, we hope to utilize graph contrastive learning to bridge the spaces and leverage advantages from both sides. The comparison experiment results show that DSGC achieves competitive or better performance across all the datasets. In addition, we conduct extensive experiments to analyze the impact of different graph encoders on DSGC, giving insights about how to better leverage the advantages of contrastive learning between different spaces.

preprint2022arXiv

Evidential Temporal-aware Graph-based Social Event Detection via Dempster-Shafer Theory

The rising popularity of online social network services has attracted lots of research on mining social media data, especially on mining social events. Social event detection, due to its wide applications, has now become a non-trivial task. State-of-the-art approaches exploiting Graph Neural Networks (GNNs) usually follow a two-step strategy: 1) constructing text graphs based on various views (co-user, co-entities and co-hashtags); and 2) learning a unified text representation by a specific GNN model. Generally, the results heavily rely on the quality of the constructed graphs and the specific message passing scheme. However, existing methods have deficiencies in both aspects: 1) they fail to recognize the noisy information induced by unreliable views; 2) temporal information, which works as a vital indicator of events, is neglected in most works. To this end, we propose ETGNN, a novel Evidential Temporal-aware Graph Neural Network. Specifically, we construct view-specific graphs whose nodes are the texts and whose edges are determined by several types of shared elements respectively. To incorporate temporal information into the message passing scheme, we introduce a novel temporal-aware aggregator which assigns weights to neighbours according to an adaptive time exponential decay formula. Considering the view-specific uncertainty, the representations of all views are converted into mass functions through evidential deep learning (EDL) neural networks, and further combined via Dempster-Shafer theory (DST) to make the final detection. Experimental results on three real-world datasets demonstrate the effectiveness of ETGNN in accuracy, reliability and robustness in social event detection.
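
The combination of view-specific mass functions mentioned in this abstract can be illustrated with the textbook form of Dempster's rule. The following is a minimal sketch under stated assumptions, not ETGNN's implementation: the function name `dempster_combine`, the dictionary representation of mass functions, and the example masses are all hypothetical and chosen only for illustration.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping a frozenset of hypotheses
    (a focal element) to its mass; the masses are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the intersection.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to disjoint sets counts as conflict.
            conflict += mass_a * mass_b
    if 1.0 - conflict <= 1e-12:
        raise ValueError("Sources are in total conflict; the rule is undefined.")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical example: two views giving evidence about events "a" and "b".
A, B = frozenset({"a"}), frozenset({"b"})
AB = A | B  # the whole frame, i.e. "uncertain"
view1 = {A: 0.6, AB: 0.4}
view2 = {B: 0.5, AB: 0.5}
fused = dempster_combine(view1, view2)
```

In this sketch, mass left on the whole frame `AB` plays the role of a view's uncertainty, so a view that is unsure contributes less to the fused result.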

preprint2022arXiv

From Known to Unknown: Quality-aware Self-improving Graph Neural Network for Open Set Social Event Detection

State-of-the-art Graph Neural Networks (GNNs) have achieved tremendous success in social event detection tasks when restricted to a closed set of events. However, considering the large amount of data needed for training a neural network and the limited ability of a neural network in handling previously unknown data, it remains a challenge for existing GNN-based methods to operate in an open set setting. To address this problem, we design a Quality-aware Self-improving Graph Neural Network (QSGNN) which extends the knowledge from known to unknown by leveraging the best of known samples and reliable knowledge transfer. Specifically, to fully exploit the labeled data, we propose a novel supervised pairwise loss with an additional orthogonal inter-class relation constraint to train the backbone GNN encoder. The learnt, already-known events further serve as strong reference bases for the unknown ones, which greatly prompts knowledge acquisition and transfer. When the model is generalized to unknown data, to ensure the effectiveness and reliability, we further leverage the reference similarity distribution vectors for pseudo pairwise label generation, selection and quality assessment. Following the diversity principle of active learning, our method selects diverse pair samples with the generated pseudo labels to fine-tune the GNN encoder. Besides, we propose a novel quality-guided optimization in which the contributions of pseudo labels are weighted based on consistency. We thoroughly evaluate our model on two large real-world social event datasets. Experiments demonstrate that our model achieves state-of-the-art results and extends well to unknown events.

preprint2022arXiv

G$^3$SR: Global Graph Guided Session-based Recommendation

Session-based recommendation tries to make use of anonymous session data to deliver high-quality recommendation under the condition that user-profiles and the complete historical behavioral data of a target user are unavailable. Previous works consider each session individually and try to capture user interests within a session. Despite their encouraging results, these models can only perceive intra-session items and cannot draw upon the massive historical relational information. To solve this problem, we propose a novel method named G$^3$SR (Global Graph Guided Session-based Recommendation). G$^3$SR decomposes the session-based recommendation workflow into two steps. First, a global graph is built upon all session data, from which the global item representations are learned in an unsupervised manner. Then, these representations are refined on session graphs under the graph networks, and a readout function is used to generate session representations for each session. Extensive experiments on two real-world benchmark datasets show remarkable and consistent improvements of the G$^3$SR method over the state-of-the-art methods, especially for cold items.

preprint2022arXiv

Generalizing to Unseen Domains: A Survey on Domain Generalization

Machine learning systems generally assume that the training and testing distributions are the same. To this end, a key requirement is to develop models that can generalize to unseen distributions. Domain generalization (DG), i.e., out-of-distribution generalization, has attracted increasing interests in recent years. Domain generalization deals with a challenging setting where one or several different but related domain(s) are given, and the goal is to learn a model that can generalize to an unseen test domain. Great progress has been made in the area of domain generalization for years. This paper presents the first review of recent advances in this area. First, we provide a formal definition of domain generalization and discuss several related fields. We then thoroughly review the theories related to domain generalization and carefully analyze the theory behind generalization. We categorize recent algorithms into three classes: data manipulation, representation learning, and learning strategy, and present several popular algorithms in detail for each category. Third, we introduce the commonly used datasets, applications, and our open-sourced codebase for fair evaluation. Finally, we summarize existing literature and present some potential research topics for the future.

preprint2022arXiv

Graph Masked Autoencoders with Transformers

Recently, transformers have shown promising performance in learning graph representations. However, there are still some challenges when applying transformers to real-world scenarios due to the fact that deep transformers are hard to train from scratch and the quadratic memory consumption w.r.t. the number of nodes. In this paper, we propose Graph Masked Autoencoders (GMAEs), a self-supervised transformer-based model for learning graph representations. To address the above two challenges, we adopt the masking mechanism and the asymmetric encoder-decoder design. Specifically, GMAE takes partially masked graphs as input, and reconstructs the features of the masked nodes. The encoder and decoder are asymmetric, where the encoder is a deep transformer and the decoder is a shallow transformer. The masking mechanism and the asymmetric design make GMAE a memory-efficient model compared with conventional transformers. We show that, when serving as a conventional self-supervised graph representation model, GMAE achieves state-of-the-art performance on both the graph classification task and the node classification task under common downstream evaluation protocols. We also show that, compared with training in an end-to-end manner from scratch, we can achieve comparable performance after pre-training and fine-tuning using GMAE while simplifying the training process.

preprint2022arXiv

Graph Self-Supervised Learning: A Survey

Deep learning on graphs has attracted significant interests recently. However, most of the works have focused on (semi-) supervised learning, resulting in shortcomings including heavy label reliance, poor generalization, and weak robustness. To address these issues, self-supervised learning (SSL), which extracts informative knowledge through well-designed pretext tasks without relying on manual labels, has become a promising and trending learning paradigm for graph data. Different from SSL on other domains like computer vision and natural language processing, SSL on graphs has an exclusive background, design ideas, and taxonomies. Under the umbrella of graph self-supervised learning, we present a timely and comprehensive review of the existing approaches which employ SSL techniques for graph data. We construct a unified framework that mathematically formalizes the paradigm of graph SSL. According to the objectives of pretext tasks, we divide these approaches into four categories: generation-based, auxiliary property-based, contrast-based, and hybrid approaches. We further describe the applications of graph SSL across various research fields and summarize the commonly used datasets, evaluation benchmark, performance comparison and open-source codes of graph SSL. Finally, we discuss the remaining challenges and potential future directions in this research field.

preprint2022arXiv

HUSP-SP: Faster Utility Mining on Sequence Data

High-utility sequential pattern mining (HUSPM) has emerged as an important topic due to its wide application and considerable popularity. However, due to the combinatorial explosion of the search space when the HUSPM problem encounters a low utility threshold or large-scale data, it may be time-consuming and memory-costly to address the HUSPM problem. Several algorithms have been proposed for addressing this problem, but they still cost a lot in terms of running time and memory usage. In this paper, to further solve this problem efficiently, we design a compact structure called sequence projection (seqPro) and propose an efficient algorithm, namely discovering high-utility sequential patterns with the seqPro structure (HUSP-SP). HUSP-SP utilizes the compact seq-array to store the necessary information in a sequence database. The seqPro structure is designed to efficiently calculate candidate patterns' utilities and upper bound values. Furthermore, a new upper bound on utility, namely tighter reduced sequence utility (TRSU) and two pruning strategies in search space, are utilized to improve the mining performance of HUSP-SP. Experimental results on both synthetic and real-life datasets show that HUSP-SP can significantly outperform the state-of-the-art algorithms in terms of running time, memory usage, search space pruning efficiency, and scalability.

preprint2022arXiv

Improving Contrastive Learning with Model Augmentation

The sequential recommendation aims at predicting the next items in user behaviors, which can be solved by characterizing item relationships in sequences. Due to the data sparsity and noise issues in sequences, a new self-supervised learning (SSL) paradigm is proposed to improve the performance, which employs contrastive learning between positive and negative views of sequences. However, existing methods all construct views by adopting augmentation from data perspectives, while we argue that 1) optimal data augmentation methods are hard to devise, 2) data augmentation methods destroy sequential correlations, and 3) data augmentation fails to incorporate comprehensive self-supervised signals. Therefore, we investigate the possibility of model augmentation to construct view pairs. We propose three levels of model augmentation methods: neuron masking, layer dropping, and encoder complementing. This work opens up a novel direction in constructing views for contrastive SSL. Experiments verify the efficacy of model augmentation for the SSL in the sequential recommendation. Code is available at https://github.com/salesforce/SRMA.

preprint2022arXiv

Large-scale Personalized Video Game Recommendation via Social-aware Contextualized Graph Neural Network

Because of the large number of online games available nowadays, online game recommender systems are necessary for users and online game platforms. The former can discover more potential online games of their interests, and the latter can attract users to dwell longer in the platform. This paper investigates the characteristics of user behaviors with respect to the online games on the Steam platform. Based on the observations, we argue that a satisfying recommender system for online games is able to characterize: personalization, game contextualization and social connection. However, simultaneously solving all is rather challenging for game recommendation. Firstly, personalization for game recommendation requires the incorporation of the dwelling time of engaged games, which are ignored in existing methods. Secondly, game contextualization should reflect the complex and high-order properties of those relations. Last but not least, it is problematic to use social connections directly for game recommendations due to the massive noise within social connections. To this end, we propose a Social-aware Contextualized Graph Neural Recommender System (SCGRec), which harnesses three perspectives to improve game recommendation. We conduct a comprehensive analysis of users' online game behaviors, which motivates the necessity of handling those three characteristics in the online game recommendation.

preprint2022arXiv

Learning from Atypical Behavior: Temporary Interest Aware Recommendation Based on Reinforcement Learning

Traditional robust recommendation methods view atypical user-item interactions as noise and aim to reduce their impact with some kind of noise filtering technique, which often suffers from two challenges. First, in the real world, atypical interactions may signal users' temporary interest, which differs from their general preference. Therefore, simply filtering out the atypical interactions as noise may be inappropriate and degrade the personalization of recommendations. Second, it is hard to acquire the temporary interest since there are no explicit supervision signals to indicate whether an interaction is atypical or not. To address these challenges, we propose a novel model called Temporary Interest Aware Recommendation (TIARec), which can distinguish atypical interactions from normal ones without supervision and capture the temporary interest as well as the general preference of users. In particular, we propose a reinforcement learning framework containing a recommender agent and an auxiliary classifier agent, which are jointly trained with the objective of maximizing the cumulative return of the recommendations made by the recommender agent. During the joint training process, the classifier agent can judge whether the interaction with an item recommended by the recommender agent is atypical, and the knowledge about learning temporary interest from atypical interactions can be transferred to the recommender agent, which enables the recommender agent to make recommendations on its own that balance the general preference and temporary interest of users. Finally, the experiments conducted on real-world datasets verify the effectiveness of TIARec.

preprint2022arXiv

Link Prediction with Contextualized Self-Supervision

Link prediction aims to infer the link existence between pairs of nodes in networks/graphs. Despite their wide application, the success of traditional link prediction algorithms is hindered by three major challenges -- link sparsity, node attribute noise and dynamic changes -- that are faced by many real-world networks. To address these challenges, we propose a Contextualized Self-Supervised Learning (CSSL) framework that fully exploits structural context prediction for link prediction. The proposed CSSL framework learns a link encoder to infer the link existence probability from paired node embeddings, which are constructed via a transformation on node attributes. To generate informative node embeddings for link prediction, structural context prediction is leveraged as a self-supervised learning task to boost the link prediction performance. Two types of structural context are investigated, i.e., context nodes collected from random walks vs. context subgraphs. The CSSL framework can be trained in an end-to-end manner, with the learning of model parameters supervised by both the link prediction and self-supervised learning tasks. The proposed CSSL is a generic and flexible framework in the sense that it can handle both attributed and non-attributed networks, and operate under both transductive and inductive link prediction settings. Extensive experiments and ablation studies on seven real-world benchmark networks demonstrate the superior performance of the proposed self-supervision based link prediction algorithm over state-of-the-art baselines, on different types of networks under both transductive and inductive settings. The proposed CSSL also yields competitive performance in terms of its robustness to node attribute noise and scalability over large-scale networks.

preprint2022arXiv

Membership Inference Attacks on Knowledge Graphs

Membership inference attacks (MIAs) infer whether a specific data record is used for target model training. MIAs have provoked many discussions in the information security community since they give rise to severe data privacy issues, especially for private and sensitive datasets. Knowledge Graphs (KGs), which describe domain-specific subjects and relationships among them, are valuable and sensitive, such as medical KGs constructed from electronic health records. However, the privacy threat to knowledge graphs is critical but rarely explored. In this paper, we conduct the first empirical evaluation of privacy threats to knowledge graphs triggered by knowledge graph embedding methods (KGEs). We propose three types of membership inference attacks: transfer attacks (TAs), prediction loss-based attacks (PLAs), and prediction correctness-based attacks (PCAs), according to attack difficulty levels. In the experiments, we conduct three inference attacks against four standard KGE methods over three benchmark datasets. In addition, we also propose the attacks against medical KG and financial KG. The results demonstrate that the proposed attack methods can easily explore the privacy leakage of knowledge graphs.
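
The idea behind a prediction loss-based attack can be sketched with a generic loss-thresholding rule: a record on which the target model has unusually low loss is guessed to be a training member. This is a minimal sketch under stated assumptions, not the attack evaluated in the paper: the names `cross_entropy` and `loss_threshold_attack`, the fixed threshold, and the example probabilities are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss of one prediction; probs is a list of class probabilities."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def loss_threshold_attack(records, threshold):
    """Guess membership for each (probs, label) record.

    Returns True (guessed member) when the target model's loss on the
    record is below the threshold, and False otherwise.
    """
    return [cross_entropy(probs, label) < threshold for probs, label in records]


# Hypothetical target-model outputs for two records with true label 0.
records = [
    ([0.9, 0.1], 0),  # confident and correct: low loss
    ([0.4, 0.6], 0),  # wrong prediction: higher loss
]
guesses = loss_threshold_attack(records, threshold=0.5)
```

In practice the threshold would have to be calibrated, for example on data the attacker knows to be outside the training set; the fixed value here is only for illustration.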

preprint2022arXiv

Membership Inference Attacks on Machine Learning: A Survey

Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. MIAs on ML models can directly lead to a privacy breach. For example, by identifying the fact that a clinical record has been used to train a model associated with a certain disease, an attacker can infer with high probability that the owner of the clinical record has the disease. In recent years, MIAs have been shown to be effective on various ML models, e.g., classification models and generative models. Meanwhile, many defense methods have been proposed to mitigate MIAs. Although MIAs on ML models form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this paper, we conduct the first comprehensive survey on membership inference attacks and defenses. We provide the taxonomies for both attacks and defenses, based on their characterizations, and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain. To further help the researchers, we have created an online resource repository, which we will keep updated with future relevant work. Interested readers can find the repository at https://github.com/HongshengHu/membership-inference-machine-learning-literature.

preprint2022arXiv

Multi-Sparse-Domain Collaborative Recommendation via Enhanced Comprehensive Aspect Preference Learning

Cross-domain recommendation (CDR) has been attracting increasing attention of researchers for its ability to alleviate the data sparsity problem in recommender systems. However, the existing single-target or dual-target CDR methods often suffer from two drawbacks, the assumption of at least one rich domain and the heavy dependence on domain-invariant preference, which are impractical in real world where sparsity is ubiquitous and might degrade the user preference learning. To overcome these issues, we propose a Multi-Sparse-Domain Collaborative Recommendation (MSDCR) model for multi-target cross-domain recommendation. Unlike traditional CDR methods, MSDCR treats the multiple relevant domains as all sparse and can simultaneously improve the recommendation performance in each domain. We propose a Multi-Domain Separation Network (MDSN) and a Gated Aspect Preference Enhancement (GAPE) module for MSDCR to enhance a user's domain-specific aspect preferences in a domain by transferring the complementary aspect preferences in other domains, during which the uniqueness of the domain-specific preference can be preserved through the adversarial training offered by MDSN and the complementarity can be adaptively determined by GAPE. Meanwhile, we propose a Multi-Domain Adaptation Network (MDAN) for MSDCR to capture a user's domain-invariant aspect preference. With the integration of the enhanced domain-specific aspect preference and the domain-invariant aspect preference, MSDCR can reach a comprehensive understanding of a user's preference in each sparse domain. At last, the extensive experiments conducted on real datasets demonstrate the remarkable superiority of MSDCR over the state-of-the-art single-domain recommendation models and CDR models.

preprint2022arXiv

Multifaceted Improvements for Conversational Open-Domain Question Answering

Open-domain question answering (OpenQA) is an important branch of textual QA which discovers answers for the given questions based on a large number of unstructured documents. Effectively mining correct answers from open-domain sources still has a fair way to go. Existing OpenQA systems might suffer from the issues of question complexity and ambiguity, as well as insufficient background knowledge. Recently, conversational OpenQA has been proposed to address these issues with the abundant contextual information in the conversation. Promising as it might be, there exist several fundamental limitations, including inaccurate question understanding, coarse ranking for passage selection, and inconsistent usage of the golden passage in the training and inference phases. To alleviate these limitations, in this paper, we propose a framework with Multifaceted Improvements for Conversational open-domain Question Answering (MICQA). Specifically, MICQA has three significant advantages. First, the proposed KL-divergence based regularization leads to better question understanding for retrieval and answer reading. Second, the added post-ranker module can push more relevant passages to the top placements, to be selected for the reader with two-aspect constraints. Third, the well-designed curriculum learning strategy effectively narrows the gap between the golden passage settings of training and inference, and encourages the reader to find the true answer without the golden passage's assistance. Extensive experiments conducted on the publicly available dataset OR-QuAC demonstrate the superiority of MICQA over the state-of-the-art model on the conversational OpenQA task.

preprint2022arXiv

PERFECT: A Hyperbolic Embedding for Joint User and Community Alignment

Social network alignment is of fundamental importance in a wide spectrum of applications. To the best of our knowledge, existing studies mainly focus on network alignment at the individual user level, requiring abundant common information between shared individual users. For networks that cannot meet such requirements, social community structures provide complementary and critical information at a slightly coarser-grained level, alignment of which will provide additional information for user alignment. In turn, user alignment also reveals more clues for community alignment. Hence, in this paper, we introduce the problem of joint social network alignment, which aims to align users and communities across social networks simultaneously. The key challenges are 1) how to learn the representations of both users and communities, and 2) how to make user alignment and community alignment benefit from each other. To address these challenges, we first elaborate on the characteristics of real-world networks with the notion of delta-hyperbolicity, and show the superiority of hyperbolic space for representing social networks. Then, we present a novel hyperbolic embedding approach for the joint social network alignment, referred to as PERFECT, in a unified optimization. Extensive experiments on real-world datasets show the superiority of PERFECT in both user alignment and community alignment.
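
The notion of delta-hyperbolicity referred to in this abstract can be made concrete with Gromov's four-point condition: for every four nodes, take the three pairwise distance sums and halve the gap between the two largest; delta is the maximum over all quadruples. The following is a minimal brute-force sketch, assuming a small connected, unweighted graph given as an adjacency dict; the function names and example graphs are hypothetical and are not taken from the paper.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path hop distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def gromov_delta(adj):
    """Four-point delta-hyperbolicity of a connected graph (brute force, O(n^4))."""
    d = {u: bfs_distances(adj, u) for u in adj}
    delta = 0.0
    for x, y, z, w in combinations(adj, 4):
        sums = sorted([
            d[x][y] + d[z][w],
            d[x][z] + d[y][w],
            d[x][w] + d[y][z],
        ])
        # Half the gap between the two largest distance sums.
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta


# Hypothetical examples: a star (a tree) and a 4-cycle.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

Trees have delta equal to zero, so a smaller delta indicates a more tree-like graph, which is the usual argument for embedding it in hyperbolic space. The brute-force loop is only practical for small graphs.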

preprint2022arXiv

Position-aware Structure Learning for Graph Topology-imbalance by Relieving Under-reaching and Over-squashing

Topology-imbalance is a graph-specific imbalance problem caused by the uneven topology positions of labeled nodes, which significantly damages the performance of GNNs. What topology-imbalance means and how to measure its impact on graph learning remain under-explored. In this paper, we provide a new understanding of topology-imbalance from a global view of the supervision information distribution in terms of under-reaching and over-squashing, which motivates two quantitative metrics as measurements. In light of our analysis, we propose a novel position-aware graph structure learning framework named PASTEL, which directly optimizes the information propagation path and solves the topology-imbalance issue in essence. Our key insight is to enhance the connectivity of nodes within the same class for more supervision information, thereby relieving the under-reaching and over-squashing phenomena. Specifically, we design an anchor-based position encoding mechanism, which better incorporates relative topology position and enhances the intra-class inductive bias by maximizing the label influence. We further propose a class-wise conflict measure as the edge weights, which benefits the separation of different node classes. Extensive experiments demonstrate the superior potential and adaptability of PASTEL in enhancing GNNs' power in different data annotation scenarios.

preprint2022arXiv

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning

The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical context, where the visual dynamics are believed to have modular structures that can be learned with compositional subsystems. This paper models these structures by presenting PredRNN, a new recurrent network, in which a pair of memory cells are explicitly decoupled, operate in nearly independent transition manners, and finally form unified representations of the complex environment. Concretely, besides the original memory cell of LSTM, this network is featured by a zigzag memory flow that propagates in both bottom-up and top-down directions across all layers, enabling the learned visual dynamics at different levels of RNNs to communicate. It also leverages a memory decoupling loss to keep the memory cells from learning redundant features. We further propose a new curriculum learning strategy to force PredRNN to learn long-term dynamics from context frames, which can be generalized to most sequence-to-sequence models. We provide detailed ablation studies to verify the effectiveness of each component. Our approach is shown to obtain highly competitive results on five datasets for both action-free and action-conditioned predictive learning scenarios.

preprint2022arXiv

Privacy and Robustness in Federated Learning: Attacks and Defenses

As data are increasingly being stored in different silos and societies become more aware of data privacy issues, the traditional centralized training of artificial intelligence (AI) models is facing efficiency and privacy challenges. Recently, federated learning (FL) has emerged as an alternative solution and continues to thrive in this new reality. Existing FL protocol design has been shown to be vulnerable to adversaries within or outside of the system, compromising data privacy and system robustness. Besides training powerful global models, it is of paramount importance to design FL systems that have privacy guarantees and are resistant to different types of adversaries. In this paper, we conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic. We highlight the intuitions, key techniques as well as fundamental assumptions adopted by various attacks and defenses. Finally, we discuss promising future research directions towards robust and privacy-preserving federated learning.

preprint2022arXiv

Self-organization Preserved Graph Structure Learning with Principle of Relevant Information

Most Graph Neural Networks follow the message-passing paradigm, assuming the observed structure depicts the ground-truth node relationships. However, this fundamental assumption cannot always be satisfied, as real-world graphs are often incomplete, noisy, or redundant. How to reveal the inherent graph structure in a unified way remains under-explored. We propose PRI-GSL, a Graph Structure Learning framework guided by the Principle of Relevant Information, providing a simple and unified framework for identifying the self-organization and revealing the hidden structure. PRI-GSL learns a structure that contains the most relevant yet least redundant information quantified by von Neumann entropy and Quantum Jensen-Shannon divergence. PRI-GSL incorporates the evolution of quantum continuous walk with graph wavelets to encode node structural roles, showing in which way the nodes interplay and self-organize with the graph structure. Extensive experiments demonstrate the superior effectiveness and robustness of PRI-GSL.

preprint2022arXiv

Sequential Recommendation via Stochastic Self-Attention

Sequential recommendation models the dynamics of a user's previous behaviors in order to forecast the next item, and has drawn a lot of attention. Transformer-based approaches, which embed items as vectors and use dot-product self-attention to measure the relationship between items, demonstrate superior capabilities among existing sequential methods. However, users' real-world sequential behaviors are uncertain rather than deterministic, posing a significant challenge to present techniques. We further suggest that dot-product-based approaches cannot fully capture collaborative transitivity, which can be derived in item-item transitions inside sequences and is beneficial for cold start items. We further argue that BPR loss has no constraint on positive and sampled negative items, which misleads the optimization. We propose a novel STOchastic Self-Attention (STOSA) to overcome these issues. STOSA, in particular, embeds each item as a stochastic Gaussian distribution, the covariance of which encodes the uncertainty. We devise a novel Wasserstein Self-Attention module to characterize item-item position-wise relationships in sequences, which effectively incorporates uncertainty into model training. Wasserstein attentions also enlighten the collaborative transitivity learning as it satisfies triangle inequality. Moreover, we introduce a novel regularization term to the ranking loss, which assures the dissimilarity between positive and the negative items. Extensive experiments on five real-world benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art baselines, especially on cold start items. The code is available at https://github.com/zfan20/STOSA.

preprint2022arXiv

TaSPM: Targeted Sequential Pattern Mining

Sequential pattern mining (SPM) is an important pattern mining technique with many real-world applications. Although many efficient sequential pattern mining algorithms have been proposed, few studies focus on target sequences. Targeted querying of sequential patterns can not only reduce the number of sequences generated by SPM, but also improve the efficiency of users in performing pattern analysis. The current algorithms available for targeted sequence querying are based on specific scenarios and cannot be generalized to other applications. In this paper, we formulate the problem of targeted sequential pattern mining and propose a generic framework, namely TaSPM, based on the fast CM-SPAM algorithm. Moreover, to improve the efficiency of TaSPM on large-scale datasets and multiple-items-based sequence datasets, we propose several pruning strategies to reduce meaningless operations in the mining process. In total, four pruning strategies are designed in TaSPM, so that it can terminate unnecessary pattern extensions quickly and achieve better performance. Finally, we conduct extensive experiments on different datasets to compare the existing SPM algorithms with TaSPM. Experiments show that the novel targeted mining algorithm TaSPM can achieve faster running time and less memory consumption.

preprint2022arXiv

Towards Revenue Maximization with Popular and Profitable Products

Economically, a common goal for companies conducting marketing is to maximize the return revenue/profit by utilizing various effective marketing strategies. Consumer behavior is crucially important in economy and targeted marketing, in which behavioral economics can provide valuable insights to identify the biases and profit from customers. Finding credible and reliable information on products' profitability is, however, quite difficult since most products tend to peak at certain times w.r.t. the seasonal sales cycle in a year. On-Shelf Availability (OSA) is a key factor for performance evaluation. Besides, staying ahead of hot product trends means we can increase marketing efforts without selling out the inventory. To fill this gap, in this paper, we first propose a general profit-oriented framework to address the problem of revenue maximization based on economic behavior, and compute the On-shelf Popular and most Profitable Products (OPPPs) for the targeted marketing. To tackle the revenue maximization problem, we model the k-satisfiable product concept and propose an algorithmic framework for searching OPPP and its variants. Extensive experiments are conducted on several real-world datasets to evaluate the effectiveness and efficiency of the proposed algorithm.

preprint2022arXiv

Towards Target Sequential Rules

In many real-world applications, sequential rule mining (SRM) can provide prediction and recommendation functions for a variety of services. It is an important technique of pattern mining to discover all valuable rules that belong to high-frequency and high-confidence sequential rules. Although several algorithms of SRM are proposed to solve various practical problems, there are no studies on target sequential rules. Targeted sequential rule mining aims at mining the interesting sequential rules that users focus on, thus avoiding the generation of other invalid and unnecessary rules. This approach can further improve the efficiency of users in analyzing rules and reduce the consumption of data resources. In this paper, we provide the relevant definitions of target sequential rule and formulate the problem of targeted sequential rule mining. Furthermore, we propose an efficient algorithm, called targeted sequential rule mining (TaSRM). Several pruning strategies and an optimization are introduced to improve the efficiency of TaSRM. Finally, a large number of experiments are conducted on different benchmarks, and we analyze the results in terms of their running time, memory consumption, and scalability, as well as query cases with different query rules. It is shown that the novel algorithm TaSRM and its variants can achieve better experimental performance compared to the existing baseline algorithm.

preprint2022arXiv

XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction

Temporal Expression Extraction (TEE) is essential for understanding time in natural language. It has applications in Natural Language Processing (NLP) tasks such as question answering, information retrieval, and causal inference. To date, work in this area has mostly focused on English as there is a scarcity of labeled data for other languages. We propose XLTime, a novel framework for multilingual TEE. XLTime works on top of pre-trained language models and leverages multi-task learning to prompt cross-language knowledge transfer both from English and within the non-English languages. XLTime alleviates problems caused by a shortage of data in the target language. We apply XLTime with different language models and show that it outperforms the previous automatic SOTA methods on French, Spanish, Portuguese, and Basque, by large margins. XLTime also closes the gap considerably on the handcrafted HeidelTime method.

preprint2021arXiv

A Comprehensive Survey on Community Detection with Deep Learning

A community reveals the features and connections of its members that are different from those in other communities in a network. Detecting communities is of great significance in network analysis. Despite the classical spectral clustering and statistical inference methods, we notice a significant development of deep learning techniques for community detection in recent years with their advantages in handling high dimensional network data. Hence, a comprehensive overview of community detection's latest progress through deep learning is timely to academics and practitioners. This survey devises and proposes a new taxonomy covering different state-of-the-art methods, including deep learning-based models upon deep neural networks, deep nonnegative matrix factorization and deep sparse filtering. The main category, i.e., deep neural networks, is further divided into convolutional networks, graph attention networks, generative adversarial networks and autoencoders. The survey also summarizes the popular benchmark data sets, evaluation metrics, and open-source implementations to address experimentation settings. We then discuss the practical applications of community detection in various domains and point to implementation scenarios. Finally, we outline future directions by suggesting challenging topics in this fast-growing deep learning field.

preprint2021arXiv

Dynamic Bicycle Dispatching of Dockless Public Bicycle-sharing Systems using Multi-objective Reinforcement Learning

As a new generation of Public Bicycle-sharing Systems (PBS), the dockless PBS (DL-PBS) is an important application of cyber-physical systems and intelligent transportation. How to use AI to provide efficient bicycle dispatching solutions based on dynamic bicycle rental demand is an essential issue for DL-PBS. In this paper, we propose a dynamic bicycle dispatching algorithm based on multi-objective reinforcement learning (MORL-BD) to provide the optimal bicycle dispatching solution for DL-PBS. We model the DL-PBS system from the perspective of CPS and use deep learning to predict the layout of bicycle parking spots and the dynamic demand of bicycle dispatching. We define the multi-route bicycle dispatching problem as a multi-objective optimization problem by considering the optimization objectives of dispatching costs, dispatch truck's initial load, workload balance among the trucks, and the dynamic balance of bicycle supply and demand. On this basis, the collaborative multi-route bicycle dispatching problem among multiple dispatch trucks is modeled as a multi-agent MORL model. All dispatch paths between parking spots are defined as state spaces, and the reciprocal of dispatching costs is defined as a reward. Each dispatch truck is equipped with an agent to learn the optimal dispatch path in the dynamic DL-PBS network. We create an elite list to store the Pareto optimal solutions of bicycle dispatch paths found in each action, and finally, get the Pareto frontier. Experimental results on the actual DL-PBS systems show that compared with existing methods, MORL-BD can find a higher quality Pareto frontier with less execution time.

preprint2021arXiv

Dynamic Graph Collaborative Filtering

Dynamic recommendation is essential for modern recommender systems to provide real-time predictions based on sequential data. In real-world scenarios, the popularity of items and interests of users change over time. Based on this assumption, many previous works focus on interaction sequences and learn evolutionary embeddings of users and items. However, we argue that sequence-based models are not able to capture collaborative information among users and items directly. Here we propose Dynamic Graph Collaborative Filtering (DGCF), a novel framework leveraging dynamic graphs to capture collaborative and sequential relations of both items and users at the same time. We propose three update mechanisms: zero-order 'inheritance', first-order 'propagation', and second-order 'aggregation', to represent the impact on a user or item when a new interaction occurs. Based on them, we update related user and item embeddings simultaneously when interactions occur in turn, and then use the latest embeddings to make recommendations. Extensive experiments conducted on three public datasets show that DGCF significantly outperforms the state-of-the-art dynamic recommendation methods up to 30. Our approach achieves higher performance when the dataset contains less action repetition, indicating the effectiveness of integrating dynamic collaborative information.

preprint2021arXiv

Dynamic Planning of Bicycle Stations in Dockless Public Bicycle-sharing System Using Gated Graph Neural Network

Benefiting from convenient cycling and flexible parking locations, the Dockless Public Bicycle-sharing (DL-PBS) network becomes increasingly popular in many countries. However, redundant and low-utility stations waste public urban space and maintenance costs of DL-PBS vendors. In this paper, we propose a Bicycle Station Dynamic Planning (BSDP) system to dynamically provide the optimal bicycle station layout for the DL-PBS network. The BSDP system contains four modules: bicycle drop-off location clustering, bicycle-station graph modeling, bicycle-station location prediction, and bicycle-station layout recommendation. In the bicycle drop-off location clustering module, candidate bicycle stations are clustered from each spatio-temporal subset of the large-scale cycling trajectory records. In the bicycle-station graph modeling module, a weighted digraph model is built based on the clustering results and inferior stations with low station revenue and utility are filtered. Then, graph models across time periods are combined to create a graph sequence model. In the bicycle-station location prediction module, the GGNN model is used to train the graph sequence data and dynamically predict bicycle stations in the next period. In the bicycle-station layout recommendation module, the predicted bicycle stations are fine-tuned according to the government urban management plan, which ensures that the recommended station layout is conducive to city management, vendor revenue, and user convenience. Experiments on actual DL-PBS networks verify the effectiveness, accuracy and feasibility of the proposed BSDP system.

preprint2021arXiv

Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation

The non-autoregressive models have boosted the efficiency of neural machine translation through parallelized decoding at the cost of effectiveness when comparing with the autoregressive counterparts. In this paper, we claim that the syntactic and semantic structures among natural language are critical for non-autoregressive machine translation and can further improve the performance. However, these structures are rarely considered in the existing non-autoregressive models. Inspired by this intuition, we propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer, for the task of neural machine translation. Moreover, we also consider the intermediate latent alignment within target sentences to better learn the long-term token dependencies. Experimental results on two real-world datasets (i.e., WMT14 En-De and WMT16 En-Ro) show that our model achieves a significantly faster speed, as well as keeps the translation quality when compared with several state-of-the-art non-autoregressive models.

preprint2021arXiv

Heterogeneous Similarity Graph Neural Network on Electronic Health Records

Mining Electronic Health Records (EHRs) has become a promising topic because of the rich information they contain. By learning from EHRs, machine learning models can be built to help human experts make medical decisions and thus improve healthcare quality. Recently, many models based on sequential or graph models have been proposed to achieve this goal. EHRs contain multiple entities and relations and can be viewed as a heterogeneous graph. However, previous studies ignore the heterogeneity in EHRs. On the other hand, current heterogeneous graph neural networks cannot be simply used on an EHR graph because of the existence of hub nodes in it. To address this issue, we propose Heterogeneous Similarity Graph Neural Network (HSGNN) to analyze EHRs with a novel heterogeneous GNN. Our framework consists of two parts: one is a preprocessing method and the other is an end-to-end GNN. The preprocessing method normalizes edges and splits the EHR graph into multiple homogeneous graphs, where each homogeneous graph contains partial information of the original EHR graph. The GNN takes all homogeneous graphs as input and fuses all of them into one graph to make a prediction. Experimental results show that HSGNN outperforms other baselines in the diagnosis prediction task.

preprint2021arXiv

Hyper Meta-Path Contrastive Learning for Multi-Behavior Recommendation

User purchasing prediction with multi-behavior information remains a challenging problem for current recommendation systems. Various methods have been proposed to address it via leveraging the advantages of graph neural networks (GNNs) or multi-task learning. However, most existing works do not take the complex dependencies among different behaviors of users into consideration. They utilize simple and fixed schemes, like neighborhood information aggregation or mathematical calculation of vectors, to fuse the embeddings of different user behaviors to obtain a unified embedding to represent a user's behavioral patterns which will be used in downstream recommendation tasks. To tackle the challenge, in this paper, we first propose the concept of hyper meta-path to construct hyper meta-paths or hyper meta-graphs to explicitly illustrate the dependencies among different behaviors of a user. How to obtain a unified embedding for a user from hyper meta-paths and avoid the previously mentioned limitations simultaneously is critical. Thanks to the recent success of graph contrastive learning, we leverage it to learn embeddings of user behavior patterns adaptively instead of assigning a fixed scheme to understand the dependencies among different behaviors. A new graph contrastive learning based framework is proposed by coupling with hyper meta-paths, namely HMG-CR, which consistently and significantly outperforms all baselines in extensive comparison experiments.

preprint2021arXiv

KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning

Generative commonsense reasoning, which aims to empower machines to generate sentences with the capacity of reasoning over a set of concepts, is a critical bottleneck for text generation. Even the state-of-the-art pre-trained language generation models struggle at this task and often produce implausible and anomalous sentences. One reason is that they rarely consider incorporating the knowledge graph, which can provide rich relational information among the commonsense concepts. To promote the ability of commonsense reasoning for text generation, we propose a novel knowledge graph augmented pre-trained language generation model, KG-BART, which encompasses the complex relations of concepts through the knowledge graph and produces more logical and natural sentences as output. Moreover, KG-BART can leverage the graph attention to aggregate the rich concept semantics that enhances the model generalization on unseen concept sets. Experiments on the benchmark CommonGen dataset verify the effectiveness of our proposed approach by comparing with several strong pre-trained language generation models; in particular, KG-BART outperforms BART by 5.80 and 4.60 in terms of BLEU-3 and BLEU-4, respectively. Moreover, we also show that the context generated by our model can work as background scenarios to benefit downstream commonsense QA tasks.

preprint2021arXiv

Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks

Name disambiguation aims to identify unique authors with the same name. Existing name disambiguation methods always exploit author attributes to enhance disambiguation results. However, some discriminative author attributes (e.g., email and affiliation) may change because of graduation or job-hopping, which will result in the separation of the same author's papers in digital libraries. Although these attributes may change, an author's co-authors and research topics do not change frequently with time, which means that papers within a period have similar text and relation information in the academic network. Inspired by this idea, we introduce Multi-view Attention-based Pairwise Recurrent Neural Network (MA-PairRNN) to solve the name disambiguation problem. We divided papers into small blocks based on discriminative author attributes and blocks of the same author will be merged according to pairwise classification results of MA-PairRNN. MA-PairRNN combines heterogeneous graph embedding learning and pairwise similarity learning into a framework. In addition to attribute and structure information, MA-PairRNN also exploits semantic information by meta-path and generates node representation in an inductive way, which is scalable to large graphs. Furthermore, a semantic-level attention mechanism is adopted to fuse multiple meta-path based representations. A Pseudo-Siamese network consisting of two RNNs takes two paper sequences in publication time order as input and outputs their similarity. Results on two real-world datasets demonstrate that our framework has a significant and consistent improvement of performance on the name disambiguation task. It was also demonstrated that MA-PairRNN can perform well with a small amount of training data and have better generalization ability across different research areas.

preprint2021arXiv

Understanding WeChat User Preferences and "Wow" Diffusion

WeChat is the largest social instant messaging platform in China, with 1.1 billion monthly active users. "Top Stories" is a novel friend-enhanced recommendation engine in WeChat, in which users can read articles based on preferences of both their own and their friends. Specifically, when a user reads an article by opening it, the "click" behavior is private. Moreover, if the user clicks the "wow" button, (only) her/his direct connections will be aware of this action/preference. Based on the unique WeChat data, we aim to understand user preferences and "wow" diffusion in Top Stories at different levels. We have made some interesting discoveries. For instance, the "wow" probability of one user is negatively correlated with the number of connected components that are formed by her/his active friends, but the click probability is the opposite. We further study to what extent users' "wow" and click behavior can be predicted from their social connections. To address this problem, we present a hierarchical graph representation learning based model DiffuseGNN, which is capable of capturing the structure-based social observations discovered above. Our experiments show that the proposed method can significantly improve the prediction performance compared with alternative methods.

preprint2020arXiv

Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection

Graph-based models can help to detect suspicious fraud online. Owing to the development of Graph Neural Networks (GNNs), prior research work has proposed many GNN-based fraud detection frameworks based on either homogeneous graphs or heterogeneous graphs. These works follow the existing GNN framework by aggregating the neighboring information to learn the node embedding, which relies on the assumption that the neighbors share similar context, features, and relations. However, the inconsistency problem, i.e., the context inconsistency, feature inconsistency, and relation inconsistency, has hardly been investigated. In this paper, we introduce these inconsistencies and design a new GNN framework, GraphConsis, to tackle the inconsistency problem: (1) for the context inconsistency, we propose to combine the context embeddings with node features, (2) for the feature inconsistency, we design a consistency score to filter the inconsistent neighbors and generate the corresponding sampling probability, and (3) for the relation inconsistency, we learn relation attention weights associated with the sampled nodes. Empirical analysis on four datasets indicates the inconsistency problem is crucial in a fraud detection task. The extensive experiments prove the effectiveness of GraphConsis. We also released a GNN-based fraud detection toolbox with implementations of SOTA models. The code is available at https://github.com/safe-graph/DGFraud.

preprint2020arXiv

Attentional Graph Convolutional Networks for Knowledge Concept Recommendation in MOOCs in a Heterogeneous View

Massive open online courses (MOOCs) are becoming a fashionable mode of education, providing a large-scale and open-access learning opportunity for students to grasp knowledge. To attract students' interest, MOOC providers apply recommendation systems to recommend courses to students. However, as a course usually consists of a number of video lectures, with each one covering some specific knowledge concepts, directly recommending courses overlooks students' interest in specific knowledge concepts. To fill this gap, in this paper, we study the problem of knowledge concept recommendation. We propose an end-to-end graph neural network-based approach called Attentional Heterogeneous Graph Convolutional Deep Knowledge Recommender (ACKRec) for knowledge concept recommendation in MOOCs. Like other recommendation problems, it suffers from sparsity issues. To address this issue, we leverage both content information and context information to learn the representation of entities via graph convolution network. In addition to students and knowledge concepts, we consider other types of entities (e.g., courses, videos, teachers) and construct a heterogeneous information network (HIN) to capture the corresponding fruitful semantic relationships among different types of entities and incorporate them into the representation learning process. Specifically, we use meta-paths on the HIN to guide the propagation of students' preferences. With the help of these meta-paths, the students' preference distribution with respect to a candidate knowledge concept can be captured. Furthermore, we propose an attention mechanism to adaptively fuse the context information from different meta-paths, in order to capture the different interests of different students. The promising experiment results show that the proposed ACKRec is able to effectively recommend knowledge concepts to students pursuing online learning in MOOCs.

preprint2020arXiv

BasConv: Aggregating Heterogeneous Interactions for Basket Recommendation with Graph Convolutional Neural Network

Within-basket recommendation reduces the exploration time of users, where the user's intention of the basket matters. The intent of a shopping basket can be retrieved from both user-item collaborative filtering signals and multi-item correlations. By defining a basket entity to represent the basket intent, we can model this problem as a basket-item link prediction task in the User-Basket-Item (UBI) graph. Previous work solves the problem by leveraging user-item interactions and item-item interactions simultaneously. However, collectivity and heterogeneity characteristics are hardly investigated before. Collectivity defines the semantics of each node which should be aggregated from both directly and indirectly connected neighbors. Heterogeneity comes from multi-type interactions as well as multi-type nodes in the UBI graph. To this end, we propose a new framework named BasConv, which is based on the graph convolutional neural network. Our BasConv model has three types of aggregators specifically designed for three types of nodes. They collectively learn node embeddings from both neighborhood and high-order context. Additionally, the interactive layers in the aggregators can distinguish different types of interactions. Extensive experiments on two real-world datasets prove the effectiveness of BasConv. Our code is available online at https://github.com/JimLiu96/basConv.

preprint 2020 arXiv

Commonsense Evidence Generation and Injection in Reading Comprehension

Humans tackle reading comprehension not only based on the given context itself but often also by relying on commonsense beyond it. To empower the machine with commonsense reasoning, in this paper, we propose a Commonsense Evidence Generation and Injection framework in reading comprehension, named CEGI. The framework injects two kinds of auxiliary commonsense evidence into comprehensive reading to equip the machine with the ability of rational thinking. Specifically, we build two evidence generators: the first generator aims to generate textual evidence via a language model; the other generator aims to extract factual evidence (automatically aligned text-triples) from a commonsense knowledge graph after graph completion. This evidence incorporates contextual commonsense and serves as additional input to the model. Thereafter, we propose a deep contextual encoder to extract semantic relationships among the paragraph, question, option, and evidence. Finally, we employ a capsule network to extract different linguistic units (word and phrase) from the relations, and dynamically predict the optimal option based on the extracted units. Experiments on the CosmosQA dataset demonstrate that the proposed CEGI model outperforms the current state-of-the-art approaches and achieves an accuracy of 83.6% on the leaderboard.

preprint 2020 arXiv

Deep Collaborative Embedding for information cascade prediction

Recently, information cascade prediction has attracted increasing interest from researchers, but it is far from being well solved, partly due to three defects of existing works. First, existing works often assume an underlying information diffusion model, which is impractical in the real world due to the complexity of information diffusion. Second, existing works often ignore the prediction of the infection order, which also plays an important role in social network analysis. Finally, existing works often depend on underlying diffusion networks, which are likely unobservable in practice. In this paper, we aim at the prediction of both node infection and infection order without requiring knowledge about the underlying diffusion mechanism and the diffusion network, where the challenges are two-fold. The first is what cascading characteristics of nodes should be captured and how to capture them, and the second is how to model the non-linear features of nodes in information cascades. To address these challenges, we propose a novel model called Deep Collaborative Embedding (DCE) for information cascade prediction, which can capture not only the node structural property but also two kinds of node cascading characteristics. We propose an auto-encoder based collaborative embedding framework to learn the node embeddings with cascade collaboration and node collaboration, in which way the non-linearity of information cascades can be effectively captured. The results of extensive experiments conducted on real-world datasets verify the effectiveness of our approach.

preprint 2020 arXiv

Differentially Private Deep Learning with Smooth Sensitivity

Ensuring the privacy of sensitive data used to train modern machine learning models is of paramount importance in many areas of practice. One approach to study these concerns is through the lens of differential privacy. In this framework, privacy guarantees are generally obtained by perturbing models in such a way that specifics of data used to train the model are made ambiguous. A particular instance of this approach is through a "teacher-student" framework, wherein the teacher, who owns the sensitive data, provides the student with useful, but noisy, information, hopefully allowing the student model to perform well on a given task without access to particular features of the sensitive data. Because stronger privacy guarantees generally involve more significant perturbation on the part of the teacher, deploying existing frameworks fundamentally involves a trade-off between the student's performance and the privacy guarantee. One of the most important techniques used in previous works involves an ensemble of teacher models, which return information to a student based on a noisy voting procedure. In this work, we propose a novel voting mechanism with smooth sensitivity, which we call Immutable Noisy ArgMax, that, under certain conditions, can bear very large random noising from the teacher without affecting the useful information transferred to the student. Compared with previous work, our approach improves over the state-of-the-art methods on all measures, and scales to larger tasks with both better performance and stronger privacy ($ε\approx 0$). This newly proposed framework can be applied with any machine learning model, and provides an appealing solution for tasks that require training on a large amount of data.

preprint 2020 arXiv

DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis

This paper focuses on learning domain-oriented language models driven by end tasks, aiming to combine the worlds of both general-purpose language models (such as ELMo and BERT) and domain-specific language understanding. We propose DomBERT, an extension of BERT to learn from both an in-domain corpus and relevant domain corpora. This helps in learning domain language models with low resources. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.

preprint 2020 arXiv

Enhancing Graph Neural Network-based Fraud Detectors against Camouflaged Fraudsters

Graph Neural Networks (GNNs) have been widely applied to fraud detection problems in recent years, revealing the suspiciousness of nodes by aggregating their neighborhood information via different relations. However, few prior works have noticed the camouflage behavior of fraudsters, which could hamper the performance of GNN-based fraud detectors during the aggregation process. In this paper, we introduce two types of camouflage based on recent empirical studies, i.e., feature camouflage and relation camouflage. Existing GNNs have not addressed these two camouflages, which results in their poor performance in fraud detection problems. Alternatively, we propose a new model named CAmouflage-REsistant GNN (CARE-GNN), to enhance the GNN aggregation process with three unique modules against camouflages. Concretely, we first devise a label-aware similarity measure to find informative neighboring nodes. Then, we leverage reinforcement learning (RL) to find the optimal number of neighbors to be selected. Finally, the selected neighbors across different relations are aggregated together. Comprehensive experiments on two real-world fraud datasets demonstrate the effectiveness of the RL algorithm. The proposed CARE-GNN also outperforms state-of-the-art GNNs and GNN-based fraud detectors. We integrate all GNN-based fraud detectors as an open-source toolbox: https://github.com/safe-graph/DGFraud. The CARE-GNN code and datasets are available at https://github.com/YingtongDou/CARE-GNN.

preprint 2020 arXiv

Entity Synonym Discovery via Multipiece Bilateral Context Matching

Being able to automatically discover synonymous entities in an open-world setting benefits various tasks such as entity disambiguation or knowledge graph canonicalization. Existing works either only utilize entity features, or rely on structured annotations from a single piece of context where the entity is mentioned. To leverage diverse contexts where entities are mentioned, in this paper, we generalize the distributional hypothesis to a multi-context setting and propose a synonym discovery framework that detects entity synonyms from free-text corpora with considerations on effectiveness and robustness. As one of the key components in synonym discovery, we introduce a neural network model, SYNONYMNET, to determine whether or not two given entities are synonymous with each other. Instead of using entity features, SYNONYMNET makes use of multiple pieces of context in which the entity is mentioned, and compares the context-level similarity via a bilateral matching schema. Experimental results demonstrate that the proposed model is able to detect synonym sets that are not observed during training on both generic and domain-specific datasets: Wiki+Freebase, PubMed+UMLS, and MedBook+MKG, with up to 4.16% improvement in terms of Area Under the Curve and 3.19% in terms of Mean Average Precision compared to the best baseline method.

preprint 2020 arXiv

Fairness Constraints in Semi-supervised Learning

Fairness in machine learning has received considerable attention. However, most studies on fair learning focus on either supervised learning or unsupervised learning. Very few consider semi-supervised settings. Yet, in reality, most machine learning tasks rely on large datasets that contain both labeled and unlabeled data. One of the key issues with fair learning is the balance between fairness and accuracy. Previous studies argue that increasing the size of the training set can yield a better trade-off. We believe that enlarging the training set with unlabeled data may achieve a similar result. Hence, we develop a framework for fair semi-supervised learning, which is formulated as an optimization problem. This includes classifier loss to optimize accuracy, label propagation loss to optimize unlabeled data prediction, and fairness constraints over labeled and unlabeled data to optimize the fairness level. The framework is applied to logistic regression and support vector machines under the fairness metrics of disparate impact and disparate mistreatment. We theoretically analyze the source of discrimination in semi-supervised learning via bias, variance and noise decomposition. Extensive experiments show that our method is able to achieve fair semi-supervised learning, and reach a better trade-off between accuracy and fairness than fair supervised learning.

preprint 2020 arXiv

GCN for HIN via Implicit Utilization of Attention and Meta-paths

Heterogeneous information network (HIN) embedding, aiming to map the structure and semantic information in an HIN to distributed representations, has drawn considerable research attention. Graph neural networks for HIN embeddings typically adopt a hierarchical attention mechanism (including node-level and meta-path-level attention) to capture the information from meta-path-based neighbors. However, this complicated attention structure often cannot achieve the function of selecting meta-paths due to severe overfitting. Moreover, when propagating information, these methods do not distinguish direct (one-hop) meta-paths from indirect (multi-hop) ones. But from the perspective of network science, direct relationships are often believed to be more essential, which can only be used to model direct information propagation. To address these limitations, we propose a novel neural network method via implicitly utilizing attention and meta-paths, which can relieve the severe overfitting brought by the current over-parameterized attention mechanisms on HIN. We first use the multi-layer graph convolutional network (GCN) framework, which performs a discriminative aggregation at each layer, along with stacking the information propagation of directly linked meta-paths layer-by-layer, realizing the function of attention for selecting meta-paths in an indirect way. We then give an effective relaxation and improvement via introducing a new propagation operation which can be separated from aggregation. That is, we first model the whole propagation process with well-defined probabilistic diffusion dynamics, and then introduce a random graph-based constraint which allows it to reduce noise with the increase of layers. Extensive experiments demonstrate the superiority of the new approach over state-of-the-art methods.

preprint 2020 arXiv

Generative Temporal Link Prediction via Self-tokenized Sequence Modeling

We formalize networks with evolving structures as temporal networks and propose a generative link prediction model, Generative Link Sequence Modeling (GLSM), to predict future links for temporal networks. GLSM captures the temporal link formation patterns from the observed links with a sequence modeling framework and has the ability to generate the emerging links by inferring from the probability distribution on the potential future links. To avoid overfitting caused by treating each link as a unique token, we propose a self-tokenization mechanism to transform each raw link in the network to an abstract aggregation token automatically. The self-tokenization is seamlessly integrated into the sequence modeling framework, which allows the proposed GLSM model to have the generalization capability to discover link formation patterns beyond raw link sequences. We compare GLSM with the existing state-of-the-art methods on five real-world datasets. The experimental results demonstrate that GLSM obtains future positive links effectively in a generative fashion while achieving the best performance (2-10% improvements on AUC) among the alternatives.

preprint 2020 arXiv

Graph Learning Approaches to Recommender Systems: A Review

Recent years have witnessed the fast development of the emerging topic of Graph Learning based Recommender Systems (GLRS). GLRS mainly employ advanced graph learning approaches to model users' preferences and intentions as well as items' characteristics and popularity for Recommender Systems (RS). Unlike conventional RS, including content-based filtering and collaborative filtering, GLRS are built on simple or complex graphs where various objects, e.g., users, items, and attributes, are explicitly or implicitly connected. With the rapid development of graph learning, exploring and exploiting homogeneous or heterogeneous relations in graphs is a promising direction for building advanced RS. In this paper, we provide a systematic review of GLRS, on how they obtain the knowledge from graphs to improve the accuracy, reliability and explainability for recommendations. First, we characterize and formalize GLRS, and then summarize and categorize the key challenges in this new research area. Then, we survey the most recent and important developments in the area. Finally, we share some new research directions in this vibrant area.

preprint 2020 arXiv

Hybrid Deep Embedding for Recommendations with Dynamic Aspect-Level Explanations

Explainable recommendation is far from being well solved, partly due to three challenges. The first is the personalization of preference learning, which requires that different items/users have different contributions to the learning of user preference or item quality. The second one is dynamic explanation, which is crucial for the timeliness of recommendation explanations. The last one is the granularity of explanations. In practice, aspect-level explanations are more persuasive than item-level or user-level ones. In this paper, to address these challenges simultaneously, we propose a novel model called Hybrid Deep Embedding (HDE) for aspect-based explainable recommendations, which can make recommendations with dynamic aspect-level explanations. The main idea of HDE is to learn the dynamic embeddings of users and items for rating prediction and the dynamic latent aspect preference/quality vectors for the generation of aspect-level explanations, through fusion of the dynamic implicit feedback extracted from reviews and the attentive user-item interactions. Particularly, as the aspect preference/quality of users/items is learned automatically, HDE is able to capture the impact of aspects that are not mentioned in reviews of a user or an item. The extensive experiments conducted on real datasets verify the recommending performance and explainability of HDE. The source code of our work is available at https://github.com/lola63/HDE-Python.

preprint 2020 arXiv

Influence Maximization with Spontaneous User Adoption

We incorporate self-activation into influence propagation and propose the self-activation independent cascade (SAIC) model: nodes may be self-activated besides being selected as seeds, and influence propagates from both selected seeds and self-activated nodes. Self-activation reflects real-world scenarios in which people naturally share product recommendations with their friends even without marketing intervention. It also leads to two new forms of optimization problems: (a) preemptive influence maximization (PIM), which aims to find $k$ nodes that, if self-activated, can reach the largest number of nodes before other self-activated nodes; and (b) boosted preemptive influence maximization (BPIM), which aims to select $k$ seeds that are guaranteed to be activated and can reach the largest number of nodes before other self-activated nodes. We propose scalable algorithms for PIM and BPIM and prove that they achieve $1-\varepsilon$ approximation for PIM and $1-1/e-\varepsilon$ approximation for BPIM, for any $\varepsilon > 0$. Through extensive tests on real-world graphs, we demonstrate that our algorithms outperform the baseline algorithms significantly for the PIM problem in solution quality, and also outperform the baselines for BPIM when self-activation behaviors are non-uniform across nodes.

preprint 2020 arXiv

Joint Training Capsule Network for Cold Start Recommendation

This paper proposes a novel neural network, joint training capsule network (JTCN), for the cold start recommendation task. We propose to mimic the high-level user preference, rather than the raw interaction history, based on the side information for the fresh users. Specifically, an attentive capsule layer is proposed to aggregate high-level user preference from the low-level interaction history via a dynamic routing-by-agreement mechanism. Moreover, JTCN jointly trains the loss for mimicking the user preference and the softmax loss for the recommendation in an end-to-end manner. Experiments on two publicly available datasets demonstrate the effectiveness of the proposed model. JTCN improves over other state-of-the-art methods by at least 7.07% on CiteULike and 16.85% on Amazon in terms of Recall@100 in cold start recommendation.

preprint 2020 arXiv

Lifelong Property Price Prediction: A Case Study for the Toronto Real Estate Market

We present Luce, the first life-long predictive model for automated property valuation. Luce addresses two critical issues of property valuation: the lack of recent sold prices and the sparsity of house data. It is designed to operate on a limited volume of recent house transaction data. As a departure from prior work, Luce organizes the house data in a heterogeneous information network (HIN) where graph nodes are house entities and attributes that are important for house price valuation. We employ a Graph Convolutional Network (GCN) to extract the spatial information from the HIN for house-related data like geographical locations, and then use a Long Short Term Memory (LSTM) network to model the temporal dependencies for house transaction data over time. Unlike prior work, Luce can make effective use of the limited house transaction data from the past few months to update valuation information for all house entities within the HIN. By providing a complete and up-to-date house valuation dataset, Luce thus massively simplifies the downstream valuation task for the target properties. We demonstrate the benefit of Luce by applying it to large, real-life datasets obtained from the Toronto real estate market. Extensive experimental results show that Luce not only significantly outperforms prior property valuation methods but also often reaches and sometimes exceeds the valuation accuracy given by independent experts when using the actual realization price as the ground truth.

preprint 2020 arXiv

Med2Meta: Learning Representations of Medical Concepts with Meta-Embeddings

Distributed representations of medical concepts have recently been used to support downstream clinical tasks. Electronic Health Records (EHR) capture different aspects of patients' hospital encounters and serve as a rich source for augmenting clinical decision making by learning robust medical concept embeddings. However, the same medical concept can be recorded in different modalities (e.g., clinical notes, lab results), with each capturing salient information unique to that modality, and a holistic representation calls for a relevant feature ensemble from all information sources. We hypothesize that representations learned from heterogeneous data types would lead to performance enhancement on various clinical informatics and predictive modeling tasks. To this end, our proposed approach makes use of meta-embeddings, embeddings aggregated from learned embeddings. Firstly, modality-specific embeddings for each medical concept are learned with graph autoencoders. The ensemble of all the embeddings is then modeled as a meta-embedding learning problem to incorporate their correlating and complementary information through a joint reconstruction. Empirical results of our model on both quantitative and qualitative clinical evaluations have shown improvements over state-of-the-art embedding models, thus validating our hypothesis.

preprint 2020 arXiv

Modeling relation paths for knowledge base completion via joint adversarial training

Knowledge Base Completion (KBC), which aims at determining the missing relations between entity pairs, has received increasing attention in recent years. Most existing KBC methods focus on either embedding the Knowledge Base (KB) into a specific semantic space or leveraging the joint probability of Random Walks (RWs) on multi-hop paths. Only a few unified models take both semantic and path-related features into consideration with adequacy. In this paper, we propose a novel method to explore the intrinsic relationship between the single relation (i.e. 1-hop path) and multi-hop paths between paired entities. We use Hierarchical Attention Networks (HANs) to select important relations in multi-hop paths and encode them into low-dimensional vectors. By treating relations and multi-hop paths as two different input sources, we use a feature extractor, which is shared by two downstream components (i.e. relation classifier and source discriminator), to capture shared/similar information between them. By joint adversarial training, we encourage our model to extract features from the multi-hop paths which are representative for relation completion. We apply the trained model (except for the source discriminator) to several large-scale KBs for relation completion. Experimental results show that our method outperforms existing path information-based approaches. Since each sub-module of our model can be well interpreted, our model can be applied to a large number of relation learning tasks.

preprint 2020 arXiv

More Than Privacy: Applying Differential Privacy in Key Areas of Artificial Intelligence

Artificial Intelligence (AI) has attracted a great deal of attention in recent years. However, alongside all its advancements, problems have also emerged, such as privacy violations, security issues and model fairness. Differential privacy, as a promising mathematical model, has several attractive properties that can help solve these problems, making it quite a valuable tool. For this reason, differential privacy has been broadly applied in AI, but to date, no study has documented which differential privacy mechanisms can be or have been leveraged to overcome its issues, or the properties that make this possible. In this paper, we show that differential privacy can do more than just privacy preservation. It can also be used to improve security, stabilize learning, build fair models, and impose composition in selected areas of AI. With a focus on regular machine learning, distributed machine learning, deep learning, and multi-agent systems, the purpose of this article is to deliver a new view on many possibilities for improving AI performance with differential privacy techniques.

preprint 2020 arXiv

Not Just Cloud Privacy: Protecting Client Privacy in Teacher-Student Learning

Ensuring the privacy of sensitive data used to train modern machine learning models is of paramount importance in many areas of practice. One recent popular approach to study these concerns is using differential privacy via a "teacher-student" model, wherein the teacher provides the student with useful, but noisy, information, hopefully allowing the student model to perform well on a given task. However, these studies only solve the privacy concerns of the teacher by assuming the student owns a public but unlabelled dataset. In real life, the student also has privacy concerns about its unlabelled data, and so requires privacy protection for any data sent to the teacher. In this work, we re-design the privacy-preserving "teacher-student" model by adopting both private arbitrary masking and local differential privacy, which protects the sensitive information of each student sample. However, the traditional training of the teacher model is not robust on perturbed data. We use adversarial learning techniques to improve the robustness to perturbed samples, which supports returning good feedback without having all private information of each student sample. The experimental results demonstrate the effectiveness of our new privacy-preserving "teacher-student" model.

preprint 2020 arXiv

Optimizing Item and Subgroup Configurations for Social-Aware VR Shopping

Shopping in VR malls has been regarded as a paradigm shift for E-commerce, but most of the conventional VR shopping platforms are designed for a single user. In this paper, we envisage a scenario of VR group shopping, which brings major advantages over conventional group shopping in brick-and-mortar stores and Web shopping: 1) configuring a flexible display of items and partitioning of subgroups to address individual interests in the group, and 2) supporting social interactions in the subgroups to boost sales. Accordingly, we formulate the Social-aware VR Group-Item Configuration (SVGIC) problem to configure a set of displayed items for flexibly partitioned subgroups of users in VR group shopping. We prove SVGIC is NP-hard to approximate within $\frac{32}{31} - ε$. We design an approximation algorithm based on the idea of Co-display Subgroup Formation (CSF) to configure proper items for display to different subgroups of friends. Experimental results on real VR datasets and a user study with HTC VIVE show that our algorithms outperform baseline approaches by at least 30.1% in solution quality.

preprint 2020 arXiv

Robust Spammer Detection by Nash Reinforcement Learning

Online reviews provide product evaluations for customers to make decisions. Unfortunately, the evaluations can be manipulated using fake reviews ("spams") by professional spammers, who have learned increasingly insidious and powerful spamming strategies by adapting to the deployed detectors. Spamming strategies are hard to capture, as they can vary quickly over time, differ across spammers and target products, and, more critically, remain unknown in most cases. Furthermore, most existing detectors focus on detection accuracy, which is not well-aligned with the goal of maintaining the trustworthiness of product evaluations. To address the challenges, we formulate a minimax game where the spammers and spam detectors compete with each other on their practical goals that are not solely based on detection accuracy. Nash equilibria of the game lead to stable detectors that are agnostic to any mixed detection strategies. However, the game has no closed-form solution and is not differentiable to admit the typical gradient-based algorithms. We turn the game into two dependent Markov Decision Processes (MDPs) to allow efficient stochastic optimization based on multi-armed bandit and policy gradient. We experiment on three large review datasets using various state-of-the-art spamming and detection strategies and show that the optimization algorithm can reliably find an equilibrial detector that can robustly and effectively prevent spammers with any mixed spamming strategies from attaining their practical goal. Our code is available at https://github.com/YingtongDou/Nash-Detect.

preprint 2020 arXiv

Target Privacy Preserving for Social Networks

In this paper, we incorporate the realistic scenario of key protection into link privacy preserving and propose the target-link privacy preserving (TPP) model: target links, referred to as targets, are the most important and sensitive objectives that would be intentionally attacked by adversaries and therefore need privacy protection, while other links of less privacy concern are properly released to maintain the graph utility. The goal of TPP is to limit the target disclosure by deleting a budget-limited set of alternative non-target links, referred to as protectors, to defend against adversarial link predictions for all targets. Traditional link privacy preserving treats all links as targets and concentrates on structural-level protections, in which serious link disclosure and high graph utility loss are still the bottleneck of graph releasing today, while TPP focuses on target-level protections, in which key protection is implemented on a tiny fraction of critical targets to achieve better privacy protection and lower graph utility loss. Currently there is a lack of a clear TPP problem definition, provably optimal or near-optimal protector selection algorithms, and scalable implementations on large-scale social graphs. Firstly, we introduce the TPP model and propose a dissimilarity function used for measuring the defense ability against privacy analysis for the targets. We consider two different problems by budget assignment settings: 1) we protect all targets and optimize the dissimilarity of all targets with a single budget; 2) besides the protection of all targets, we also care about the protection of each target by assigning a local budget to every target, considering two local protector selections. We also provide scalable implementations and conduct experiments to demonstrate the effectiveness and efficiency of the proposed algorithms.

preprint 2020 arXiv

Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing (Technical Report)

Errors are prevalent in time series data, such as GPS trajectories or sensor readings. Existing methods focus more on anomaly detection but not on repairing the detected anomalies. By simply filtering out the dirty data via anomaly detection, applications could still be unreliable over the incomplete time series. Instead of simply discarding anomalies, we propose to (iteratively) repair them in time series data, by creatively bonding the beauty of temporal nature in anomaly detection with the widely considered minimum change principle in data repairing. Our major contributions include: (1) a novel framework of iterative minimum repairing (IMR) over time series data, (2) explicit analysis on convergence of the proposed iterative minimum repairing, and (3) efficient estimation of parameters in each iteration. Remarkably, with incremental computation, we reduce the complexity of parameter estimation from O(n) to O(1). Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches. In particular, we show that (the proposed) repairing indeed improves the time series classification application.

preprint 2020 arXiv

Uncovering Download Fraud Activities in Mobile App Markets

Download fraud is a prevalent threat in mobile App markets, where fraudsters manipulate the number of downloads of Apps via various cheating approaches. Purchased fake downloads can mislead recommendation and search algorithms and further lead to bad user experience in App markets. In this paper, we investigate the download fraud problem based on a company's App Market, which is one of the most popular Android App markets. We release a honeypot App on the App Market and purchase fake downloads from fraudster agents to track fraud activities in the wild. Based on our interaction with the fraudsters, we categorize download fraud activities into three types according to their intentions: boosting front end downloads, optimizing App search ranking, and enhancing user acquisition and retention rates. For the download fraud aimed at optimizing App search ranking, we select, evaluate, and validate several features in identifying fake downloads based on billions of download records. To get a comprehensive understanding of download fraud, we further gather the stances of App marketers, fraudster agencies, and market operators on download fraud. The subsequent analysis and suggestions shed light on the ways to mitigate download fraud in App markets and other social platforms. To the best of our knowledge, this is the first work that investigates the download fraud problem in mobile App markets.

preprint 2020 arXiv

User Memory Reasoning for Conversational Recommendation

We study a conversational recommendation model which dynamically manages users' past (offline) preferences and current (online) requests through a structured and cumulative user memory knowledge graph, to allow for natural interactions and accurate recommendations. For this study, we create a new Memory Graph (MG) <--> Conversational Recommendation parallel corpus called MGConvRex with 7K+ human-to-human role-playing dialogs, grounded on a large-scale user memory bootstrapped from real-world user scenarios. MGConvRex captures human-level reasoning over user memory and has disjoint training/testing sets of users for zero-shot (cold-start) reasoning for recommendation. We propose a simple yet expandable formulation for constructing and updating the MG, and a reasoning model that predicts optimal dialog policies and recommendation items in unconstrained graph space. The prediction of our proposed model inherits the graph structure, providing a natural way to explain the model's recommendation. Experiments are conducted for both offline metrics and online simulation, showing competitive results.

preprint 2019 arXiv

A Comprehensive Survey on Graph Neural Networks

Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into four categories, namely recurrent graph neural networks, convolutional graph neural networks, graph autoencoders, and spatial-temporal graph neural networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes, benchmark data sets, and model evaluation of graph neural networks. Finally, we propose potential research directions in this rapidly growing field.

preprint 2019 arXiv

Leveraging Semi-Supervised Learning for Fairness using Neural Networks

There has been growing concern about the fairness of decision-making systems based on machine learning. The shortage of labeled data has always been a challenging problem facing machine learning based systems. In such scenarios, semi-supervised learning has been shown to be an effective way of exploiting unlabeled data to improve the performance of the model. Notably, unlabeled data do not contain label information, which itself can be a significant source of bias in training machine learning systems. This inspired us to tackle the challenge of fairness by formulating the problem in a semi-supervised framework. In this paper, we propose a semi-supervised algorithm using neural networks that benefits from unlabeled data to improve not just the performance but also the fairness of the decision-making process. The proposed model, called SSFair, exploits the information in the unlabeled data to mitigate the bias in the training data.

preprint 2017 arXiv

LoPub: High-Dimensional Crowdsourced Data Publication with Local Differential Privacy

High-dimensional crowdsourced data collected from a large number of users produces rich knowledge for our society. However, it also brings unprecedented privacy threats to participants. Local privacy, a variant of differential privacy, is proposed as a means to eliminate the privacy concern. Unfortunately, achieving local privacy on high-dimensional crowdsourced data raises great challenges in both efficiency and effectiveness. Here, based on EM and Lasso regression, we propose efficient multi-dimensional joint distribution estimation algorithms with local privacy. Then, we develop a Locally privacy-preserving high-dimensional data Publication algorithm, LoPub, by taking advantage of our distribution estimation techniques. In particular, both correlations and joint distribution among multiple attributes can be identified to reduce the dimension of crowdsourced data, thus achieving both efficiency and effectiveness in locally private high-dimensional data publication. Extensive experiments on real-world datasets demonstrate the efficiency of our multivariate distribution estimation scheme and confirm the effectiveness of our LoPub scheme in generating approximate datasets with local privacy.