Researcher profile

Richang Hong

Richang Hong contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
23 works
0 followers
8 topics
4 close collaborators

Published work

23 published item(s)

preprint, 2026, arXiv

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

preprint, 2026, arXiv

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR agents' capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents' capabilities in corporate financial analysis. This framework mirrors the professional analyst's workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct the FinDeepResearch benchmark, which comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code are publicly available at https://OpenFinArena.com/.

preprint, 2026, arXiv

Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

preprint, 2026, arXiv

Post-hoc Provider Fairness Adaptation via Hierarchical Exposure Alignment

Provider exposure fairness is crucial for sustaining a healthy content ecosystem and preventing monopolization in recommender systems. Yet, most existing methods either incorporate fairness constraints during model training, requiring expensive retraining when fairness objectives change, or rely on post-hoc reranking with fixed criteria, which lacks adaptability to diverse fairness requirements. To overcome these limitations, we propose Post-hoc Fairness Adaptation (PFA), a lightweight framework that equips a frozen recommender with a fairness adapter, enabling flexible fairness control without retraining the backbone model. Specifically, the fairness adapter learns personalized additive score adjustments from user-item embeddings, which are injected into the original ranking scores to steer provider exposure toward fairness. To train the adapter, we minimize the KL divergence between the actual and the target fair exposure distributions. However, this global objective implicitly treats all providers equally, ignoring structural disparities such as imbalanced provider group sizes and heterogeneous exposure within groups. Consequently, fairness may appear satisfied at an aggregate level while severe inter-group and intra-group exposure imbalances persist, undermining practical fairness. To address this, we design Hierarchical Exposure Fairness Alignment (HEFA), which explicitly balances inter- and intra-group provider exposure disparities, enabling flexible adaptation to diverse fairness requirements. To mitigate potential accuracy degradation, PFA jointly optimizes HEFA with a differentiable NDCG loss, enabling end-to-end fairness optimization while preserving ranking quality. Extensive experiments on three public datasets demonstrate that PFA achieves substantial fairness gains with negligible accuracy loss, consistently outperforming strong baselines.
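
The training signal described in this abstract can be sketched roughly as follows: additive adjustments are applied to frozen ranking scores, the adjusted scores are turned into a soft exposure distribution over providers, and that distribution is compared with a target using KL divergence. This is an illustration only. The softmax exposure model, the direction KL(actual || target), the uniform target, the hand-picked adjustments, and the names `provider_exposure` and `kl_divergence` are all assumptions, not the paper's formulation (which learns the adjustments with an adapter and adds the hierarchical HEFA objective).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def provider_exposure(scores, adjustments, item_provider, n_providers):
    """Soft exposure per provider: softmax of adjusted item scores, summed by provider.

    The softmax exposure model is an assumption made for this sketch.
    """
    adjusted_scores = [s + a for s, a in zip(scores, adjustments)]
    probs = softmax(adjusted_scores)
    exposure = [0.0] * n_providers
    for provider, prob in zip(item_provider, probs):
        exposure[provider] += prob
    return exposure


def kl_divergence(actual, target, eps=1e-12):
    """KL(actual || target) between two discrete distributions."""
    return sum(
        a * math.log((a + eps) / (t + eps))
        for a, t in zip(actual, target)
        if a > 0
    )


scores = [2.0, 1.5, 0.5, 0.2]   # frozen recommender scores for four items
item_provider = [0, 0, 1, 1]    # items 0-1 belong to provider 0, items 2-3 to provider 1
target = [0.5, 0.5]             # hypothetical target: equal exposure

exposure_before = provider_exposure(scores, [0.0] * 4, item_provider, 2)
# Hand-picked adjustments standing in for the learned adapter output.
exposure_after = provider_exposure(scores, [0.0, 0.0, 1.2, 1.2], item_provider, 2)

kl_before = kl_divergence(exposure_before, target)
kl_after = kl_divergence(exposure_after, target)
print(kl_before, kl_after)
```

Boosting the under-exposed provider's items moves the exposure distribution toward the target, so the divergence shrinks while the backbone scores themselves stay untouched.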

preprint, 2026, arXiv

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for "zombie" agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into "Active", "Standby", and "Terminated" states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing "Terminated" nodes and retaining "Standby" nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
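
The three-state scheduling idea in this abstract can be pictured with a small state machine. The sketch below is a simplification under stated assumptions: the paper learns a policy with a risk estimator, whereas here a fixed threshold rule stands in for that policy, and the thresholds, agent names, and risk values are hypothetical.

```python
from enum import Enum


class AgentState(Enum):
    ACTIVE = "Active"
    STANDBY = "Standby"
    TERMINATED = "Terminated"


def next_state(state, risk, standby_threshold=0.5, terminate_threshold=0.9):
    """Map an estimated hallucination risk in [0, 1] to the agent's next state.

    Thresholds are hypothetical; a learned policy would replace this rule.
    """
    if state is AgentState.TERMINATED:
        return state  # terminated agents are removed permanently
    if risk >= terminate_threshold:
        return AgentState.TERMINATED
    if risk >= standby_threshold:
        return AgentState.STANDBY  # kept for later rounds instead of being pruned
    return AgentState.ACTIVE


def run_round(states, risks):
    """Apply one round of soft state transitions to every agent."""
    return {agent: next_state(states[agent], risks[agent]) for agent in states}


states = {
    "solver": AgentState.ACTIVE,
    "critic": AgentState.ACTIVE,
    "retriever": AgentState.ACTIVE,
}
round1 = run_round(states, {"solver": 0.2, "critic": 0.7, "retriever": 0.95})
round2 = run_round(round1, {"solver": 0.3, "critic": 0.1, "retriever": 0.1})

for agent in states:
    print(agent, round1[agent].value, "->", round2[agent].value)
```

In this toy run the "critic" agent drops to Standby after a risky round and returns to Active once its risk falls, which is the recovery that hard pruning would have ruled out.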

preprint, 2026, arXiv

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.

preprint, 2022, arXiv

A Review-aware Graph Contrastive Learning Framework for Recommendation

Most modern recommender systems predict users' preferences with two components: user and item embedding learning, followed by user-item interaction modeling. By utilizing the auxiliary review information that accompanies user ratings, many existing review-based recommendation models enrich user/item embedding learning with historical reviews or better model user-item interactions with the help of available user-item target reviews. Though significant progress has been made, we argue that current solutions for review-based recommendation suffer from two drawbacks. First, as review-based recommendation can be naturally formulated as a user-item bipartite graph with edge features from corresponding user-item reviews, how can this unique graph structure be better exploited for recommendation? Second, while most current models suffer from limited user behaviors, can we exploit the unique self-supervised signals in the review-aware graph to better guide the two recommendation components? To this end, in this paper, we propose a novel Review-aware Graph Contrastive Learning (RGCL) framework for review-based recommendation. Specifically, we first construct a review-aware user-item graph with feature-enhanced edges from reviews, where each edge feature is composed of both the user-item rating and the corresponding review semantics. This graph with feature-enhanced edges can help attentively learn each neighbor node weight for user and item representation learning. After that, we design two additional contrastive learning tasks (i.e., Node Discrimination and Edge Discrimination) to provide self-supervised signals for the two components in the recommendation process. Finally, extensive experiments over five benchmark datasets demonstrate the superiority of our proposed RGCL compared to the state-of-the-art baselines.

preprint, 2022, arXiv

A Topic-Attentive Transformer-based Model For Multimodal Depression Detection

Depression is one of the most common mental disorders, which imposes heavy negative impacts on one's daily life. Diagnosing depression based on an interview usually takes the form of questions and answers. In this process, the audio signals and text transcripts of a subject are correlated with depression cues and easily recorded. Therefore, it is feasible to build an Automatic Depression Detection (ADD) model based on the data of these modalities in practice. However, there are two major challenges that should be addressed for constructing an effective ADD model. The first challenge is the organization of the textual and audio data, which can be of various contents and lengths for different subjects. The second challenge is the lack of training samples due to privacy concerns. Targeting these two challenges, we propose the TOpic ATtentive transformer-based ADD model, abbreviated as TOAT. To address the first challenge, in the TOAT model, the topic is taken as the basic unit of the textual and audio data according to the question-answer form in a typical interviewing process. Based on that, a topic attention module is designed to learn the importance of each topic, which helps the model better retrieve the depressed samples. To solve the issue of data scarcity, we introduce large pre-trained models, and the fine-tuning strategy is adopted based on the small-scale ADD training data. We also design a two-branch architecture with a late-fusion strategy for building the TOAT model, in which the textual and audio data are encoded independently. We evaluate our model on the multimodal DAIC-WOZ dataset specifically designed for the ADD task. Experimental results show the superiority of our method. More importantly, the ablation studies demonstrate the effectiveness of the key elements in the TOAT model.

preprint, 2022, arXiv

Automatic Depression Detection via Learning and Fusing Features from Visual Cues

Depression is one of the most prevalent mental disorders, which seriously affects one's life. Traditional depression diagnosis commonly depends on rating scales, which can be labor-intensive and subjective. In this context, Automatic Depression Detection (ADD) has been attracting more attention for its low cost and objectivity. ADD systems are able to detect depression automatically from some medical records, like video sequences. However, it remains challenging to effectively extract depression-specific information from long sequences, which hinders satisfactory accuracy. In this paper, we propose a novel ADD method via learning and fusing features from visual cues. Specifically, we first construct a Temporal Dilated Convolutional Network (TDCN), in which multiple Dilated Convolution Blocks (DCB) are designed and stacked, to learn the long-range temporal information from sequences. Then, the Feature-Wise Attention (FWA) module is adopted to fuse different features extracted from TDCNs. The module learns to assign weights for the feature channels, aiming to better incorporate different kinds of visual features and further enhance the detection accuracy. Our method achieves state-of-the-art performance on the DAIC-WOZ dataset compared to other visual-feature-based methods, showing its effectiveness.
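
To see why stacking dilated convolutions helps with long sequences, the following sketch computes the receptive field of a stack of stride-1 one-dimensional convolutions using the standard formula 1 + sum((k - 1) * d) over layers. The kernel size and dilation schedule are illustrative assumptions; the abstract does not state the values used in TDCN.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four layers with kernel size 3 (hypothetical configuration).
plain = receptive_field(3, [1, 1, 1, 1])     # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])   # exponentially growing dilation

print(plain, dilated)
```

With the same number of layers and parameters, the exponentially dilated stack covers a much longer temporal window than the undilated one.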

preprint, 2022, arXiv

Real-time Semantic Segmentation via Spatial-detail Guided Context Propagation

Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision computing tasks, e.g., semantic segmentation, are usually computationally expensive, posing a challenge to computing systems that are resource-constrained but require fast response speed. Therefore, it is valuable to develop accurate and real-time vision processing models that only require limited computational resources. To this end, we propose the Spatial-detail Guided Context Propagation Network (SGCPNet) for achieving real-time semantic segmentation. In SGCPNet, we propose the strategy of spatial-detail guided context propagation. It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. In this way, the need to maintain high-resolution features along the network is removed, which largely improves the model efficiency. On the other hand, due to the effective reconstruction of spatial details, the segmentation accuracy can still be preserved. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet model. On the Cityscapes dataset, for example, our SGCPNet achieves 69.5% mIoU segmentation accuracy, while its speed reaches 178.5 FPS on 768x1536 images on a GeForce GTX 1080 Ti GPU card. In addition, SGCPNet is very lightweight and only contains 0.61 M parameters.

preprint, 2022, arXiv

Revisiting Local Descriptor for Improved Few-Shot Classification

Few-shot classification studies the problem of quickly adapting a deep learner to understand novel classes based on a few support images. In this context, recent research efforts have been aimed at designing more and more complex classifiers that measure similarities between query and support images, but have seldom explored the importance of feature embeddings. We show that the reliance on sophisticated classifiers is not necessary, and a simple classifier applied directly to improved feature embeddings can instead outperform most of the leading methods in the literature. To this end, we present a new method named DCAP for few-shot classification, in which we investigate how one can improve the quality of embeddings by leveraging Dense Classification and Attentive Pooling. Specifically, we propose to train a learner on base classes with abundant samples to solve a dense classification problem first and then meta-train the learner on a set of randomly sampled few-shot tasks to adapt it to the few-shot scenario or the test-time scenario. During meta-training, we suggest pooling feature maps by applying attentive pooling instead of the widely used global average pooling (GAP) to prepare embeddings for few-shot classification. Attentive pooling learns to reweight local descriptors, explaining what the learner is looking for as evidence for decision making. Experiments on two benchmark datasets show the proposed method to be superior in multiple few-shot settings while being simpler and more explainable. Code is available at: https://github.com/Ukeyboard/dcap/.
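
The contrast between global average pooling and attentive pooling can be sketched on a handful of local descriptors. This is a minimal illustration, not the DCAP implementation: the dot-product scoring against a single `query` vector is an assumed attention form, and the descriptor values are made up.

```python
import math


def global_average_pool(descriptors):
    """GAP: every local descriptor gets the same weight."""
    n = len(descriptors)
    dim = len(descriptors[0])
    return [sum(d[j] for d in descriptors) / n for j in range(dim)]


def attentive_pool(descriptors, query):
    """Reweight local descriptors by a softmax over their dot product with `query`.

    `query` stands in for learned attention parameters (an assumption).
    Returns the pooled embedding and the attention weights.
    """
    scores = [sum(x * q for x, q in zip(d, query)) for d in descriptors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(descriptors[0])
    pooled = [sum(w * d[j] for w, d in zip(weights, descriptors)) for j in range(dim)]
    return pooled, weights


# Three local descriptors from a feature map (toy values).
descriptors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [2.0, 0.0]  # favors descriptors that are strong in the first channel

gap = global_average_pool(descriptors)
pooled, weights = attentive_pool(descriptors, query)
print(gap, pooled, weights)
```

The attention weights double as a rough explanation of which locations contributed to the embedding, which is the interpretability benefit the abstract points to.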

preprint, 2022, arXiv

Switchable Online Knowledge Distillation

Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student. Several crucial questions about the gap between them (e.g., why and when does a large gap harm performance, especially for the student, and how can the gap between teacher and student be quantified?) have received limited formal study. In this paper, we propose Switchable Online Knowledge Distillation (SwitOKD) to answer these questions. Instead of focusing on the accuracy gap at the test phase, as existing methods do, the core idea of SwitOKD is to adaptively calibrate the gap at the training phase, namely the distillation gap, via a switching strategy between two modes: expert mode (pause the teacher while keeping the student learning) and learning mode (restart the teacher). To obtain an appropriate distillation gap, we further devise an adaptive switching threshold, which provides a formal criterion as to when to switch to learning mode or expert mode, and thus improves the student's performance. Meanwhile, the teacher benefits from our adaptive switching threshold and remains basically on a par with other online methods. We further extend SwitOKD to multiple networks with two basis topologies. Finally, extensive experiments and analysis validate the merits of SwitOKD for classification over state-of-the-art methods. Our code is available at https://github.com/hfutqian/SwitOKD.
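
The switching strategy can be pictured as a per-step decision about which networks receive a gradient update. The sketch below is only a schematic under several assumptions: the gap is measured as the L1 distance between teacher and student class probabilities, the threshold is a fixed number rather than the adaptive one the paper derives, and a large gap is taken to trigger expert mode so that the student can catch up.

```python
def distillation_gap(teacher_probs, student_probs):
    """L1 distance between predicted class distributions (assumed gap measure)."""
    return sum(abs(t - s) for t, s in zip(teacher_probs, student_probs))


def choose_mode(gap, threshold):
    """Expert mode when the gap is too large, learning mode otherwise (assumed rule)."""
    return "expert" if gap > threshold else "learning"


def models_to_update(mode):
    """Expert mode pauses the teacher; learning mode trains both networks."""
    return ("student",) if mode == "expert" else ("teacher", "student")


threshold = 0.6  # hypothetical fixed value; the paper adapts it during training
teacher = [0.9, 0.05, 0.05]

far_student = [0.4, 0.3, 0.3]    # student lags well behind the teacher
near_student = [0.8, 0.1, 0.1]   # student has nearly caught up

mode_far = choose_mode(distillation_gap(teacher, far_student), threshold)
mode_near = choose_mode(distillation_gap(teacher, near_student), threshold)
print(mode_far, models_to_update(mode_far))
print(mode_near, models_to_update(mode_near))
```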

preprint, 2021, arXiv

DiffNet++: A Neural Influence and Interest Diffusion Network for Social Recommendation

Social recommendation has emerged to leverage social connections among users for predicting users' unknown preferences, which could alleviate the data sparsity issue in collaborative filtering based recommendation. Early approaches relied on utilizing each user's first-order social neighbors' interests for better user modeling and failed to model the social influence diffusion process from the global social network structure. Recently, we proposed a preliminary neural influence diffusion network (i.e., DiffNet) for social recommendation, which models the recursive social diffusion process to capture the higher-order relationships for each user. However, we argue that, as users play a central role in both the user-user social network and the user-item interest network, only modeling the influence diffusion process in the social network would neglect the users' latent collaborative interests in the user-item interest network. In this paper, we propose DiffNet++, an improved algorithm of DiffNet that models the neural influence diffusion and interest diffusion in a unified framework. By reformulating social recommendation as a heterogeneous graph with the social network and the interest network as input, DiffNet++ advances DiffNet by injecting information from both networks into user embedding learning at the same time. This is achieved by iteratively aggregating each user's embedding from three aspects: the user's previous embedding, the influence aggregation of social neighbors from the social network, and the interest aggregation of item neighbors from the user-item interest network. Furthermore, we design a multi-level attention network that learns how to attentively aggregate user embeddings from these three aspects. Finally, extensive experimental results on two real-world datasets clearly show the effectiveness of our proposed model.

preprint, 2020, arXiv

Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing

In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its potential ability to map contents from different modalities, especially vision and language, into the same space, so that it becomes efficient in cross-modal data retrieval. There are two main frameworks for CMH, differing from each other in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin.

preprint, 2020, arXiv

Estimation-Action-Reflection: Towards Deep Interaction Between Conversational and Recommender Systems

Recommender systems are embracing conversational technologies to obtain user preferences dynamically, and to overcome inherent limitations of their static models. A successful Conversational Recommender System (CRS) requires proper handling of interactions between conversation and recommendation. We argue that three fundamental problems need to be solved: 1) what questions to ask regarding item attributes, 2) when to recommend items, and 3) how to adapt to the users' online feedback. To the best of our knowledge, there is no unified framework that addresses these problems. In this work, we fill this gap by proposing a new CRS framework named Estimation-Action-Reflection, or EAR, which consists of three stages to better converse with users: (1) Estimation, which builds predictive models to estimate user preference on both items and item attributes; (2) Action, which learns a dialogue policy to determine whether to ask attributes or recommend items, based on the Estimation stage and conversation history; and (3) Reflection, which updates the recommender model when a user rejects the recommendations made by the Action stage. We present two conversation scenarios on binary and enumerated questions, and conduct extensive experiments on two datasets from Yelp and LastFM, for each scenario, respectively. Our experiments demonstrate significant improvements over the state-of-the-art method CRM [32], corresponding to fewer conversation turns and a higher level of recommendation hits.

preprint, 2020, arXiv

Joint Item Recommendation and Attribute Inference: An Adaptive Graph Convolutional Network Approach

In many recommender systems, users and items are associated with attributes, and users show preferences for items. The attribute information describes users' (items') characteristics and has a wide range of applications, such as user profiling, item annotation, and feature-enhanced recommendation. As annotating user (item) attributes is a labor-intensive task, the attribute values are often incomplete, with many missing attribute values. Therefore, item recommendation and attribute inference have become two main tasks in these platforms. Researchers have long agreed that user (item) attributes and the preference behavior are highly correlated. Some researchers proposed to leverage one kind of data for the remaining task, and showed improved performance. Nevertheless, these models either neglected the incompleteness of user (item) attributes or modeled the correlation of the two tasks with simple models, leading to suboptimal performance on these two tasks. To this end, in this paper, we define these two tasks in an attributed user-item bipartite graph, and propose an Adaptive Graph Convolutional Network (AGCN) approach for joint item recommendation and attribute inference. The key idea of AGCN is to iteratively perform two parts: 1) Learning graph embedding parameters with previously learned approximated attribute values to facilitate two tasks; 2) Sending the approximated updated attribute values back to the attributed graph for better graph embedding learning. Therefore, AGCN could adaptively adjust the graph embedding learning parameters by incorporating both the given attributes and the estimated attribute values, in order to provide weakly supervised information to refine the two tasks. Extensive experimental results on three real-world datasets clearly show the effectiveness of the proposed model.

preprint, 2020, arXiv

Learning to Transfer Graph Embeddings for Inductive Graph based Recommendation

With the increasing availability of videos, how to edit them and present the most interesting parts to users, i.e., video highlight, has become an urgent need with many broad applications. As users' visual preferences are subjective and vary from person to person, previous generalized video highlight extraction models fail to tailor to users' unique preferences. In this paper, we study the problem of personalized video highlight recommendation with rich visual content. By dividing each video into non-overlapping segments, we formulate the problem as a personalized segment recommendation task with many new segments in the test stage. The key challenges of this problem lie in the cold-start users with limited video highlight records in the training data and new segments without any user ratings at the test stage. In this paper, we propose an inductive Graph based Transfer learning framework for personalized video highlight Recommendation (TransGRec). TransGRec is composed of two parts: a graph neural network followed by an item embedding transfer network. Specifically, the graph neural network part exploits the higher-order proximity between users and segments to alleviate the user cold-start problem. The transfer network is designed to approximate the learned item embeddings from graph neural networks by taking each item's visual content as input, in order to tackle the new segment problem in the test phase. We design two detailed implementations of the transfer learning optimization function, and we show how the two parts of TransGRec can be efficiently optimized with different transfer learning optimization functions. Extensive experimental results on a real-world dataset clearly show the effectiveness of our proposed model.

preprint, 2020, arXiv

Memory-Augmented Relation Network for Few-Shot Learning

Metric-based few-shot learning methods concentrate on learning transferable feature embeddings that generalize well from seen categories to unseen categories under the supervision of a limited number of labelled instances. However, most of them treat each individual instance in the working context separately without considering its relationships with the others. In this work, we investigate a new metric-learning method, Memory-Augmented Relation Network (MRN), to explicitly exploit these relationships. In particular, for an instance, we choose the samples that are visually similar from the working context, and perform weighted information propagation to attentively aggregate helpful information from the chosen ones to enhance its representation. In MRN, we also formulate the distance metric as a learnable relation module which learns to compare for similarity measurement, and augment the working context with memory slots, both contributing to its generality. We empirically demonstrate that MRN yields significant improvement over its ancestor and achieves competitive or even better performance when compared with other few-shot learning approaches on the two major benchmark datasets, i.e., miniImagenet and tieredImagenet.

preprint, 2020, arXiv

Personalized Multimedia Item and Key Frame Recommendation

When recommending or advertising items to users, an emerging trend is to present each multimedia item with a key frame image (e.g., the poster of a movie). As each multimedia item can be represented as multiple fine-grained visual images (e.g., related images of the movie), personalized key frame recommendation is necessary in these applications to attract users' unique visual preferences. However, previous personalized key frame recommendation models relied on users' fine-grained image behavior of multimedia items (e.g., user-image interaction behavior), which is often not available in real scenarios. In this paper, we study the general problem of joint multimedia item and key frame recommendation in the absence of the fine-grained user-image behavior. We argue that the key challenge of this problem lies in discovering users' visual profiles for key frame recommendation, as most recommendation models would fail without any users' fine-grained image behavior. To tackle this challenge, we leverage users' item behavior by projecting users (items) in two latent spaces: a collaborative latent space and a visual latent space. We further design a model to discern both the collaborative and visual dimensions of users, and model how users make decisive item preferences from these two spaces. As a result, the learned user visual profiles could be directly applied for key frame recommendation. Finally, experimental results on a real-world dataset clearly show the effectiveness of our proposed model on the two recommendation tasks.

preprint2020arXiv

Real-world Person Re-Identification via Degradation Invariance Learning

Person re-identification (Re-ID) in real-world scenarios usually suffers from various degradation factors, e.g., low resolution, weak illumination, blurring and adverse weather. On the one hand, these degradations lead to severe loss of discriminative information, which significantly obstructs identity representation learning; on the other hand, the feature mismatch problem caused by low-level visual variations greatly reduces retrieval performance. An intuitive solution to this problem is to utilize low-level image restoration methods to improve the image quality. However, existing restoration methods cannot directly serve real-world Re-ID due to various limitations, e.g., the requirement for reference samples, the domain gap between synthesis and reality, and the incompatibility between low-level and high-level methods. In this paper, to solve the above problem, we propose a degradation invariance learning framework for real-world person Re-ID. By introducing a self-supervised disentangled representation learning strategy, our method is able to simultaneously extract identity-related robust features and remove real-world degradations without extra supervision. We use low-resolution images as the main demonstration, and experiments show that our approach is able to achieve state-of-the-art performance on several Re-ID benchmarks. In addition, our framework can be easily extended to other real-world degradation factors, such as weak illumination, with only a few modifications.

preprint2020arXiv

Revisiting Graph based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach

Graph Convolutional Networks (GCNs) are state-of-the-art graph-based representation learning models that iteratively stack multiple layers of convolution aggregation operations and non-linear activation operations. Recently, in Collaborative Filtering (CF) based Recommender Systems (RS), some researchers have modeled higher-layer collaborative signals with GCNs by treating the user-item interaction behavior as a bipartite graph. These GCN-based recommender models show superior performance compared to traditional works. However, these models suffer from training difficulty with non-linear activations for large user-item graphs. Besides, most GCN-based models cannot model deeper layers due to the over-smoothing effect of the graph convolution operation. In this paper, we revisit GCN-based CF models from two aspects. First, we empirically show that removing non-linearities enhances recommendation performance, which is consistent with the theories in simple graph convolutional networks. Second, we propose a residual network structure that is specifically designed for CF with user-item interaction modeling, which alleviates the over-smoothing problem in the graph convolution aggregation operation with sparse user-item interaction data. The proposed model is linear: it is easy to train, scales to large datasets, and yields better efficiency and effectiveness on two real datasets. We publish the source code at https://github.com/newlei/LRGCCF.
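
The two ideas in this abstract, propagation without non-linear activations and a residual structure over layers, can be illustrated with a small sketch. It is one possible reading under stated assumptions, not the code in the linked repository: the mean-with-self-loop aggregation, the dictionary-based graph, and the helper names `propagate` and `residual_score` are all hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Embeddings = Dict[str, Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def propagate(
    users: Embeddings, items: Embeddings, interactions: List[Tuple[str, str]]
) -> Tuple[Embeddings, Embeddings]:
    """One linear propagation step on the user-item bipartite graph.
    Each node averages itself with its neighbours; no activation is applied."""
    user_items: Dict[str, List[str]] = {u: [] for u in users}
    item_users: Dict[str, List[str]] = {i: [] for i in items}
    for u, i in interactions:
        user_items[u].append(i)
        item_users[i].append(u)
    new_users = {
        u: mean([users[u]] + [items[i] for i in user_items[u]]) for u in users
    }
    new_items = {
        i: mean([items[i]] + [users[u] for u in item_users[i]]) for i in items
    }
    return new_users, new_items


def residual_score(
    users: Embeddings,
    items: Embeddings,
    interactions: List[Tuple[str, str]],
    user: str,
    item: str,
    num_layers: int,
) -> float:
    """Residual prediction (assumed form): each layer adds its own
    inner-product score to the score accumulated from earlier layers."""
    score = dot(users[user], items[item])
    for _ in range(num_layers):
        users, items = propagate(users, items, interactions)
        score += dot(users[user], items[item])
    return score
```

Because every step is linear, the summed score equals the inner product of the layer-wise embeddings concatenated for the user and for the item, so the shallow layers keep contributing even when deeper embeddings become overly smooth.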

preprint2020arXiv

RGCF: Refined Graph Convolution Collaborative Filtering with concise and expressive embedding

Graph Convolution Network (GCN) has attracted significant attention and become the most popular method for learning graph representations. In recent years, many efforts have focused on integrating GCN into recommender tasks and have made remarkable progress. At their core is the explicit capture of high-order connectivities between the nodes in the user-item bipartite graph. However, we theoretically and empirically find an inherent drawback in these GCN-based recommendation methods: directly applying GCN to aggregate neighboring nodes introduces noise and information redundancy. Consequently, these models' capability of capturing high-order connectivities among different nodes is limited, leading to suboptimal performance on recommender tasks. The main reason is that the nonlinear network layer inside the GCN structure is not suitable for extracting non-semantic features (such as one-hot ID features) in collaborative filtering scenarios. In this work, we develop a new GCN-based Collaborative Filtering model, named Refined Graph convolution Collaborative Filtering (RGCF), in which the construction of the embeddings of users (items) is carefully redesigned from several aspects during the aggregation on the graph. Compared to state-of-the-art GCN-based recommendation methods, RGCF is more capable of capturing the implicit high-order connectivities inside the graph, and the resultant vector representations are more expressive. We conduct extensive experiments on three public million-size datasets, demonstrating that RGCF significantly outperforms state-of-the-art models. We release our code at https://github.com/hfutmars/RGCF.

preprint2019arXiv

A Hierarchical Attention Model for Social Contextual Image Recommendation

Image-based social networks are among the most popular social networking services in recent years. With a tremendous number of images uploaded every day, understanding users' preferences on user-generated images and making recommendations have become an urgent need. In fact, many hybrid models have been proposed to fuse various kinds of side information (e.g., image visual representation, social network) and user-item historical behavior for enhancing recommendation performance. However, due to the unique characteristics of the user-generated images in social image platforms, the previous studies failed to capture the complex aspects that influence users' preferences in a unified framework. Moreover, most of these hybrid models relied on predefined weights in combining different kinds of information, which usually resulted in sub-optimal recommendation performance. To this end, in this paper, we develop a hierarchical attention model for social contextual image recommendation. In addition to basic latent user interest modeling in the popular matrix factorization based recommendation, we identify three key aspects (i.e., upload history, social influence, and owner admiration) that affect each user's latent preferences, where each aspect summarizes a contextual factor from the complex relationships between users and images. After that, we design a hierarchical attention network that naturally mirrors the hierarchical relationship (the element level within each aspect, and the aspect level) of users' latent interests with the identified key aspects. Specifically, by taking embeddings from state-of-the-art deep learning models that are tailored for each kind of data, the hierarchical attention network can learn to attend differently to more or less content. Finally, extensive experimental results on real-world datasets clearly show the superiority of our proposed model.