Researcher profile

Yunjun Gao

Yunjun Gao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2026arXiv

All-in-one Graph-based Indexing for Hybrid Search on GPUs

Hybrid search has emerged as a promising paradigm that combines lexical and semantic retrieval, enhancing accuracy for applications such as recommendations, information retrieval, and Retrieval-Augmented Generation. However, existing methods are constrained by a trilemma: they sacrifice flexibility for efficiency, suffer from accuracy degradation, or incur prohibitive storage overhead for flexible combinations of retrieval paths. This paper introduces Allan-Poe, a novel all-in-one graph index accelerated by GPUs for efficient hybrid search. We first analyze the limitations of existing retrieval paradigms and extract key design principles for an effective hybrid index. Guided by the principles, we architect a unified graph-based index that flexibly integrates three retrieval paths (dense vector, sparse vector, and full-text) within a single, cohesive structure. To enable efficient construction, we design a GPU-accelerated pipeline featuring a warp-level hybrid distance kernel, RNG-IP joint pruning, and keyword-aware neighbor recycling. For query processing, we introduce a dynamic fusion framework that supports any combination of retrieval paths and weights without index reconstruction, flexibly leveraging logical structures from the knowledge graph to resolve complex multi-hop queries. Extensive experiments on 6 real-world datasets demonstrate that Allan-Poe achieves superior end-to-end query accuracy and outperforms state-of-the-art methods by 1.5x-186.4x in throughput, while significantly reducing storage overhead.

preprint2025arXiv

Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views

Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services -- from embedding lookups in large language models (LLMs), to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice. We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. From a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; (3) Data Distribution Sensitivity, highlighting index robustness across skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains such as image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation. Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them based on application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads.

preprint2024arXiv

View-based Explanations for Graph Neural Networks

Generating explanations for graph neural networks (GNNs) has been studied to understand their behavior in analytical tasks such as graph classification. Existing approaches aim to understand the overall results of GNNs rather than providing explanations for specific class labels of interest, and may return explanation structures that are hard to access, nor directly queryable.We propose GVEX, a novel paradigm that generates Graph Views for EXplanation. (1) We design a two-tier explanation structure called explanation views. An explanation view consists of a set of graph patterns and a set of induced explanation subgraphs. Given a database G of multiple graphs and a specific class label l assigned by a GNN-based classifier M, it concisely describes the fraction of G that best explains why l is assigned by M. (2) We propose quality measures and formulate an optimization problem to compute optimal explanation views for GNN explanation. We show that the problem is $Σ^2_P$-hard. (3) We present two algorithms. The first one follows an explain-and-summarize strategy that first generates high-quality explanation subgraphs which best explain GNNs in terms of feature influence maximization, and then performs a summarization step to generate patterns. We show that this strategy provides an approximation ratio of 1/2. Our second algorithm performs a single-pass to an input node stream in batches to incrementally maintain explanation views, having an anytime quality guarantee of 1/4 approximation. Using real-world benchmark data, we experimentally demonstrate the effectiveness, efficiency, and scalability of GVEX. Through case studies, we showcase the practical applications of GVEX.

preprint2022arXiv

ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities

Entity alignment (EA) aims at finding equivalent entities in different knowledge graphs (KGs). Embedding-based approaches have dominated the EA task in recent years. Those methods face problems that come from the geometric properties of embedding vectors, including hubness and isolation. To solve these geometric problems, many normalization approaches have been adopted for EA. However, the increasing scale of KGs renders it hard for EA models to adopt the normalization processes, thus limiting their usage in real-world applications. To tackle this challenge, we present ClusterEA, a general framework that is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches with a high entity equivalent rate. ClusterEA contains three components to align entities between large-scale KGs, including stochastic training, ClusterSampler, and SparseFusion. It first trains a large-scale Siamese GNN for EA in a stochastic fashion to produce entity embeddings. Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches. Finally, ClusterEA incorporates SparseFusion, which normalizes local and global similarity and then fuses all similarity matrices to obtain the final similarity matrix. Extensive experiments with real-life datasets on EA benchmarks offer insight into the proposed framework, and suggest that it is capable of outperforming the state-of-the-art scalable EA framework by up to 8 times in terms of Hits@1.

preprint2022arXiv

Deep Spatially and Temporally Aware Similarity Computation for Road Network Constrained Trajectories

Trajectory similarity computation has drawn massive attention, as it is core functionality in a wide range of applications such as ride-sharing, traffic analysis, and social recommendation. Motivated by the recent success of deep learning technologies, researchers start devoting efforts to learning-based similarity analyses to overcome the limitations (i.e., high cost and poor adaptability) of traditional methods. Specifically, deep trajectory similarity computation aims to learn a distance function that can evaluate how similar two trajectories are via neural networks. However, existing learning-based methods focus on spatial similarity but ignore the time dimension of trajectories, which is suboptimal for time-aware applications. Besides, they tend to disregard the embedding of trajectories into road networks, restricting their applicability in real scenarios. In this paper, we propose an effective learning-based framework, called ST2Vec, to perform efficient spatially and temporally aware trajectory similarity computation in road networks. Finally, extensive experimental evaluation using three real trajectory data sets shows that ST2Vec outperforms all the state-of-the-art approaches substantially.

preprint2022arXiv

HAKG: Hierarchy-Aware Knowledge Gated Network for Recommendation

Knowledge graph (KG) plays an increasingly important role to improve the recommendation performance and interpretability. A recent technical trend is to design end-to-end models based on information propagation schemes. However, existing propagation-based methods fail to (1) model the underlying hierarchical structures and relations, and (2) capture the high-order collaborative signals of items for learning high-quality user and item representations. In this paper, we propose a new model, called Hierarchy-Aware Knowledge Gated Network (HAKG), to tackle the aforementioned problems. Technically, we model users and items (that are captured by a user-item graph), as well as entities and relations (that are captured in a KG) in hyperbolic space, and design a hyperbolic aggregation scheme to gather relational contexts over KG. Meanwhile, we introduce a novel angle constraint to preserve characteristics of items in the embedding space. Furthermore, we propose a dual item embeddings design to represent and propagate collaborative signals and knowledge associations separately, and leverage the gated aggregation to distill discriminative information for better capturing user behavior patterns. Experimental results on three benchmark datasets show that, HAKG achieves significant improvement over the state-of-the-art methods like CKAN, Hyper-Know, and KGIN. Further analyses on the learned hyperbolic embeddings confirm that HAKG offers meaningful insights into the hierarchies of data.

preprint2022arXiv

Indexing Metric Spaces for Exact Similarity Search

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern the variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data have been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study exists has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.

preprint2022arXiv

Maximizing the Influence of Bichromatic Reverse k Nearest Neighbors in Geo-Social Networks

Geo-social networks offer opportunities for the marketing and promotion of geo-located services. In this setting, we explore a new problem, called Maximizing the Influence of Bichromatic Reverse k Nearest Neighbors (MaxInfBRkNN). The objective is to find a set of points of interest (POIs), which are geo-textually and socially attractive to social influencers who are expected to largely promote the POIs through online influence propagation. In other words, the problem aims to detect an optimal set of POIs with the largest word-of-mouth (WOM) marketing potential. This functionality is useful in various real-life applications, including social advertising, location-based viral marketing, and personalized POI recommendation. However, solving MaxInfBRkNN with theoretical guarantees is challenging, because of the prohibitive overheads on BRkNN retrieval in geo-social networks, and the NP and #P-hardness in finding the optimal POI set. To achieve practical solutions, we present a framework with carefully designed indexes, efficient batch BRkNN processing algorithms, and alternative POI selection policies that support both approximate and heuristic solutions. Extensive experiments on real and synthetic datasets demonstrate the good performance of our proposed methods.

preprint2022arXiv

MetaKG: Meta-learning on Knowledge Graph for Cold-start Recommendation

A knowledge graph (KG) consists of a set of interconnected typed entities and their attributes. Recently, KGs are popularly used as the auxiliary information to enable more accurate, explainable, and diverse user preference recommendations. Specifically, existing KG-based recommendation methods target modeling high-order relations/dependencies from long connectivity user-item interactions hidden in KG. However, most of them ignore the cold-start problems (i.e., user cold-start and item cold-start) of recommendation analytics, which restricts their performance in scenarios when involving new users or new items. Inspired by the success of meta-learning on scarce training samples, we propose a novel meta-learning based framework called MetaKG, which encompasses a collaborative-aware meta learner and a knowledge-aware meta learner, to capture meta users' preference and entities' knowledge for cold-start recommendations. The collaborative-aware meta learner aims to locally aggregate user preferences for each user preference learning task. In contrast, the knowledge-aware meta learner is to globally generalize knowledge representation across different user preference learning tasks. Guided by two meta learners, MetaKG can effectively capture the high-order collaborative relations and semantic representations, which could be easily adapted to cold-start scenarios. Besides, we devise a novel adaptive task scheduler which can adaptively select the informative tasks for meta learning in order to prevent the model from being corrupted by noisy tasks. Extensive experiments on various cold-start scenarios using three real data sets demonstrate that our presented MetaKG outperforms all the existing state-of-the-art competitors in terms of effectiveness, efficiency, and scalability.

preprint2022arXiv

PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching

Entity Matching (EM), which aims to identify whether two entity records from two relational tables refer to the same real-world entity, is one of the fundamental problems in data management. Traditional EM assumes that two tables are homogeneous with the aligned schema, while it is common that entity records of different formats (e.g., relational, semi-structured, or textual types) involve in practical scenarios. It is not practical to unify their schemas due to the different formats. To support EM on format-different entity records, Generalized Entity Matching (GEM) has been proposed and gained much attention recently. To do GEM, existing methods typically perform in a supervised learning way, which relies on a large amount of high-quality labeled examples. However, the labeling process is extremely labor-intensive, and frustrates the use of GEM. Low-resource GEM, i.e., GEM that only requires a small number of labeled examples, becomes an urgent need. To this end, this paper, for the first time, focuses on the low-resource GEM and proposes a novel low-resource GEM method, termed as PromptEM. PromptEM has addressed three challenging issues (i.e., designing GEM-specific prompt-tuning, improving pseudo-labels quality, and running efficient self-training) in low-resource GEM. Extensive experimental results on eight real benchmarks demonstrate the superiority of PromptEM in terms of effectiveness and efficiency.

preprint2022arXiv

Self-Guided Learning to Denoise for Robust Recommendation

The ubiquity of implicit feedback makes them the default choice to build modern recommender systems. Generally speaking, observed interactions are considered as positive samples, while unobserved interactions are considered as negative ones. However, implicit feedback is inherently noisy because of the ubiquitous presence of noisy-positive and noisy-negative interactions. Recently, some studies have noticed the importance of denoising implicit feedback for recommendations, and enhanced the robustness of recommendation models to some extent. Nonetheless, they typically fail to (1) capture the hard yet clean interactions for learning comprehensive user preference, and (2) provide a universal denoising solution that can be applied to various kinds of recommendation models. In this paper, we thoroughly investigate the memorization effect of recommendation models, and propose a new denoising paradigm, i.e., Self-Guided Denoising Learning (SGDL), which is able to collect memorized interactions at the early stage of the training (i.e., "noise-resistant" period), and leverage those data as denoising signals to guide the following training (i.e., "noise-sensitive" period) of the model in a meta-learning manner. Besides, our method can automatically switch its learning phase at the memorization point from memorization to self-guided learning, and select clean and informative memorized data via a novel adaptive denoising scheduler to improve the robustness. We incorporate SGDL with four representative recommendation models (i.e., NeuMF, CDAE, NGCF and LightGCN) and different loss functions (i.e., binary cross-entropy and BPR loss). The experimental results on three benchmark datasets demonstrate the effectiveness of SGDL over the state-of-the-art denoising methods like T-CE, IR, DeCA, and even state-of-the-art robust graph-based methods like SGCN and SGL.

preprint2022arXiv

Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Multi-frame human pose estimation has long been a compelling and fundamental problem in computer vision. This task is challenging due to fast motion and pose occlusion that frequently occur in videos. State-of-the-art methods strive to incorporate additional visual evidences from neighboring frames (supporting frames) to facilitate the pose estimation of the current frame (key frame). One aspect that has been obviated so far, is the fact that current methods directly aggregate unaligned contexts across frames. The spatial-misalignment between pose features of the current frame and neighboring frames might lead to unsatisfactory results. More importantly, existing approaches build upon the straightforward pose estimation loss, which unfortunately cannot constrain the network to fully leverage useful information from neighboring frames. To tackle these problems, we present a novel hierarchical alignment framework, which leverages coarse-to-fine deformations to progressively update a neighboring frame to align with the current frame at the feature level. We further propose to explicitly supervise the knowledge extraction from neighboring frames, guaranteeing that useful complementary cues are extracted. To achieve this goal, we theoretically analyzed the mutual information between the frames and arrived at a loss that maximizes the task-relevant mutual information. These allow us to rank No.1 in the Multi-frame Person Pose Estimation Challenge on benchmark dataset PoseTrack2017, and obtain state-of-the-art performance on benchmarks Sub-JHMDB and Pose-Track2018. Our code is released at https://github. com/Pose-Group/FAMI-Pose, hoping that it will be useful to the community.

preprint2020arXiv

CrowdTSC: Crowd-based Neural Networks for Text Sentiment Classification

Sentiment classification is a fundamental task in content analysis. Although deep learning has demonstrated promising performance in text classification compared with shallow models, it is still not able to train a satisfying classifier for text sentiment. Human beings are more sophisticated than machine learning models in terms of understanding and capturing the emotional polarities of texts. In this paper, we leverage the power of human intelligence into text sentiment classification. We propose Crowd-based neural networks for Text Sentiment Classification (CrowdTSC for short). We design and post the questions on a crowdsourcing platform to collect the keywords in texts. Sampling and clustering are utilized to reduce the cost of crowdsourcing. Also, we present an attention-based neural network and a hybrid neural network, which incorporate the collected keywords as human being's guidance into deep neural networks. Extensive experiments on public datasets confirm that CrowdTSC outperforms state-of-the-art models, justifying the effectiveness of crowd-based keyword guidance.

preprint2020arXiv

KGClean: An Embedding Powered Knowledge Graph Cleaning Framework

The quality assurance of the knowledge graph is a prerequisite for various knowledge-driven applications. We propose KGClean, a novel cleaning framework powered by knowledge graph embedding, to detect and repair the heterogeneous dirty data. In contrast to previous approaches that either focus on filling missing data or clean errors violated limited rules, KGClean enables (i) cleaning both missing data and other erroneous values, and (ii) mining potential rules automatically, which expands the coverage of error detecting. KGClean first learns data representations by TransGAT, an effective knowledge graph embedding model, which gathers the neighborhood information of each data and incorporates the interactions among data for casting data to continuous vector spaces with rich semantics. KGClean integrates an active learning-based classification model, which identifies errors with a small seed of labels. KGClean utilizes an efficient PRO-repair strategy to repair errors using a novel concept of propagation power. Extensive experiments on four typical knowledge graphs demonstrate the effectiveness of KGClean in practice.