Source author record

Xifeng Yan

Xifeng Yan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Databases, Artificial Intelligence, Computation and Language, Machine Learning, Social and Information Networks, Data Structures and Algorithms, Discrete Mathematics, Information Retrieval, math.CO, physics.soc-ph

Catalog footprint

What is connected

13 works

10 topics

4 close collaborators

preprint · 2026 · arXiv

Can Editing LLMs Inject Harm?

Large Language Models (LLMs) have emerged as a new information channel. Meanwhile, one critical but under-explored question is: Is it possible to bypass the safety alignment and inject harmful information into LLMs stealthily? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the first risk, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and the effectiveness for the former one is particularly high. For the second risk, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can degrade the overall fairness. Then, we further illustrate the high stealthiness of editing attacks. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.

preprint · 2022 · arXiv

Composite Re-Ranking for Efficient Document Search with BERT

Although considerable efforts have been devoted to transformer-based ranking models for document search, the relevance-efficiency tradeoff remains a critical problem for ad-hoc ranking. To overcome this challenge, this paper presents BECR (BERT-based Composite Re-Ranking), a composite re-ranking scheme that combines deep contextual token interactions and traditional lexical term-matching features. In particular, BECR exploits a token encoding mechanism to decompose the query representations into pre-computable uni-grams and skip-n-grams. By applying token encoding on top of a dual-encoder architecture, BECR separates the attentions between a query and a document while capturing the contextual semantics of a query. In contrast to previous approaches, this framework does not perform expensive BERT computations during online inference. Thus, it is significantly faster, yet still able to achieve high competitiveness in ad-hoc ranking relevance. Finally, an in-depth comparison between BECR and other state-of-the-art neural ranking baselines is described using the TREC datasets, thereby further demonstrating the enhanced relevance and efficiency of BECR.

preprint · 2022 · arXiv

Limitations of Language Models in Arithmetic and Symbolic Induction

Recent work has shown that large pretrained Language Models (LMs) can not only perform remarkably well on a range of Natural Language Processing (NLP) tasks but also start improving on reasoning tasks such as arithmetic induction, symbolic manipulation, and commonsense reasoning with increasing size of models. However, it is still unclear what the underlying capabilities of these LMs are. Surprisingly, we find that these models have limitations on certain basic symbolic manipulation tasks such as copy, reverse, and addition. When the total number of symbols or repeating symbols increases, the model performance drops quickly. We investigate the potential causes behind this phenomenon and examine a set of possible methods, including explicit positional markers, fine-grained computation steps, and LMs with callable programs. Experimental results show that none of these techniques can solve the simplest addition induction problem completely. In the end, we introduce LMs with tutor, which demonstrates every single step of teaching. LMs with tutor is able to deliver 100% accuracy in situations of OOD and repeating symbols, shedding new insights on the boundary of large LMs in induction.

preprint · 2021 · arXiv

Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases

Existing studies on question answering on knowledge bases (KBQA) mainly operate with the standard i.i.d. assumption, i.e., training distribution over questions is the same as the test distribution. However, i.i.d. may be neither reasonably achievable nor desirable on large-scale KBs because 1) true user distribution is hard to capture and 2) randomly sampling training examples from the enormous space would be highly data-inefficient. Instead, we suggest that KBQA models should have three levels of built-in generalization: i.i.d., compositional, and zero-shot. To facilitate the development of KBQA models with stronger generalization, we construct and release a new large-scale, high-quality dataset with 64,331 questions, GrailQA, and provide evaluation settings for all three levels of generalization. In addition, we propose a novel BERT-based KBQA model. The combination of our dataset and model enables us to thoroughly examine and demonstrate, for the first time, the key role of pre-trained contextual embeddings like BERT in the generalization of KBQA.

preprint · 2020 · arXiv

Adaptive-Step Graph Meta-Learner for Few-Shot Graph Classification

Graph classification aims to extract accurate information from graph-structured data for classification and is becoming more and more important in the graph learning community. Although Graph Neural Networks (GNNs) have been successfully applied to graph classification tasks, most of them overlook the scarcity of labeled graph data in many applications. For example, in bioinformatics, obtaining protein graph labels usually needs laborious experiments. Recently, few-shot learning has been explored to alleviate this problem given only a few labeled graph samples of test classes. The shared sub-structures between training classes and test classes are essential in few-shot graph classification. Existing methods assume that the test classes belong to the same set of super-classes clustered from training classes. However, according to our observations, the label spaces of training classes and test classes usually do not overlap in real-world scenarios. As a result, the existing methods do not capture the local structures of unseen test classes well. To overcome this limitation, in this paper, we propose a direct method to capture the sub-structures with a well-initialized meta-learner within a few adaptation steps. More specifically, (1) we propose a novel framework consisting of a graph meta-learner, which uses GNN-based modules for fast adaptation on graph data, and a step controller for the robustness and generalization of the meta-learner; (2) we provide quantitative analysis for the framework and give a graph-dependent upper bound of the generalization error based on our framework; (3) extensive experiments on real-world datasets demonstrate that our framework gets state-of-the-art results on several few-shot graph classification tasks compared to baselines.

preprint · 2020 · arXiv

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length $L$, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only $O(L(\log L)^{2})$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
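The abstract above does not spell out the sparsity pattern behind the LogSparse Transformer. As a rough sketch, assume each cell attends to itself and to cells that are 1, 2, 4, 8, ... steps back, so that one layer stores about log L entries per cell instead of L. The function names and the exact index set below are illustrative assumptions, not the paper's definition.

```python
import math


def logsparse_indices(l):
    """Positions cell l may attend to: itself plus cells 1, 2, 4, 8, ... steps back.

    Assumed pattern for illustration only; indices are 0-based and causal.
    """
    indices = {l}
    step = 1
    while l - step >= 0:
        indices.add(l - step)
        step *= 2
    return sorted(indices)


def total_attended(length):
    """Number of (query, key) pairs stored by one sparse attention layer."""
    return sum(len(logsparse_indices(l)) for l in range(length))


L = 1024
dense_pairs = L * (L + 1) // 2      # causal full attention: quadratic in L
sparse_pairs = total_attended(L)    # assumed log-sparse pattern: about L * log2(L)

print(logsparse_indices(8))         # [0, 4, 6, 7, 8]
print(sparse_pairs, dense_pairs)
```

Under this assumed pattern a single layer needs on the order of $L \log L$ entries. If roughly $\log L$ such layers are stacked so that information can still flow between any two cells, the total would be consistent with the $O(L(\log L)^{2})$ figure quoted above; that stacking argument is an assumption here rather than something the abstract states.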

preprint · 2015 · arXiv

Behavior Query Discovery in System-Generated Temporal Graphs

Computer system monitoring generates huge amounts of logs that record the interaction of system entities. How to query such data to better understand system behaviors and identify potential system risks and malicious behaviors becomes a challenging task for system administrators due to the dynamics and heterogeneity of the data. System monitoring data are essentially heterogeneous temporal graphs with nodes being system entities and edges being their interactions over time. Given the complexity of such graphs, it becomes time-consuming for system administrators to manually formulate useful queries in order to examine abnormal activities, attacks, and vulnerabilities in computer systems. In this work, we investigate how to query temporal graphs and treat query formulation as a discriminative temporal graph pattern mining problem. We introduce TGMiner to mine discriminative patterns from system logs, and these patterns can be taken as templates for building more complex queries. TGMiner leverages temporal information in graphs to prune graph patterns that share similar growth trends without compromising pattern quality. Experimental results on real system data show that TGMiner is 6-32 times faster than baseline methods. The discovered patterns were verified by system experts; they achieved high precision (97%) and recall (91%).

preprint · 2015 · arXiv

Observability of Lattice Graphs

We consider a graph observability problem: how many edge colors are needed for an unlabeled graph so that an agent, walking from node to node, can uniquely determine its location from just the observed color sequence of the walk? Specifically, let G(n,d) be an edge-colored subgraph of d-dimensional (directed or undirected) lattice of size n^d = n * n * ... * n. We say that G(n,d) is t-observable if an agent can uniquely determine its current position in the graph from the color sequence of any t-dimensional walk, where the dimension is the number of different directions spanned by the edges of the walk. A walk in an undirected lattice G(n,d) has dimension between 1 and d, but a directed walk can have dimension between 1 and 2d because of two different orientations for each axis. We derive bounds on the number of colors needed for t-observability. Our main result is that Theta(n^(d/t)) colors are both necessary and sufficient for t-observability of G(n,d), where d is considered a constant. This shows an interesting dependence of graph observability on the ratio between the dimension of the lattice and that of the walk. In particular, the number of colors for full-dimensional walks is Theta(n^(1/2)) in the directed case, and Theta(n) in the undirected case, independent of the lattice dimension. All of our results extend easily to non-square lattices: given a lattice graph of size N = n_1 * n_2 * ... * n_d, the number of colors for t-observability is Theta (N^(1/t)).

preprint · 2013 · arXiv

Querying Knowledge Graphs by Example Entity Tuples

We witness an unprecedented proliferation of knowledge graphs that record millions of entities and their relationships. While knowledge graphs are structure-flexible and content rich, they are difficult to use. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. If writing structured queries over simple tables is difficult, complex graphs are only harder to query. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to form complex graph queries. Our system, GQBE (Graph Query By Example), automatically derives a weighted hidden maximal query graph based on input query tuples, to capture a user's query intent. It efficiently finds and ranks the top approximate answer tuples. For fast query processing, GQBE only partially evaluates query graphs. We conducted experiments and user studies on the large Freebase and DBpedia datasets and observed appealing accuracy and efficiency. Our system provides a complementary approach to the existing keyword-based methods, facilitating user-friendly graph querying. To the best of our knowledge, there was no such proposal in the past in the context of graphs.

preprint · 2012 · arXiv

Inferring the Underlying Structure of Information Cascades

In social networks, information and influence diffuse among users as cascades. While the importance of studying cascades has been recognized in various applications, it is difficult to observe the complete structure of cascades in practice. Moreover, much less is known on how to infer cascades based on partial observations. In this paper we study the cascade inference problem following the independent cascade model, and provide a full treatment from complexity to algorithms: (a) We propose the idea of consistent trees as the inferred structures for cascades; these trees connect source nodes and observed nodes with paths satisfying the constraints from the observed temporal information. (b) We introduce metrics to measure the likelihood of consistent trees as inferred cascades, as well as several optimization problems for finding them. (c) We show that the decision problems for consistent trees are in general NP-complete, and that the optimization problems are hard to approximate. (d) We provide approximation algorithms with performance guarantees on the quality of the inferred cascades, as well as heuristics. We experimentally verify the efficiency and effectiveness of our inference algorithms, using real and synthetic data.

preprint · 2012 · arXiv

MaTrust: An Effective Multi-Aspect Trust Inference Model

Trust is a fundamental concept in many real-world applications such as e-commerce and peer-to-peer networks. In these applications, users can generate local opinions about the counterparts based on direct experiences, and these opinions can then be aggregated to build trust among unknown users. The mechanism to build new trust relationships based on existing ones is referred to as trust inference. State-of-the-art trust inference approaches employ the transitivity property of trust by propagating trust along connected users. In this paper, we propose a novel trust inference model (MaTrust) by exploring an equally important property of trust, i.e., the multi-aspect property. MaTrust directly characterizes multiple latent factors for each trustor and trustee from the locally-generated trust relationships. Furthermore, it can naturally incorporate prior knowledge as specified factors. These factors in turn serve as the basis to infer the unseen trustworthiness scores. Experimental evaluations on real data sets show that the proposed MaTrust significantly outperforms several benchmark trust inference models in both effectiveness and efficiency.

preprint · 2012 · arXiv

Measuring Two-Event Structural Correlations on Graphs

Real-life graphs usually have various kinds of events happening on them, e.g., product purchases in online social networks and intrusion alerts in computer networks. The occurrences of events on the same graph could be correlated, exhibiting either attraction or repulsion. Such structural correlations can reveal important relationships between different events. Unfortunately, correlation relationships on graph structures are not well studied and cannot be captured by traditional measures. In this work, we design a novel measure for assessing two-event structural correlations on graphs. Given the occurrences of two events, we choose uniformly a sample of "reference nodes" from the vicinity of all event nodes and employ the Kendall's tau rank correlation measure to compute the average concordance of event density changes. Significance can be efficiently assessed by tau's nice property of being asymptotically normal under the null hypothesis. In order to compute the measure in large scale networks, we develop a scalable framework using different sampling strategies. The complexity of these strategies is analyzed. Experiments on real graph datasets with both synthetic and real events demonstrate that the proposed framework is not only efficacious, but also efficient and scalable.
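To make the concordance idea in the abstract above concrete, here is a minimal sketch of Kendall's tau computed over pairs of reference nodes. It assumes the densities of the two events around each sampled reference node are already available; the density values, variable names, and the choice of the tau-a variant (ties count as neither concordant nor discordant) are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1      # both densities move in the same direction
        elif product < 0:
            score -= 1      # densities move in opposite directions
    n = len(xs)
    return score / (n * (n - 1) / 2)


# Hypothetical densities of events a and b around five sampled reference nodes.
density_a = [0.10, 0.40, 0.35, 0.80, 0.05]
density_b = [0.20, 0.30, 0.50, 0.90, 0.10]

tau = kendall_tau(density_a, density_b)
print(tau)
```

A value near 1 would suggest attraction between the two events, a value near -1 repulsion, and a value near 0 no structural correlation. In this toy input nine of the ten node pairs are concordant and one is discordant, so the measure is strongly positive.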

preprint · 2012 · arXiv

Memory Efficient De Bruijn Graph Construction

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substrings of length k), MSP achieves an astonishing compression ratio: the total size of partitions is reduced from $\Theta(kn)$ to $\Theta(n)$, where $n$ is the size of the short read database, and $k$ is the length of a $k$-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.
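The abstract above does not describe how partitions are chosen. One plausible reading of partitioning by minimum substring, sketched below, is to key each k-mer by its lexicographically smallest substring of some length p and to merge runs of adjacent k-mers that share the same key into a single longer string. The parameter `p`, the helper names, and the merging rule are assumptions for illustration; the paper's actual procedure may differ.

```python
def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_read(read, k, p):
    """Split a read into (key, merged substring) pairs.

    Adjacent k-mers that share the same minimum p-substring are merged into
    one substring, so their overlapping characters are stored only once.
    Assumes len(read) >= k >= p.
    """
    pieces = []
    start = 0
    current = min_substring(read[0:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != current:
            # k-mers start..i-1 share `current`; together they span read[start:i-1+k]
            pieces.append((current, read[start:i - 1 + k]))
            start, current = i, key
    pieces.append((current, read[start:]))
    return pieces


read = "TTACGGA"                       # hypothetical short read
pieces = partition_read(read, k=4, p=2)
print(pieces)                          # [('AC', 'TTACGG'), ('CG', 'CGGA')]

stored = sum(len(s) for _, s in pieces)
naive = (len(read) - 4 + 1) * 4        # every 4-mer written out separately
print(stored, naive)                   # 10 16
```

In this toy case four 4-mers (16 characters if written out separately) are stored as two merged strings totalling 10 characters, which hints at how exploiting overlaps could move the total from roughly $kn$ toward $n$.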
