Researcher profile

Yixin Cao

Yixin Cao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
21works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

21 published item(s)

preprint2026arXiv

ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging

Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training-free alternative by integrating multiple experts into a single model. In this paper, we propose Agent-Role Merging (ARM), an activation-guided, role-conditioned neuron transplantation method for model merging in LLM agents. ARM improves existing merging methods from static natural language tasks to multi-turn agent scenarios, and over the generalization ability across various interactive environments. This is achieved with a well designed 3-step framework: 1) constructing merged backbones, 2) selection based on its role-conditioned activation analysis, and 3) neuron transplantation for fine-grained refinements. Without gradient-based optimization, ARM improves cross-benchmark generalization while enjoying efficiency. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain-specific expert models, while demonstrating strong out-of-domain generalization.

preprint2026arXiv

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard howsampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.

preprint2026arXiv

Thinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart

Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89\% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.

preprint2026arXiv

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

preprint2026arXiv

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains a under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.

preprint2022arXiv

ERGO: Event Relational Graph Transformer for Document-level Event Causality Identification

Document-level Event Causality Identification (DECI) aims to identify causal relations between event pairs in a document. It poses a great challenge of across-sentence reasoning without clear causal indicators. In this paper, we propose a novel Event Relational Graph TransfOrmer (ERGO) framework for DECI, which improves existing state-of-the-art (SOTA) methods upon two aspects. First, we formulate DECI as a node classification problem by constructing an event relational graph, without the needs of prior knowledge or tools. Second, ERGO seamlessly integrates event-pair relation classification and global inference, which leverages a Relational Graph Transformer (RGT) to capture the potential causal chain. Besides, we introduce edge-building strategies and adaptive focal loss to deal with the massive false positives caused by common spurious correlation. Extensive experiments on two benchmark datasets show that ERGO significantly outperforms previous SOTA methods (13.1% F1 gains on average). We have conducted extensive quantitative analysis and case studies to provide insights for future research directions (Section 4.8).

preprint2022arXiv

Explainable Sparse Knowledge Graph Completion via High-order Graph Reasoning Network

Knowledge Graphs (KGs) are becoming increasingly essential infrastructures in many applications while suffering from incompleteness issues. The KG completion task (KGC) automatically predicts missing facts based on an incomplete KG. However, existing methods perform unsatisfactorily in real-world scenarios. On the one hand, their performance will dramatically degrade along with the increasing sparsity of KGs. On the other hand, the inference procedure for prediction is an untrustworthy black box. This paper proposes a novel explainable model for sparse KGC, compositing high-order reasoning into a graph convolutional network, namely HoGRN. It can not only improve the generalization ability to mitigate the information insufficiency issue but also provide interpretability while maintaining the model's effectiveness and efficiency. There are two main components that are seamlessly integrated for joint optimization. First, the high-order reasoning component learns high-quality relation representations by capturing endogenous correlation among relations. This can reflect logical rules to justify a broader of missing facts. Second, the entity updating component leverages a weight-free Graph Convolutional Network (GCN) to efficiently model KG structures with interpretability. Unlike conventional methods, we conduct entity aggregation and design composition-based attention in the relational space without additional parameters. The lightweight design makes HoGRN better suitable for sparse settings. For evaluation, we have conducted extensive experiments-the results of HoGRN on several sparse KGs present impressive improvements (9% MRR gain on average). Further ablation and case studies demonstrate the effectiveness of the main components. Our codes will be released upon acceptance.

preprint2022arXiv

Prompt for Extraction? PAIE: Prompting Argument Interaction for Event Argument Extraction

In this paper, we propose an effective yet efficient model PAIE for both sentence-level and document-level Event Argument Extraction (EAE), which also generalizes well when there is a lack of training data. On the one hand, PAIE utilizes prompt tuning for extractive objectives to take the best advantages of Pre-trained Language Models (PLMs). It introduces two span selectors based on the prompt to select start/end tokens among input texts for each role. On the other hand, it captures argument interactions via multi-role prompts and conducts joint optimization with optimal span assignments via a bipartite matching loss. Also, with a flexible prompt design, PAIE can extract multiple arguments with the same role instead of conventional heuristic threshold tuning. We have conducted extensive experiments on three benchmarks, including both sentence- and document-level EAE. The results present promising improvements from PAIE (3.5\% and 2.3\% F1 gains in average on three benchmarks, for PAIE-base and PAIE-large respectively). Further analysis demonstrates the efficiency, generalization to few-shot settings, and effectiveness of different extractive prompt tuning strategies. Our code is available at https://github.com/mayubo2333/PAIE.

preprint2022arXiv

Training Free Graph Neural Networks for Graph Matching

We present a framework of Training Free Graph Matching (TFGM) to boost the performance of Graph Neural Networks (GNNs) based graph matching, providing a fast promising solution without training (training-free). TFGM provides four widely applicable principles for designing training-free GNNs and is generalizable to supervised, semi-supervised, and unsupervised graph matching. The keys are to handcraft the matching priors, which used to be learned by training, into GNN's architecture and discard the components inessential under the training-free setting. Further analysis shows that TFGM is a linear relaxation to the quadratic assignment formulation of graph matching and generalizes TFGM to a broad set of GNNs. Extensive experiments show that GNNs with TFGM achieve comparable (if not better) performances to their fully trained counterparts, and demonstrate TFGM's superiority in the unsupervised setting. Our code is available at https://github.com/acharkq/Training-Free-Graph-Matching.

preprint2022arXiv

VEM$^2$L: A Plug-and-play Framework for Fusing Text and Structure Knowledge on Sparse Knowledge Graph Completion

Knowledge Graph Completion (KGC) aims to reason over known facts and infer missing links but achieves weak performances on those sparse Knowledge Graphs (KGs). Recent works introduce text information as auxiliary features or apply graph densification to alleviate this challenge, but suffer from problems of ineffectively incorporating structure features and injecting noisy triples. In this paper, we solve the sparse KGC from these two motivations simultaneously and handle their respective drawbacks further, and propose a plug-and-play unified framework VEM$^2$L over sparse KGs. The basic idea of VEM$^2$L is to motivate a text-based KGC model and a structure-based KGC model to learn with each other to fuse respective knowledge into unity. To exploit text and structure features together in depth, we partition knowledge within models into two nonoverlapping parts: expressiveness ability on the training set and generalization ability upon unobserved queries. For the former, we motivate these two text-based and structure-based models to learn from each other on the training sets. And for the generalization ability, we propose a novel knowledge fusion strategy derived by the Variational EM (VEM) algorithm, during which we also apply a graph densification operation to alleviate the sparse graph problem further. Our graph densification is derived by VEM algorithm. Due to the convergence of EM algorithm, we guarantee the increase of likelihood function theoretically with less being impacted by noisy injected triples heavily. By combining these two fusion methods and graph densification, we propose the VEM$^2$L framework finally. Both detailed theoretical evidence, as well as qualitative experiments, demonstrates the effectiveness of our proposed framework.

preprint2022arXiv

What Makes the Story Forward? Inferring Commonsense Explanations as Prompts for Future Event Generation

Prediction over event sequences is critical for many real-world applications in Information Retrieval and Natural Language Processing. Future Event Generation (FEG) is a challenging task in event sequence prediction because it requires not only fluent text generation but also commonsense reasoning to maintain the logical coherence of the entire event story. In this paper, we propose a novel explainable FEG framework, Coep. It highlights and integrates two types of event knowledge, sequential knowledge of direct event-event relations and inferential knowledge that reflects the intermediate character psychology between events, such as intents, causes, reactions, which intrinsically pushes the story forward. To alleviate the knowledge forgetting issue, we design two modules, Im and Gm, for each type of knowledge, which are combined via prompt tuning. First, Im focuses on understanding inferential knowledge to generate commonsense explanations and provide a soft prompt vector for Gm. We also design a contrastive discriminator for better generalization ability. Second, Gm generates future events by modeling direct sequential knowledge with the guidance of Im. Automatic and human evaluation demonstrate that our approach can generate more coherent, specific, and logical future events.

preprint2021arXiv

Exploring and Evaluating Attributes, Values, and Structures for Entity Alignment

Entity alignment (EA) aims at building a unified Knowledge Graph (KG) of rich content by linking the equivalent entities from various KGs. GNN-based EA methods present promising performances by modeling the KG structure defined by relation triples. However, attribute triples can also provide crucial alignment signal but have not been well explored yet. In this paper, we propose to utilize an attributed value encoder and partition the KG into subgraphs to model the various types of attribute triples efficiently. Besides, the performances of current EA methods are overestimated because of the name-bias of existing EA datasets. To make an objective evaluation, we propose a hard experimental setting where we select equivalent entity pairs with very different names as the test set. Under both the regular and hard settings, our method achieves significant improvements ($5.10\%$ on average Hits@$1$ in DBP$15$k) over $12$ baselines in cross-lingual and monolingual datasets. Ablation studies on different subgraphs and a case study about attribute types further demonstrate the effectiveness of our method. Source code and data can be found at https://github.com/thunlp/explore-and-evaluate.

preprint2021arXiv

Flashot: A Snapshot of Flash Loan Attack on DeFi Ecosystem

Flash Loan attack can grab millions of dollars from decentralized vaults in one single transaction, drawing increasing attention from the Decentralized Finance (DeFi) players. It has also demonstrated an exciting opportunity that a huge wealth could be created by composing DeFi's building blocks and exploring the arbitrage change. However, a fundamental framework to study the field of DeFi has not yet reached a consensus and there's a lack of standard tools or languages to help better describe, design and improve the running processes of the infant DeFi systems, which naturally makes it harder to understand the basic principles behind the complexity of Flash Loan attacks. In this paper, we are the first to propose Flashot, a prototype that is able to transparently illustrate the precise asset flows intertwined with smart contracts in a standardized diagram for each Flash Loan event. Some use cases are shown and specifically, based on Flashot, we study a typical Pump and Arbitrage case and present in-depth economic explanations to the attacker's behaviors. Finally, we conclude the development trends of Flash Loan attacks and discuss the great impact on DeFi ecosystem brought by Flash Loan. We envision a brand new quantitative financial industry powered by highly efficient automatic risk and profit detection systems based on the blockchain.

preprint2020arXiv

Enumerating Maximal Induced Subgraphs

Given a graph $G$, the maximal induced subgraphs problem asks to enumerate all maximal induced subgraphs of $G$ that belong to a certain hereditary graph class. While its optimization version, known as the minimum vertex deletion problem in literature, has been intensively studied, enumeration algorithms are known for a few simple graph classes, e.g., independent sets, cliques, and forests, until very recently [Conte and Uno, STOC 2019]. There is also a connected variation of this problem, where one is concerned with only those induced subgraphs that are connected. We introduce two new approaches, which enable us to develop algorithms that solve both variations for a number of important graph classes. A general technique that has been proved very powerful in enumeration algorithms is to build a solution map, i.e., a multiple digraph on all the solutions of the problem, and the key of this approach is to make the solution map strongly connected, so that a simple traversal of the solution map solves the problem. We introduce retaliation-free paths to certificate strong connectedness of the solution map we build. Generalizing the idea of Cohen, Kimelfeld, and Sagiv [JCSS 2008], we introduce the $t$-restricted version, $t$ being a positive integer, of the maximal (connected) induced subgraphs problem, and show that it is equivalent to the original problem in terms of solvability in incremental polynomial time. Moreover, we give reductions between the two variations, so that it suffices to solve one of the variations for each class we study. Our work also leads to direct and simpler proofs of several important known results.

preprint2020arXiv

Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen

The curse of knowledge can impede communication between experts and laymen. We propose a new task of expertise style transfer and contribute a manually annotated dataset with the goal of alleviating such cognitive biases. Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions using simple words. This is a challenging task, unaddressed in previous work, as it requires the models to have expert intelligence in order to modify text with a deep understanding of domain knowledge and structures. We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification. The results demonstrate a significant gap between machine and human performance. We also discuss the challenges of automatic evaluation, to provide insights into future research directions. The dataset is publicly available at https://srhthu.github.io/expertise-style-transfer.

preprint2020arXiv

Polynomial Kernels for Paw-free Edge Modification Problems

Let $H$ be a fixed graph. Given a graph $G$ and an integer $k$, the $H$-free edge modification problem asks whether it is possible to modify at most $k$ edges in $G$ to make it $H$-free. Sandeep and Sivadasan (IPEC 2015) asks whether the paw-free completion problem and the paw-free edge deletion problem admit polynomial kernels. We answer both questions affirmatively by presenting, respectively, $O(k)$-vertex and $O(k^4)$-vertex kernels for them. This is part of an ongoing program that aims at understanding compressibility of $H$-free edge modification problems.

preprint2020arXiv

Reinforced Negative Sampling over Knowledge Graph for Recommendation

Properly handling missing data is a fundamental challenge in recommendation. Most present works perform negative sampling from unobserved data to supply the training of recommender models with negative signals. Nevertheless, existing negative sampling strategies, either static or adaptive ones, are insufficient to yield high-quality negative samples --- both informative to model training and reflective of user real needs. In this work, we hypothesize that item knowledge graph (KG), which provides rich relations among items and KG entities, could be useful to infer informative and factual negative samples. Towards this end, we develop a new negative sampling model, Knowledge Graph Policy Network (KGPolicy), which works as a reinforcement learning agent to explore high-quality negatives. Specifically, by conducting our designed exploration operations, it navigates from the target positive interaction, adaptively receives knowledge-aware negative signals, and ultimately yields a potential negative item to train the recommender. We tested on a matrix factorization (MF) model equipped with KGPolicy, and it achieves significant improvements over both state-of-the-art sampling methods like DNS and IRGAN, and KG-enhanced recommender models like KGAT. Further analyses from different angles provide insights of knowledge-aware sampling. We release the codes and datasets at https://github.com/xiangwang1223/kgpolicy.

preprint2020arXiv

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries, which are usually ineffective for complex queries that carry far more complex semantics. Recently, embedding-based paradigm has emerged as a popular approach. It aims to map the queries and videos into a shared embedding space where semantically-similar texts and videos are much closer to each other. Despite its simplicity, it forgoes the exploitation of the syntactic structure of text queries, making it suboptimal to model the complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree to structurally describe the text query. We then design a tree-augmented query encoder to derive structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and videos are mapped into a joint embedding space for matching and ranking. In this approach, we have a better understanding and modeling of the complex queries, thereby achieving a better video retrieval performance. Extensive experiments on large scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.

preprint2020arXiv

X-ray tomography investigation of cyclically sheared granular materials

We perform combined X-ray tomography and shear force measurements on a cyclically sheared granular system with highly transient behaviors, and obtain the evolution of microscopic structures and the macroscopic shear force during the shear cycle. We explain the macroscopic behaviors of the system based on microscopic processes, including the particle level structural rearrangement and frictional contact variation. Specifically, we show how contact friction can induce large structural fluctuations and cause significant shear dilatancy effect for granular materials, and we also construct an empirical constitutive relationship for the macroscopic shear force.