Source author record

Zhichao Xu

Zhichao Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language, Information Retrieval, physics.atom-ph, physics.optics, Databases, Social and Information Networks

Catalog footprint

What is connected

8 works

6 topics

4 close collaborators

Preprint · 2026 · arXiv

An Empirical Study of Automating Agent Evaluation

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
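The Eval@1 idea described above (generated evaluation code must both execute and yield meaningful results on its first run) can be illustrated with a minimal sketch. The `FirstRun` record and its two flags are hypothetical names introduced here; how "meaningful" is judged is not specified in the abstract and is treated as an external input.

```python
from dataclasses import dataclass


@dataclass
class FirstRun:
    """Hypothetical record of running one generated evaluation exactly once."""
    executed: bool    # the generated evaluation code ran without error
    meaningful: bool  # the run produced meaningful results (judged externally)


def eval_at_1(runs):
    """Fraction of first runs that both executed and yielded meaningful results."""
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(1 for r in runs if r.executed and r.meaningful)
    return passed / len(runs)


runs = [
    FirstRun(True, True),
    FirstRun(True, False),   # ran, but the results were not meaningful
    FirstRun(False, False),  # failed to execute
    FirstRun(True, True),
]
print(eval_at_1(runs))  # 0.5
```

Requiring both conditions is what separates this from a plain execution success rate: the second run above executes but still does not count.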

Preprint · 2026 · arXiv

Context-aware Decoding Reduces Hallucination in Query-focused Summarization

Query-focused summarization (QFS) aims to provide a summary of a single document or multiple documents that satisfies the information needs of a given query. It is useful for various real-world applications, such as abstractive snippet generation and, more recently, retrieval-augmented generation (RAG). A prototypical QFS pipeline consists of a retriever (sparse or dense retrieval) and a generator (usually a large language model). However, applying large language models (LLMs) can lead to hallucinations, especially when the evidence contradicts the prior belief of the LLM. There has been growing interest in developing new decoding methods to improve generation quality and reduce hallucination. In this work, we conduct a large-scale reproducibility study on one recently proposed decoding method, Context-aware Decoding (CAD). In addition to replicating CAD's experiments on news summarization datasets, we include experiments on QFS datasets, and conduct a more rigorous analysis of computational complexity and hyperparameter sensitivity. Experiments with eight different language models show that CAD improves QFS quality by (1) reducing factuality errors/hallucinations while (2) mostly retaining the match of lexical patterns, measured by ROUGE scores, at the cost of increased inference-time FLOPs and reduced decoding speed. The code implementation, based on the Hugging Face library, is available at https://github.com/zhichaoxu-shufe/context-aware-decoding-qfs.
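The abstract does not spell out the CAD adjustment. A minimal sketch, assuming the commonly described formulation in which next-token logits computed with the retrieved context are contrasted against logits computed without it, might look like the following. The function names and the toy logits are illustrative only and are not taken from the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cad_distribution(logits_with_context, logits_without_context, alpha=0.5):
    """Assumed CAD form: softmax((1 + alpha) * l_ctx - alpha * l_no_ctx).

    alpha = 0 recovers ordinary decoding on the context-conditioned logits.
    """
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("both logit vectors must cover the same vocabulary")
    adjusted = [
        (1 + alpha) * lc - alpha * lu
        for lc, lu in zip(logits_with_context, logits_without_context)
    ]
    return softmax(adjusted)


# Toy 3-token vocabulary: the context favors token 0, the model's prior favors token 2.
with_ctx = [2.0, 1.0, 0.0]
without_ctx = [0.0, 1.0, 2.0]
print(cad_distribution(with_ctx, without_ctx, alpha=0.0))
print(cad_distribution(with_ctx, without_ctx, alpha=1.0))
```

Under this formulation each decoding step needs two sets of logits (with and without the context), which is consistent with the increased inference-time FLOPs and reduced decoding speed reported above.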

Preprint · 2026 · arXiv

LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum

While dense retrieval models have become the standard for state-of-the-art information retrieval, their deployment is often constrained by high memory requirements and reliance on GPU accelerators for vector similarity search. Learned sparse retrieval offers a compelling alternative by enabling efficient search via inverted indices, yet it has historically received less attention than dense approaches. In this report, we introduce LACONIC, a family of learned sparse retrievers based on the Llama-3 architecture (1B, 3B, and 8B). We propose a streamlined two-phase training curriculum consisting of (1) weakly supervised pre-finetuning to adapt causal LLMs for bidirectional contextualization and (2) high-signal finetuning using curated hard negatives. Our results demonstrate that LACONIC effectively bridges the performance gap with dense models: the 8B variant achieves a state-of-the-art 60.2 nDCG on the MTEB Retrieval benchmark, ranking 15th on the leaderboard as of January 1, 2026, while utilizing 71% less index memory than an equivalent dense model. By delivering high retrieval effectiveness on commodity CPU hardware with a fraction of the compute budget required by competing models, LACONIC provides a scalable and efficient solution for real-world search applications.
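To illustrate why sparse representations allow search via inverted indices, here is a generic sketch of inverted-index scoring over term-weight vectors. It is not LACONIC's implementation: the term weights below are made up by hand, whereas a learned sparse retriever would produce them with a model.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for term, weight in weights.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_weights, k=10):
    """Score documents by the dot product of sparse query and document vectors.

    Only posting lists of terms present in the query are visited.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical term weights for two documents and one query.
docs = {
    "d1": {"sparse": 1.5, "retrieval": 1.0},
    "d2": {"dense": 2.0, "retrieval": 0.5},
}
index = build_inverted_index(docs)
print(search(index, {"sparse": 1.0, "retrieval": 1.0}))
# [('d1', 2.5), ('d2', 0.5)]
```

Because scoring touches only the posting lists of query terms, this style of search runs on CPU without a vector similarity index.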

Preprint · 2026 · arXiv

RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers

The Transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV) and information retrieval (IR). The Transformer's core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure, Mamba, which is based on state space models, has achieved transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task, document ranking. A reranker model takes a query and a document as input, and predicts a scalar relevance score. This task demands the language model's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models with the same training recipe, but (2) they also have lower training throughput than efficient transformer implementations such as Flash Attention. We hope this study can serve as a starting point to explore Mamba models in other classical IR tasks. Our code implementation is made public at https://github.com/zhichaoxu-shufe/RankMamba to facilitate reproducibility. Refer to \cite{xu-etal-2025-state} for more comprehensive experiments and results, including passage ranking.

Preprint · 2020 · arXiv

E-commerce Recommendation with Weighted Expected Utility

Unlike shopping at retail stores, consumers on e-commerce platforms usually cannot touch or try products before purchasing, which means that they have to make decisions while uncertain about the outcome (e.g., satisfaction level) of purchasing a product. To study people's preferences, economics researchers have proposed the hypothesis of Expected Utility (EU), which models the subjective value associated with an individual's choice as the statistical expectation of that individual's valuations of the outcomes of this choice. Despite its success in studies of game theory and decision theory, the effectiveness of EU is mostly unknown in e-commerce recommendation systems. Previous research on e-commerce recommendation interprets the utility of purchase decisions either as a function of the consumed quantity of the product or as the gain of sellers/buyers in the monetary sense. As most consumers purchase just one unit of a product at a time and most alternatives have similar prices, such modeling of purchase utility is likely to be inaccurate in practice. In this paper, we interpret purchase utility as the satisfaction level a consumer gets from a product and propose a recommendation framework using EU to model consumers' behavioral patterns. We assume that a consumer estimates the expected utilities of all the alternatives and chooses the product with maximum expected utility for each purchase. To deal with the potential psychological biases of each consumer, we introduce the use of a Probability Weight Function (PWF) and design our algorithm based on Weighted Expected Utility (WEU). An empirical study on real-world e-commerce datasets shows that our proposed ranking-based recommendation framework achieves statistically significant improvement over both classical Collaborative Filtering/Latent Factor Models and state-of-the-art deep models in top-K recommendation.
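The difference between EU and WEU can be shown with a short worked sketch. The abstract does not state which Probability Weight Function the paper uses, so the Tversky-Kahneman form below, the `gamma` value, and the two example items are assumptions chosen purely for illustration.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting (assumed form, not the paper's):
    w(p) = p^g / (p^g + (1 - p)^g)^(1 / g). With g = 1 this reduces to w(p) = p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    numerator = p ** gamma
    denominator = (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)
    return numerator / denominator


def expected_utility(probs, utils):
    """EU: sum of p_i * u_i over outcomes."""
    return sum(p * u for p, u in zip(probs, utils))


def weighted_expected_utility(probs, utils, gamma=0.61):
    """WEU: sum of w(p_i) * u_i, with psychologically distorted probabilities."""
    return sum(tk_weight(p, gamma) * u for p, u in zip(probs, utils))


# Hypothetical items: outcome probabilities over satisfaction levels 1..5.
satisfaction = [1, 2, 3, 4, 5]
items = {
    "safe_item": [0.0, 0.1, 0.8, 0.1, 0.0],
    "risky_item": [0.3, 0.0, 0.1, 0.0, 0.6],
}
ranking = sorted(
    items,
    key=lambda name: weighted_expected_utility(items[name], satisfaction),
    reverse=True,
)
print(ranking)
```

Ranking candidates by this score mirrors the assumption above that a consumer picks the alternative with maximum (weighted) expected utility; fitting the weighting parameters per consumer is outside the scope of this sketch.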

Preprint · 2014 · arXiv

Dual-wavelength active optical clock

We experimentally realize the dual-wavelength active optical clock for the first time. With the Cs cell temperature kept between 118 $^{\circ}$C and 144 $^{\circ}$C, both the 1359 nm and the 1470 nm stimulated emission outputs of the Cs four-level active optical clock are detected. The 1470 nm output linewidth of each experimental setup of the Cs four-level active optical clock is measured to be 590 Hz with the main cavity length unstabilized. To stabilize the cavity length of the active optical clock, an experimental scheme for a 633 nm and 1359 nm good-bad cavity dual-wavelength active optical clock is proposed, in which the 633 nm and 1359 nm stimulated emissions operate in the good-cavity and bad-cavity regimes, respectively. The cavity length is stabilized by locking the 633 nm output frequency to a super-cavity with the Pound-Drever-Hall (PDH) technique. The frequency stability of the 1359 nm bad-cavity stimulated emission output is then expected to be better than that of the 633 nm PDH system by at least one order of magnitude, owing to the suppressed cavity pulling effect of the active optical clock, and the quantum-limited linewidth of the 1359 nm output is estimated to be 77.6 mHz.

Preprint · 2014 · arXiv

Lasing and suppressed cavity-pulling effect of Cesium active optical clock

We experimentally demonstrate the collective emission behavior and suppressed cavity-pulling effect of a four-level active optical clock with Cesium atoms. Thermal Cesium atoms in a glass cell, velocity-selectively pumped with a 455.5 nm laser operating on the 6S$_{1/2}$ to 7P$_{3/2}$ transition, are used as the lasing medium. Population-inverted Cesium atoms between the 7S$_{1/2}$ and 6P$_{3/2}$ levels are weakly optically coupled by a pair of cavity mirrors working in the deep bad-cavity regime with a finesse of 4.3, and the ratio between cavity bandwidth and gain bandwidth is approximately 45. With increased 455.5 nm pumping laser intensity, the output power of the Cesium active optical clock at 1469.9 nm, from the 7S$_{1/2}$ level to the 6P$_{3/2}$ level, shows a threshold and reaches 13 $\mu$W. An active optical clock would dramatically improve optical clock stability, since the lasing frequency does not follow the cavity length variation exactly but only in the form of a suppressed cavity pulling effect. In this letter the cavity pulling effect is measured using a Fabry-Perot interferometer (FPI) to be reduced by a factor of 38.2 and 41.4 as the detuning between the 1469.9 nm cavity length of the Cs active optical clock and the Cs 1469.9 nm transition is set to 140.8 MHz and 281.6 MHz, respectively. The mechanism demonstrated here is of great significance for new-generation optical clocks and can be applied to improve the stability of the best optical clocks by at least two orders of magnitude.
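The numbers quoted above can be related through the textbook cavity-pulling relation, in which the lasing frequency follows the cavity frequency with coefficient 1 / (1 + a), where a is the ratio of cavity bandwidth to gain bandwidth. This relation is not stated in the abstract and is used here as an assumption, as is the reading of "reduced by a factor of N" as a lasing frequency shift of detuning / N.

```python
def pulling_coefficient(cavity_to_gain_ratio):
    """Assumed textbook relation: d(lasing freq) / d(cavity freq) = 1 / (1 + a)."""
    if cavity_to_gain_ratio < 0:
        raise ValueError("bandwidth ratio must be non-negative")
    return 1.0 / (1.0 + cavity_to_gain_ratio)


def lasing_shift_mhz(cavity_detuning_mhz, suppression_factor):
    """Lasing frequency shift implied by a cavity detuning and a suppression factor."""
    return cavity_detuning_mhz / suppression_factor


# Bandwidth ratio of about 45 quoted above gives an expected suppression near 46.
expected_suppression = 1.0 / pulling_coefficient(45.0)
print(round(expected_suppression, 1))

# Measured suppression factors and detunings quoted above.
for detuning, factor in [(140.8, 38.2), (281.6, 41.4)]:
    print(detuning, factor, round(lasing_shift_mhz(detuning, factor), 3))
```

Under these assumptions the simple estimate (about 46) is of the same order as the measured factors of 38.2 and 41.4; the sketch makes no claim about the source of the remaining difference.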

Preprint · 2014 · arXiv

Towards Efficient Path Query on Social Network with Hybrid RDF Management

The scalability and flexibility of the Resource Description Framework (RDF) model make it ideally suited for representing online social networks (OSN). One basic operation in OSN is to find chains of relations, such as k-hop friends. Property path queries in SPARQL can express this type of operation, but their implementation suffers from performance problems given the ever-growing data size and complexity of OSN. In this paper, we present a main-memory/disk-based hybrid RDF data management framework for efficient property path queries. In this hybrid framework, we realize an efficient in-memory algebra operator for property path queries using graph traversal, and estimate the cost of this operator so that it can cooperate with existing cost-based optimization. Experiments on benchmark and real datasets demonstrate that our approach achieves a good tradeoff between data load expense and online query performance.
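As an illustration of answering a k-hop friends query by in-memory graph traversal, here is a minimal breadth-first sketch over RDF-style triples. It is not the paper's algebra operator and ignores cost estimation; the predicate name `knows` and the sample triples are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(triples, predicate):
    """Keep only edges with the given predicate as subject -> set(objects)."""
    adjacency = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            adjacency[subject].add(obj)
    return adjacency


def k_hop(adjacency, start, k):
    """Nodes reachable from `start` in at least 1 and at most k hops (BFS)."""
    seen = {start}
    reached = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "likes", "post1"),  # different predicate, ignored by the traversal
]
adjacency = build_adjacency(triples, "knows")
print(sorted(k_hop(adjacency, "alice", 2)))  # ['bob', 'carol']
```

Holding the adjacency lists in memory is what makes each hop a cheap lookup; the tradeoff mentioned above is the cost of loading that structure in the first place.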
