Researcher profile

Zihan Xu

Zihan Xu contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
7 works
0 followers
10 topics
4 close collaborators



Published work

7 published items

preprint · 2026 · arXiv

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Agentic reinforcement learning (RL) holds great promise for developing autonomous agents for complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, or LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeded. Processing such verbose context, which contains irrelevant, noisy history, challenges the verification protocols and leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that the SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B. Code is available at: https://github.com/TencentYoutuResearch/SmartSnap

preprint · 2025 · arXiv

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.

preprint · 2022 · arXiv

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios, this assumption can be difficult to hold: the text description, obtained by crawling the affiliated metadata of the image, often suffers from semantic mismatch and mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality, and aligns visual elements and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoders, respectively. When scaling to larger datasets, PyramidCLIP achieves state-of-the-art results on several downstream tasks. In particular, the results of PyramidCLIP-ResNet50 trained on 143M image-text pairs surpass those of CLIP using 400M image-text pairs on the ImageNet zero-shot classification task, significantly improving the data efficiency of CLIP.
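
The idea of softening the loss on negative pairs can be illustrated with a small sketch. It assumes the softening is done by label smoothing on a CLIP-style contrastive loss, which is one common way to relax the one-to-one assumption; the abstract does not spell out the exact formulation, so the function name `softened_contrastive_loss`, the parameter `alpha`, and the single image-to-text direction are all hypothetical choices made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of similarity logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def softened_contrastive_loss(logits, alpha=0.2):
    """Image-to-text contrastive loss with label-smoothed targets.

    logits[i][j] is the similarity between image i and text j.
    The paired text (j == i) gets target 1 - alpha; the remaining
    alpha is spread evenly over the unpaired (negative) texts.
    alpha = 0 recovers the usual hard one-to-one objective.
    This is an assumed formulation, not taken from the paper.
    """
    n = len(logits)
    on_target = 1.0 - alpha if n > 1 else 1.0
    off_target = alpha / (n - 1) if n > 1 else 0.0
    total = 0.0
    for i, row in enumerate(logits):
        log_probs = log_softmax(row)
        for j, lp in enumerate(log_probs):
            target = on_target if i == j else off_target
            total -= target * lp
    return total / n


if __name__ == "__main__":
    sims = [[5.0, 0.0], [0.0, 5.0]]
    print(softened_contrastive_loss(sims, alpha=0.0))
    print(softened_contrastive_loss(sims, alpha=0.2))
```

With `alpha > 0` the model is no longer pushed to assign zero probability to every unpaired text, which is the intuition behind not forcing it to separate negatives that are in fact compatible.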

preprint · 2022 · arXiv

Spectral and Energy Efficiency of DCO-OFDM in Visible Light Communication Systems with Finite-Alphabet Inputs

The bound on the information transmission rate of direct current biased optical orthogonal frequency division multiplexing (DCO-OFDM) for visible light communication (VLC) with finite-alphabet inputs is still unknown, and the corresponding spectral efficiency (SE) and energy efficiency (EE) remain open research problems. In this paper, we derive the exact achievable rate of the DCO-OFDM system with finite-alphabet inputs for the first time. Furthermore, we investigate SE maximization problems of the DCO-OFDM system subject to both electrical and optical power constraints. By exploiting the relationship between the mutual information and the minimum mean-squared error, we propose a multi-level mercury-water-filling power allocation scheme to achieve the maximum SE. Moreover, the EE maximization problems of the DCO-OFDM system are studied, and a Dinkelbach-type power allocation scheme is developed for the maximum EE. Numerical results verify the effectiveness of the proposed theories and power allocation schemes.
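
A Dinkelbach-type scheme turns a ratio maximization (rate divided by power) into a sequence of simpler subtractive problems. The sketch below shows the generic iteration on a deliberately simplified toy model: a single subcarrier with the Gaussian-input rate log2(1 + g·p) substituted for the paper's finite-alphabet mutual information, and a constant circuit power. The function names, parameters, and this rate model are assumptions for illustration, not the paper's scheme.

```python
import math


def rate(p, gain):
    """Toy spectral efficiency in bit/s/Hz (Gaussian-input stand-in)."""
    return math.log2(1.0 + gain * p)


def dinkelbach_ee(gain, p_circuit, p_max, tol=1e-9, max_iter=100):
    """Maximize rate(p) / (p_circuit + p) over 0 <= p <= p_max.

    Each iteration solves the inner problem
        max_p  rate(p) - lam * (p_circuit + p)
    in closed form, then updates lam to the ratio achieved at that p.
    Requires gain > 0, p_circuit > 0 and p_max > 0.
    Returns (power, energy efficiency).
    """
    lam = 0.0
    p = p_max
    for _ in range(max_iter):
        if lam > 0.0:
            # Stationary point of the concave inner objective, then clip.
            p = 1.0 / (lam * math.log(2.0)) - 1.0 / gain
            p = min(max(p, 0.0), p_max)
        else:
            # With lam == 0 the inner objective is just the rate.
            p = p_max
        gap = rate(p, gain) - lam * (p_circuit + p)
        lam = rate(p, gain) / (p_circuit + p)
        if abs(gap) < tol:
            break
    return p, lam


if __name__ == "__main__":
    power, ee = dinkelbach_ee(gain=2.0, p_circuit=1.0, p_max=10.0)
    print(power, ee)
```

The loop stops when the inner optimum value reaches zero, which is the standard optimality condition for Dinkelbach's method; in a multi-subcarrier setting the closed-form inner step would be replaced by a power allocation routine.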

preprint · 2020 · arXiv

CREDIT: Coarse-to-Fine Sequence Generation for Dialogue State Tracking

In dialogue systems, a dialogue state tracker aims to accurately find a compact representation of the current dialogue status, based on the entire dialogue history. While previous approaches often define dialogue states as a combination of separate triples (domain-slot-value), in this paper, we employ a structured state representation and cast dialogue state tracking as a sequence generation problem. Based on this new formulation, we propose a CoaRsE-to-fine DIalogue state Tracking (CREDIT) approach. Taking advantage of the structured state representation, which is a marked language sequence, we can further fine-tune the pre-trained model (by supervised learning) by optimizing natural language metrics with the policy gradient method. Like all generative state tracking methods, CREDIT does not rely on pre-defined dialogue ontology enumerating all possible slot values. Experiments demonstrate our tracker achieves encouraging joint goal accuracy for the five domains in MultiWOZ 2.0 and MultiWOZ 2.1 datasets.

preprint · 2012 · arXiv

Graphene Battery made of Low Cost Reduced Graphene Oxide

Graphene can collect energy from ambient heat and convert it to electricity, which makes it an ideal candidate for the fabrication of self-powered devices. However, this technology suffers from high cost, which limits its practical use. In this work, we demonstrated that the cost can be reduced by using low-cost reduced graphene oxide (RGO), graphite electrodes and low-cost glass substrates. The results showed that this technology can be of practical value for the "battery" industry.

preprint · 2012 · arXiv

Self-Charged Graphene Battery Harvests Electricity from Thermal Energy of the Environment

The energy of ionic thermal motion is universally present and is as high as 4 kJ·kg⁻¹·K⁻¹ in aqueous solution, where the thermal velocity of ions is on the order of hundreds of meters per second at room temperature [1,2]. Moreover, the thermal velocity of ions can be maintained by the external environment, which means it is unlimited. However, few studies have been reported on converting ionic thermal energy into electricity. Here we present a graphene device with an asymmetric electrode configuration to capture such ionic thermal energy and convert it into electricity. An output voltage of around 0.35 V was generated when the device was dipped into saturated CuCl2 solution, and this value lasted over twenty days. A positive correlation between the open-circuit voltage and the temperature, as well as the cation concentration, was observed. Furthermore, we demonstrated that this finding is of practical value by lighting a commercial light-emitting diode with six such graphene devices connected in series. This finding provides a new way to understand the behavior of graphene at the molecular scale and represents a huge breakthrough for research on self-powered technology. Moreover, the finding will benefit quite a few applications, such as artificial organs, clean renewable energy and portable electronics.
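
The series demonstration in this abstract can be checked with simple arithmetic. Assuming each device contributes roughly the 0.35 V reported for saturated CuCl2 and that the voltages add ideally in series (both are assumptions; the abstract does not state the stack voltage), six devices would supply about:

```latex
V_{\text{total}} \approx N \cdot V_{\text{device}} = 6 \times 0.35\,\text{V} = 2.1\,\text{V}
```

That figure is in the neighborhood of the forward voltage of a typical red LED, which would be consistent with the reported demonstration, although the abstract does not specify the LED type.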