Source author record

Zhipeng Wang

Zhipeng Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Artificial Intelligence; Computation and Language; Machine Learning; Computer Science and Game Theory; Human-Computer Interaction; Multimedia; Neural and Evolutionary Computing; quant-ph; Sound

Catalog footprint

What is connected

8 works

9 topics

4 close collaborators


preprint · 2026 · arXiv

A Control Theoretic Approach to Decentralized AI Economy Stabilization via Dynamic Buyback-and-Burn Mechanisms

The democratization of artificial intelligence through decentralized networks represents a paradigm shift in computational provisioning, yet the long-term viability of these ecosystems is critically endangered by the extreme volatility of their native economic layers. Current tokenomic models, which predominantly rely on static or threshold-based buyback heuristics, are ill-equipped to handle complex system dynamics and often function pro-cyclically, exacerbating instability during market downturns. To bridge this gap, we propose the Dynamic-Control Buyback Mechanism (DCBM), a formalized control-theoretic framework that utilizes a Proportional-Integral-Derivative (PID) controller with strict solvency constraints to regulate the token economy as a dynamical system. Extensive agent-based simulations utilizing Jump-Diffusion processes demonstrate that DCBM fundamentally outperforms static baselines, reducing token price volatility by approximately 66% and lowering operator churn from 19.5% to 8.1% in high-volatility regimes. These findings establish that converting tokenomics from static rules into continuous, structurally constrained control loops is a necessary condition for secure and sustainable decentralized intelligence networks.
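The abstract describes the buyback rule only at a high level, so the following is a minimal sketch of the general idea rather than the paper's DCBM: a textbook PID controller whose output is clamped so a buyback can never be negative or exceed the treasury. The class name `PIDBuyback`, the choice of price error as the controlled signal, the gains, and the treasury clamp as the "solvency constraint" are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PIDBuyback:
    """Hypothetical PID buyback controller (illustrative, not the paper's DCBM)."""

    kp: float  # proportional gain (assumed value, not from the paper)
    ki: float  # integral gain (assumed value, not from the paper)
    kd: float  # derivative gain (assumed value, not from the paper)
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target_price: float, price: float,
             treasury: float, dt: float = 1.0) -> float:
        """Return the buyback spend for one time step."""
        # Positive error means the price is below target, so buying is warranted.
        error = target_price - price
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error

        raw = self.kp * error + self.ki * self.integral + self.kd * derivative

        # Assumed solvency constraint: never spend a negative amount,
        # and never spend more than the treasury currently holds.
        return min(max(raw, 0.0), treasury)


if __name__ == "__main__":
    controller = PIDBuyback(kp=1.0, ki=0.1, kd=0.05)
    treasury = 50.0
    for price in [9.0, 8.5, 9.5, 10.2]:
        spend = controller.step(target_price=10.0, price=price, treasury=treasury)
        treasury -= spend
        print(f"price={price:.2f} spend={spend:.3f} treasury={treasury:.3f}")
```

Unlike a threshold rule, the spend here scales with the size, persistence, and rate of change of the deviation, and the clamp keeps the controller from spending funds it does not have.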

preprint · 2026 · arXiv

AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods. The model and dataset are open-sourced at https://github.com/LLM-VLM-GSL/AHA.

preprint · 2026 · arXiv

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

We present a four-stage post-training workflow for LLM reasoning that allocates scarce labeled training data more effectively than standard recipes. The stages are: (1) sparse-reward RL on a larger teacher; (2a) forward-KL warmup on teacher rollouts; (2b) on-policy distillation under student rollouts; (3) optional sparse-reward RL on the deployment student using any held-out labeled data. On verifiable math with a Qwen3-1.7B deployment student, the workflow reaches 79.3% MATH and 25.2% AIME 2024 (avg@16), versus 75.9% and 19.8% for direct GRPO on the same student. We justify the workflow through a reward-density principle: each gradient step of on-policy distillation is a local trust-region update under a dense teacher-induced implicit reward, informative only when the teacher is itself reward-shaped (condition C1) and lies within a small KL of the student (condition C2). Stages 1 and 2a are the explicit devices that enforce C1 and C2. A single-component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs 7.8 MATH points, removing the forward-KL warmup costs 1.7 points, and removing the on-policy distillation stage costs 3.3 points. The recipe replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher.

preprint · 2026 · arXiv

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Distilling the capabilities of a large reasoning model (LRM) into a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially more memory and FLOPs. Specifically, training on only the first 50% of tokens of every training sequence can retain, on average, about 91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. Code is available at https://github.com/weiruichen01/distilling-the-essence.
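As a rough illustration of the truncation step described above, the sketch below keeps only the leading fraction of each tokenized training sequence. The function name `truncate_sequence`, the rounding rule, and the toy token IDs are assumptions made for this example; the authors' actual protocol is in their repository and may differ.

```python
from typing import List


def truncate_sequence(token_ids: List[int], keep_fraction: float = 0.5) -> List[int]:
    """Keep the first `keep_fraction` of a tokenized sequence.

    Hypothetical helper: rounds down, but keeps at least one token
    for any non-empty sequence (an assumption, not from the paper).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


if __name__ == "__main__":
    # Toy token IDs standing in for prompt + chain-of-thought + answer.
    batch = [list(range(10)), list(range(100, 107))]
    truncated = [truncate_sequence(seq) for seq in batch]
    for original, short in zip(batch, truncated):
        print(len(original), "->", len(short))
```

Because the cut is applied from the front, the prompt and the early part of the chain of thought are kept, while the later reasoning and the final answer are the first to be dropped.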

preprint · 2026 · arXiv

EZBlender: Efficient 3D Editing with Plan-and-ReAct Agent

As a cornerstone of the modern digital economy, 3D modeling and rendering demand substantial resources and manual effort when scene editing is performed in the traditional manner. Despite recent progress in VLM-based agents for 3D editing, the fundamental trade-off between editing precision and agent responsiveness remains unresolved. To overcome these limitations, we present EZBlender, a Blender agent with a hybrid framework that combines planning-based task decomposition and reactive local autonomy for efficient human-AI collaboration and semantically faithful 3D editing. Specifically, this previously unexplored Plan-and-ReAct design not only preserves editing quality but also significantly reduces latency and computational cost. To further validate the efficiency and effectiveness of the proposed edge-autonomy architecture, we construct a dedicated multi-task benchmark of a kind that has not been systematically investigated in prior research. In addition, we provide a comprehensive analysis of language model preference, system responsiveness, and economic efficiency.

preprint · 2025 · arXiv

Continuous-variable quantum key distribution network based on entangled states of optical frequency combs

Continuous-variable quantum key distribution (CVQKD) features a high key rate and compatibility with classical optical communication. Developing expandable and efficient CVQKD networks will promote the deployment of large-scale quantum communication networks in the future. This paper proposes a CVQKD network based on the entangled states of an optical frequency comb. The scheme generates Einstein-Podolsky-Rosen entangled states with a frequency comb structure using a type-II optical parametric oscillator. Combined with an entanglement-in-the-middle scheme, this forms a fully connected CVQKD network capable of distributing secret keys simultaneously. We analyze the security of the system in the asymptotic case. Simulation results show that, provided system loss and noise are well controlled, the proposed scheme is feasible for deploying a short-distance fully connected CVQKD network. Loss will be the main factor limiting the system's performance. The proposed scheme provides new ideas for a multi-user fully connected CVQKD network.

preprint · 2022 · arXiv

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

Fine-tuning deep neural networks pre-trained on large-scale datasets is one of the most practical transfer learning paradigms given a limited quantity of training samples. To obtain better generalization, using the starting point as the reference (SPAR), either through weights or features, has been successfully applied to transfer learning as a regularizer. However, due to the domain discrepancy between the source and target tasks, there is an obvious risk of negative transfer when knowledge is preserved in a straightforward manner. In this paper, we propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED), where the knowledge relevant to the target task is disentangled from the original source model and used as a regularizer while fine-tuning the target model. Specifically, we design two alternative methods for the representation disentanglement: maximizing the Maximum Mean Discrepancy (Max-MMD) and minimizing the mutual information (Min-MI). Experiments on various real-world datasets show that our method consistently improves on standard fine-tuning by more than 2% on average. TRED also outperforms related state-of-the-art transfer learning regularizers such as L2-SP, AT, DELTA, and BSS.
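For readers unfamiliar with the Maximum Mean Discrepancy mentioned above, here is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian (RBF) kernel. It shows only the quantity itself, not the paper's Max-MMD disentanglement procedure; the function names, the fixed bandwidth, and the toy feature vectors are assumptions for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of MMD^2 between two samples of feature vectors."""

    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Toy 2-D "features"; real use would pass network activations.
    source_like = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
    target_like = [[2.0, 2.1], [1.9, 2.0], [2.1, 2.2]]
    print("same sample:", mmd_squared(source_like, source_like))
    print("different samples:", mmd_squared(source_like, target_like))
```

The estimate is near zero when the two samples come from the same distribution and grows as they separate, which is why it can serve as an objective for pushing two sets of representations apart or together.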

preprint · 2015 · arXiv

Reinforcement Learning applied to Single Neuron

This paper extends reinforcement learning ideas to multi-agent systems, which are far more complicated than the previously studied single-agent systems. We study two different multi-agent systems. One is a fully connected neural network consisting of multiple single neurons. The other is a simplified mechanical arm system controlled by multiple neurons. We suppose that each neuron acts like an agent and can perform Gibbs sampling of the posterior probability of stimulus features. The policy is optimized so that the cumulative global reward is maximized. The algorithm for the second system is based on the same idea, but we incorporate the physics model into the constraints. The simulation results show that our algorithm converges well for the first system. For the second system, it does not converge well within a reasonable simulation time. In summary, this is an initial effort to study reinforcement learning for multi-agent systems. Computational complexity remains an issue, and a significant amount of work is still needed to better understand the problem.
