Source author record

Weishi Wang

Weishi Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Artificial Intelligence, Multiagent Systems, quant-ph

Catalog footprint

What is connected

2 works

3 topics

4 close collaborators

Preprint, 2026, arXiv

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
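
Two of the ideas in this abstract can be illustrated with a small sketch: rewarding the critic by the solver's performance gain, and normalizing advantages separately per role. The abstract does not give formulas, so everything below is an assumption. The function names are hypothetical, the gain is taken as a simple score difference, and the normalization is the common GRPO-style (reward minus group mean, divided by group standard deviation). This is not the authors' implementation, and the distribution-calibration re-weighting ratio is not covered.

```python
from statistics import mean, pstdev


def critic_reward(score_before: float, score_after: float) -> float:
    """Assumed critic reward: the solver's score gain after seeing the critique."""
    return score_after - score_before


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(samples):
    """Normalize rewards separately for each role (assumed reading of 'role-wise').

    samples: list of (role, reward) pairs, e.g. ("solver", 1.0).
    Returns advantages in the same order as the input.
    """
    by_role = {}
    for idx, (role, reward) in enumerate(samples):
        by_role.setdefault(role, []).append((idx, reward))

    advantages = [0.0] * len(samples)
    for items in by_role.values():
        normalized = group_advantages([reward for _, reward in items])
        for (idx, _), adv in zip(items, normalized):
            advantages[idx] = adv
    return advantages


if __name__ == "__main__":
    # Solver fails without critique (0.0) and succeeds with it (1.0).
    gain = critic_reward(0.0, 1.0)
    rollouts = [("solver", 1.0), ("solver", 0.0), ("critic", gain), ("critic", 0.0)]
    print(role_wise_advantages(rollouts))
```

Normalizing per role keeps solver and critic rewards, which may sit on different scales, from dominating one another in a shared update; whether ICRL does exactly this is not stated in the abstract.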

Preprint, 2022, arXiv

Quantum Computing 2022

Quantum technology is full of figurative and literal noise obscuring its promise. In this overview, we attempt to provide a sober assessment of the promise of quantum technology, with a focus on computing. We provide a tour of quantum computing and quantum technology that aims to be comprehensible to scientists and engineers without becoming a popular account. The goal is neither a comprehensive review nor a superficial introduction, but rather a useful map for navigating the hype, the scientific literature, and upcoming press releases about quantum technology and quantum computing. We have aimed to cite the most recent topical reviews and key results, and to guide the reader away from fallacies and towards active discussions in the current quantum computing literature. The goal of this article is to be pedantic and introductory without compromising on the science.