Source author record

Tianyi Lyu

Tianyi Lyu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, cond-mat.mtrl-sci, physics.app-ph

Catalog footprint

What is connected

3 works

3 topics

4 close collaborators

Preprint · 2026 · arXiv

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through a principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements, respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
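
The group-level matching step described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names (`group_target`, `group_policy`, `forward_kl`), the assumption of non-negative rewards, the uniform fallback, and the choice to renormalize the policy with a softmax over the group's sequence log-probabilities are all assumptions made here for illustration.

```python
import math

def group_target(rewards):
    """Target distribution over one sampled group, proportional to reward.

    Assumption: rewards are non-negative. If they sum to zero we fall
    back to a uniform target (a choice made for this sketch only).
    """
    total = sum(rewards)
    if total <= 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]

def group_policy(logprobs):
    """Policy probabilities renormalized over the group (stable softmax)."""
    m = max(logprobs)
    exps = [math.exp(lp - m) for lp in logprobs]
    z = sum(exps)
    return [e / z for e in exps]

def forward_kl(target, policy, eps=1e-12):
    """KL(target || policy); terms with zero target mass contribute nothing."""
    return sum(
        q * math.log(q / max(p, eps))
        for q, p in zip(target, policy)
        if q > 0
    )

# Hypothetical group of four sampled trajectories.
rewards = [1.0, 1.0, 0.0, 2.0]        # task rewards (made-up values)
logprobs = [-3.0, -9.0, -4.0, -5.0]   # sequence log-probs under the policy

target = group_target(rewards)
policy = group_policy(logprobs)
loss = forward_kl(target, policy)
```

Because the forward KL weights each log-ratio by the target mass, every rewarded trajectory in the group contributes to the loss, including the one the policy currently rates as unlikely. That is the mode-covering behavior the abstract contrasts with reverse KL.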

Preprint · 2026 · arXiv

Radiation-induced Instability of Organic-Inorganic Halide Perovskite Single Crystals

Organic-inorganic halide perovskites (OIHPs) are promising optoelectronic materials, but their instability in radiation environments restricts their durability and practical applications. Here we employ electron and synchrotron X-ray beams, individually, to investigate the radiation-induced instability of two types of OIHP single crystals (FAPbBr3 and MAPbBr3). Under the electron beam, we observe that 3-point star-style cracks grow on the surface of FAPbBr3, and bricklayer-style cracks form on the surface of MAPbBr3. Under the X-ray beam, a new composition without organic components appears in both FAPbBr3 and MAPbBr3. Such cracking and composition changes are attributed to the volatilization of organic components. We propose a volume-strain-based mechanism, in which the energy conversion results from the organic cation loss. Using nanoindentation, we reveal that beam radiation reduces the Young's modulus and increases the hardness of both OIHPs. This study provides valuable insights into the structural and mechanical stabilities of OIHP single crystals in radiation environments.

Preprint · 2026 · arXiv

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings, where feedback can include error messages, page changes, observations, or reference trajectories, remain underexplored. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine the update direction, while environment feedback adjusts the placement and magnitude of updates, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
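
The division of labor described in the abstract (the task reward sets the sign, environment feedback sets where and how strongly to update) can be shown with a toy sketch. The abstract does not give SERL's actual formulation, so everything here is an assumption for illustration: the function name `step_weights`, the per-step `feedback_scores`, and the normalization scheme.

```python
def step_weights(task_advantage, feedback_scores):
    """Spread one trajectory-level signal across steps.

    Hypothetical scheme, not taken from the paper:
      - the sign of `task_advantage` fixes the update direction,
      - non-negative `feedback_scores` (one per step) decide how much
        of the update magnitude each step receives.
    """
    if task_advantage > 0:
        direction = 1.0
    elif task_advantage < 0:
        direction = -1.0
    else:
        direction = 0.0

    total = sum(feedback_scores)
    n = len(feedback_scores)
    if total <= 0:
        # No informative feedback: spread the signal evenly.
        shares = [1.0 / n] * n
    else:
        shares = [s / total for s in feedback_scores]

    magnitude = abs(task_advantage)
    return [direction * magnitude * share for share in shares]

# Hypothetical three-step episode in which step 2 drew the strongest feedback.
success = step_weights(1.0, [0.0, 3.0, 1.0])
failure = step_weights(-2.0, [0.0, 3.0, 1.0])
```

In this toy version the feedback never flips the sign of an update. It only concentrates the trajectory-level signal on the steps the environment flagged, which mirrors the abstract's emphasis on critical actions.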

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint