Researcher profile

Linyang Li

Linyang Li contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
10 works
0 followers
5 topics
4 close collaborators

Published work

10 published items

preprint | 2026 | arXiv

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements, respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
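The abstract describes building a group-level target distribution proportional to rewards and aligning the policy to it. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the function names, the uniform fallback for an all-zero-reward group, and the choice to renormalize sequence log-probabilities within the group are all hypothetical, and rewards are assumed non-negative.

```python
import math

def group_target(rewards):
    """Target distribution over one sampled group, proportional to reward."""
    total = sum(rewards)
    if total <= 0:
        # Assumption: fall back to uniform when no trajectory earns reward.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]

def group_policy(logps):
    """Renormalize trajectory log-probabilities within the group (softmax)."""
    m = max(logps)  # subtract the max for numerical stability
    exps = [math.exp(lp - m) for lp in logps]
    z = sum(exps)
    return [e / z for e in exps]

def forward_kl(target, policy, eps=1e-12):
    """KL(target || policy); zero-mass target entries contribute nothing."""
    return sum(
        t * (math.log(t) - math.log(p + eps))
        for t, p in zip(target, policy)
        if t > 0
    )

# Three sampled trajectories with rewards 1, 1 and 2.
q = group_target([1.0, 1.0, 2.0])
# A collapsed policy that puts almost all mass on the first trajectory.
p_collapsed = group_policy([0.0, -10.0, -10.0])
loss = forward_kl(q, p_collapsed)
```

Because the forward KL weights each term by the target mass, a policy that ignores any rewarded trajectory is penalized heavily, which is the mode-covering behavior the abstract refers to.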

preprint | 2026 | arXiv

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
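The contrast between binary and quality-aware rewards can be illustrated with a small sketch. The abstract does not define Quality Ratio, so the formula below (optimal objective divided by achieved objective for minimization, clipped to [0, 1]) is an assumption, and both function names are hypothetical.

```python
def binary_reward(feasible: bool) -> float:
    """Correctness-only signal: 1 for any feasible solution, else 0."""
    return 1.0 if feasible else 0.0

def quality_reward(feasible: bool, value: float, optimal: float,
                   minimize: bool = True) -> float:
    """Assumed quality-aware signal: how close a feasible solution is to optimal."""
    if not feasible:
        return 0.0
    if value <= 0 or optimal <= 0:
        raise ValueError("this sketch assumes positive objective values")
    # Assumption: ratio of optimal to achieved objective (inverted for maximization).
    ratio = optimal / value if minimize else value / optimal
    return max(0.0, min(1.0, ratio))

# A feasible tour of length 120 when the optimum is 100.
r_binary = binary_reward(True)
r_quality = quality_reward(True, value=120.0, optimal=100.0)
```

Under the binary reward, the 120-length tour and the optimal tour are indistinguishable; under the assumed quality reward, the policy still has a gradient toward better feasible solutions.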

preprint | 2026 | arXiv

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
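One way to read "task reward determines update direction, environment feedback adjusts placement and magnitude" is as a signed per-step weighting. The sketch below is only an illustration of that reading: the function name, the non-negative per-step feedback scores, and the normalization are assumptions and are not taken from the paper.

```python
def step_weights(task_reward, feedback_scores):
    """Signed per-step weights: sign from the task outcome, size from feedback.

    Assumes feedback_scores are non-negative relevance scores, one per action.
    """
    direction = 1.0 if task_reward > 0 else -1.0
    total = sum(feedback_scores)
    n = len(feedback_scores)
    if total == 0:
        # Assumption: with no informative feedback, spread credit uniformly.
        return [direction / n] * n
    return [direction * s / total for s in feedback_scores]

# A successful three-step episode where the last action drew the strongest feedback.
weights_success = step_weights(1.0, [0.0, 1.0, 3.0])
# A failed two-step episode with equally informative feedback.
weights_failure = step_weights(0.0, [1.0, 1.0])
```

In this reading, the trajectory-level outcome never changes which steps matter, only whether they are reinforced or discouraged.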

preprint | 2024 | arXiv

Rectangular carbon nitrides C4N monolayers with a zigzag buckled structure: Quasi-one-dimensional Dirac nodal lines and topological flat edge states

Due to the flexibility of C and N atoms in forming different types of bonds, the prediction of new two-dimensional (2D) carbon nitrides is a hot topic in the field of carbon-based materials. Using first-principles calculations, we propose two C4N monolayers with a zigzag buckled (ZB) structure. The ZB C4N monolayers contain raised-C (raised-N) atoms with sp3 hybridization, different from the traditional 2D graphene-like carbon nitride materials with sp2 hybridization. Interestingly, the band structures of the ZB C4N monolayers exhibit a quasi-one-dimensional (quasi-1D) Dirac nodal line that results from the corresponding quasi-1D structure of the zigzag carbon chains, which is essentially different from the more common ring-shaped nodal line. The quasi-1D Dirac nodal line exhibits the following features: (i) gapless Dirac points, (ii) varying Fermi velocity, and (iii) slightly curved band along the high-symmetry path. All these features are successfully explained by our proposed tight-binding model that includes interactions up to the third nearest-neighbor. The Fermi velocity of the 2D system can reach $10^{5}$ m/s, which is promising for applications in high-speed electronic devices. The topological flat band structure determined by the Zak phase and band inversion of the corresponding 1D system is edge-dependent, which corresponds to the Su-Schrieffer-Heeger model and gives rise to rich physical phenomena.
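The link between the Zak phase and the Su-Schrieffer-Heeger (SSH) model can be made concrete with the textbook two-band SSH chain, whose off-diagonal Bloch element is h(k) = v + w e^{ik}. The sketch below is that textbook model, not the paper's third-nearest-neighbor tight-binding model; the hopping names `v` and `w` and the function names are assumptions. It counts how often h(k) winds around the origin, and the Zak phase is taken as pi times that winding, modulo 2 pi.

```python
import cmath
import math

def ssh_winding(v, w, n=2000):
    """Winding number of h(k) = v + w*exp(ik) around the origin over one Brillouin zone."""
    total = 0.0
    prev = complex(v + w)  # h(k = 0)
    for j in range(1, n + 1):
        k = 2 * math.pi * j / n
        cur = v + w * cmath.exp(1j * k)
        # Accumulate the small phase change between neighboring k points.
        total += cmath.phase(cur / prev)
        prev = cur
    return round(total / (2 * math.pi))

def zak_phase(v, w):
    """Zak phase (0 or pi) of the SSH chain, from the winding number."""
    return math.pi * (ssh_winding(v, w) % 2)

topological = ssh_winding(0.5, 1.0)  # intercell hopping dominates
trivial = ssh_winding(1.0, 0.5)      # intracell hopping dominates
```

In the SSH picture, a Zak phase of pi signals edge states for the corresponding termination, which matches the edge-dependent flat bands described in the abstract.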

preprint | 2022 | arXiv

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

Whole word masking (WWM), which masks all subwords corresponding to a word at once, makes a better English BERT model. For the Chinese language, however, there is no subword because each token is an atomic character. The meaning of a word in Chinese is different in that a word is a compositional unit consisting of multiple characters. This difference motivates us to investigate whether WWM leads to better context understanding ability for Chinese BERT. To achieve this, we introduce two probing tasks related to grammatical error correction and ask pretrained models to revise or insert tokens in a masked language modeling manner. We construct a dataset including labels for 19,075 tokens in 10,448 sentences. We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively. Our major findings are as follows: First, when one character needs to be inserted or replaced, the model trained with CLM performs the best. Second, when more than one character needs to be handled, WWM is the key to better performance. Finally, when fine-tuned on sentence-level downstream tasks, models trained with different masking strategies perform comparably.
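The difference between character-level masking (CLM) and whole word masking (WWM) for Chinese can be shown on a pre-segmented sentence. This is an illustrative sketch only: the function names are hypothetical, the word segmentation is assumed to be given, and the masked position is chosen explicitly here, whereas real pretraining samples positions at random.

```python
MASK = "[MASK]"

def char_level_mask(words, char_index):
    """CLM: mask exactly one character, ignoring word boundaries."""
    chars = [c for w in words for c in w]
    chars[char_index] = MASK
    return chars

def whole_word_mask(words, word_index):
    """WWM: mask every character of the chosen word."""
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i == word_index else list(w))
    return out

# "I like language models", segmented into four words.
words = ["我", "喜欢", "语言", "模型"]
clm = char_level_mask(words, 1)  # masks only the first character of "喜欢"
wwm = whole_word_mask(words, 1)  # masks both characters of "喜欢"
```

With CLM the model can still see "欢" when predicting the masked character, while WWM forces it to recover the whole word from context, which is consistent with the abstract's finding that WWM helps when more than one character must be handled.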

preprint | 2021 | arXiv

Ferromagnetism with in-plane magnetization, Dirac spin-gapless semiconducting property, and tunable topological states in two-dimensional rare-earth-metal dinitrides

Since bulk single-crystal MoN2/ReN2 with a layered structure was successfully synthesized experimentally, transition-metal dinitrides have attracted considerable attention in recent years. Here, we focus on rare-earth-metal (Rem) elements and propose seven stable Rem dinitride monolayers with a 1T structure, namely 1T-RemN2. These monolayers have a ferromagnetic ground state with in-plane magnetization. Without the spin-orbit coupling (SOC) effect, the band structures are spin-polarized with Dirac points at the Fermi level. Remarkably, the 1T-LuN2 monolayer shows an isotropic magnetic anisotropy energy in the xy-plane with in-plane magnetization, indicating easy tunability of the magnetization direction. When rotating the magnetization vector in the xy-plane, our proposed model can accurately describe the variation of the SOC band gap, and two topological states (Weyl-like semimetal and Chern insulator states) appear with tunable properties. The Weyl-like semimetal state is a critical point between the two Chern insulator states with Chern numbers of opposite sign. The large nontrivial band gap (up to 60.3 meV) and the Weyl-like semimetal state are promising for applications in spintronic devices.

preprint | 2020 | arXiv

Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces

Adversarial attacks on text are mostly substitution-based methods that replace words or characters in the original text to achieve successful attacks. Recent methods use pre-trained language models as the substitute generator. In Chinese, however, such methods are not applicable since Chinese words require segmentation first. In this paper, we propose a pre-trained language model as the substitute generator, using sentence-pieces to craft adversarial examples in Chinese. The substitutions in the generated adversarial examples are not characters or words but 'pieces', which are more natural to Chinese readers. Experimental results show that the generated adversarial samples can mislead strong target models while remaining fluent and semantically preserved.

preprint | 2020 | arXiv

PAI-graphene: a new topological semimetallic two-dimensional carbon allotrope with highly tunable anisotropic Dirac cones

Using an evolutionary algorithm for crystal structure prediction, we present a new stable two-dimensional (2D) carbon allotrope composed of polymerized as-indacenes (PAI) in a zigzag pattern, namely PAI-graphene, whose energy is lower than most of the reported 2D allotropes of graphene. Crucially, the crystal structure realizes a nonsymmorphic layer group that enforces a nontrivial global topology of the band structure with two Dirac cones lying perfectly at the Fermi level. The absence of electron/hole pockets makes PAI-graphene a pristine crystalline topological semimetal having anisotropic Fermi velocities with a high value of $7.0 \times 10^{5}$ m/s. We show that while the semimetallic property of the allotrope is robust against the application of strain, the positions of the Dirac cones and the Fermi velocities can be modified significantly with strain. Moreover, by combining strain along both the x- and y-directions, two band inversions take place at $\Gamma$, leading to the annihilation of the Dirac nodes and demonstrating the possibility of strain-controlled conversion of a topological semimetal into a semiconductor. Finally, we formulate the bulk-boundary correspondence of the topological nodal phase in the form of a generalized Zak-phase argument, finding perfect agreement with the topological edge states computed for different edge terminations.

preprint | 2020 | arXiv

Structural phase transition in monolayer gold(I) telluride: From a room-temperature topological insulator to an auxetic semiconductor

Structural phase transitions between semiconductors and topological insulators have rich applications in nanoelectronics but are rarely found in two-dimensional (2D) materials. In this work, by combining ab initio computations and evolutionary structure search, we investigate two stable 2D forms of gold(I) telluride (Au$_{2}$Te) with square symmetry, denoted as s(I)- and s(II)-Au$_{2}$Te. s(II)-Au$_{2}$Te is the global minimum structure and is a room-temperature topological insulator. s(I)-Au$_{2}$Te is a direct-gap semiconductor with high carrier mobilities and an unusual in-plane negative Poisson's ratio. Both s(I) and s(II) phases have an ultra-low Young's modulus, implying high flexibility. By applying a small tensile strain, s(II)-Au$_{2}$Te can be transformed into s(I)-Au$_{2}$Te. Hence, a structural phase transition from a room-temperature topological insulator to an auxetic semiconductor is found in the 2D forms of Au$_{2}$Te, which enables potential applications in phase-change electronic devices. Moreover, we elucidate the mechanism of the phase transition with the help of phonon spectra and group theory analysis.

preprint | 2020 | arXiv

The magnetic, electronic, and light-induced topological properties in two-dimensional hexagonal FeX2 (X = Cl, Br, I) monolayers

Topological materials are fertile ground for investigating topological phases of matter and topological phase transitions. In particular, the quest for novel topological phases in 2D materials is attracting fast-growing attention. Here, using Floquet-Bloch theory, we propose to realize chiral topological phases in 2D hexagonal FeX2 (X=Cl, Br, I) monolayers under irradiation by circularly polarized light. Such 2D FeX2 monolayers are predicted to be dynamically stable and to exhibit both ferromagnetic and semiconducting properties. To capture the full topological physics of the magnetic semiconductor under periodic driving, we adopt ab initio Wannier-based tight-binding methods for the Floquet-Bloch bands, with the light-induced band gap closings and openings being obtained as the light field strength increases. Calculations for a slab with open boundaries show the existence of chiral edge states. Interestingly, topological transitions in which the number of branches of chiral edge states changes from zero to one and from one to two are obtained by tuning the light amplitude, showing that a topological Floquet phase with a high Chern number can be induced in the present Floquet-Bloch systems.