Source author record

Hejian Sang

Hejian Sang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computation and Language, Machine Learning

Catalog footprint

What is connected

2 works

3 topics

4 close collaborators


preprint · 2026 · arXiv

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

We present a four-stage post-training workflow for LLM reasoning that allocates scarce labeled training data more effectively than standard recipes. The stages are: (1) sparse-reward RL on a larger teacher; (2a) forward-KL warmup on teacher rollouts; (2b) on-policy distillation under student rollouts; (3) optional sparse-reward RL on the deployment student using any held-out labeled data. On verifiable math with a Qwen3-1.7B deployment student, the workflow reaches 79.3% MATH and 25.2% AIME 2024 (avg@16), versus 75.9% and 19.8% for direct GRPO on the same student. We justify the workflow through a reward-density principle: each gradient step of on-policy distillation is a local trust-region update under a dense teacher-induced implicit reward, informative only when the teacher is itself reward-shaped (condition C1) and lies within a small KL of the student (condition C2). Stages 1 and 2a are the explicit devices that enforce C1 and C2. A single component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs 7.8 MATH points, removing the forward-KL warmup costs 1.7 points, and removing the on-policy distillation stage costs 3.3 points. The recipe replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher.
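As a rough illustration of the two KL directions the abstract refers to, the sketch below computes a per-token KL divergence between a teacher and a student distribution over a toy three-word vocabulary. The function name `kl_divergence` and the probability values are hypothetical. Pairing the reverse direction with on-policy distillation is an assumption based on common practice, since the abstract does not state which divergence stage 2b uses.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Return KL(p || q) for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += pi * math.log(pi / qi)
    return total


# Hypothetical next-token distributions (illustrative values only).
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

# Forward KL: expectation under the teacher, as in a warmup on teacher rollouts.
forward_kl = kl_divergence(teacher, student)

# Reverse KL: expectation under the student (assumed here for the on-policy stage).
reverse_kl = kl_divergence(student, teacher)

print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
```

The two directions generally give different values, which is one reason the order of the warmup and on-policy stages can matter. Condition C2 in the abstract asks for the teacher to lie within a small KL of the student.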

preprint · 2026 · arXiv

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially higher memory and FLOPs. To this end, training on only the first 50% of tokens of every training sequence can retain, on average, ≈91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. Code is available at https://github.com/weiruichen01/distilling-the-essence.
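The idea of training on only the leading portion of each sequence can be sketched as below. This is a minimal illustration, not the authors' protocol: the helper names `truncate_sequence` and `truncate_batch`, the rounding-down rule, and the choice to keep at least one token are all assumptions, and the linked repository may handle these details differently.

```python
from typing import List


def truncate_sequence(token_ids: List[int], keep_fraction: float = 0.5) -> List[int]:
    """Keep only the leading `keep_fraction` of tokens in one training sequence."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not token_ids:
        return []
    # Round down, but keep at least one token (an assumption of this sketch).
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


def truncate_batch(batch: List[List[int]], keep_fraction: float = 0.5) -> List[List[int]]:
    """Apply the same truncation to every sequence in a batch."""
    return [truncate_sequence(seq, keep_fraction) for seq in batch]


# Hypothetical token-id sequences of different lengths.
batch = [list(range(10)), list(range(7)), [42]]
truncated = truncate_batch(batch, keep_fraction=0.5)

original_tokens = sum(len(seq) for seq in batch)
kept_tokens = sum(len(seq) for seq in truncated)
print(f"kept {kept_tokens} of {original_tokens} tokens")
```

Because the cut is applied per sequence, every example still contributes to training; only its tail is dropped.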
