Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
We present a four-stage post-training workflow for LLM reasoning that allocates scarce labeled training data more effectively than standard recipes. The stages are: (1) sparse-reward RL on a larger teacher; (2a) forward-KL warmup on teacher rollouts; (2b) on-policy distillation under student rollouts; and (3) optional sparse-reward RL on the deployment student using any held-out labeled data. On verifiable math with a Qwen3-1.7B deployment student, the workflow reaches $79.3\%$ on MATH and $25.2\%$ on AIME~2024 (avg@16), versus $75.9\%$ and $19.8\%$ for direct GRPO on the same student. We justify the workflow through a reward-density principle: each gradient step of on-policy distillation is a local trust-region update under a dense, teacher-induced implicit reward, which is informative only when the teacher is itself reward-shaped (condition C1) and lies within a small KL divergence of the student (condition C2). Stages~1 and~2a are the explicit devices that enforce C1 and C2, respectively. A single-component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs $7.8$ MATH points, removing the forward-KL warmup costs $1.7$ points, and removing the on-policy distillation stage costs $3.3$ points. The recipe replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher.
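To make the sparse-versus-dense contrast concrete, the following is a minimal sketch under stated assumptions, not the paper's own derivation. The notation is ours: $\pi_\theta$ is the student, $\pi_T$ the teacher, $x$ a prompt, $y$ a sampled response, and $R(x,y)$ a verifier reward. We also assume the common reverse-KL form of on-policy distillation; other divergences are possible.
\[
\text{sparse RL:}\quad \max_\theta\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, R(x,y) \,\big], \qquad R(x,y) \in \{0,1\} \text{ observed once per response,}
\]
\[
\text{forward-KL warmup:}\quad \min_\theta\; \mathbb{E}_{x,\; y \sim \pi_T(\cdot \mid x)}\Big[ -\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \Big],
\]
\[
\text{on-policy distillation:}\quad \max_\theta\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\Big[ \sum_{t=1}^{|y|} r_t \Big], \qquad r_t = \log \pi_T(y_t \mid x, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t}).
\]
Under these assumptions, the warmup objective equals the forward KL from teacher to student up to the teacher's entropy, which does not depend on $\theta$. The distillation objective equals the negative expected sum of per-token reverse KL divergences $\mathrm{KL}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t})\big)$ along student rollouts. The student therefore receives one signal $r_t$ per token rather than one signal per response, which is the sense in which the teacher-induced reward is dense.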