LLMs Can Learn to Reason Via Off-Policy RL
Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differ...
The paper is available now on arxiv (arxiv.org/abs/2602.19362), along with checkpoints and data (huggingface.co/collections/...) and code (github.com/danieldritte...). Thanks to the team at Harvard/Cornell/Databricks! (Owen Oertell, Bradley Guo, Jonathan Chang, @xkianteb.bsky.social and Wen Sun)
27.02.2026 22:27
This framing enables effective, stable training with off-policy data (up to 400 gradient steps before synchronizing the generator and trainer), and is more sample efficient, matching DeepCoder's performance with up to 3x fewer samples.
and is up to 3x more sample efficient than DeepCoder, while matching its performance
We observe that OAPL outperforms GRPO on math tasks.
On-policy methods applied to off-policy data often need importance-sampling corrections, but OAPL avoids this by framing the off-policy gap as a KL-regularized RL problem, explicitly optimizing the trainer to stay near the data-generating distribution.
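A minimal sketch of what a KL-regularized off-policy objective can look like. This is an illustration, not the paper's exact OAPL loss: the function name, the advantage-weighted log-likelihood term, and the simple per-token log-ratio KL estimate are all assumptions made for clarity.

```python
def kl_regularized_loss(logp_train, logp_gen, advantages, beta=0.1):
    """Hypothetical sketch of a KL-regularized off-policy objective.

    logp_train: trainer log-probs of the sampled tokens
    logp_gen:   log-probs under the (lagged) data-generating policy
    advantages: per-sample advantage estimates
    beta:       strength of the KL penalty (assumed hyperparameter)
    """
    n = len(advantages)
    # Advantage-weighted log-likelihood term: no importance ratios,
    # so stale (off-policy) samples do not blow up the gradient.
    pg_term = -sum(a * lp for a, lp in zip(advantages, logp_train)) / n
    # Crude per-token KL estimate from log-prob differences; it
    # penalizes the trainer for drifting away from the generator.
    kl_term = sum(lt - lg for lt, lg in zip(logp_train, logp_gen)) / n
    return pg_term + beta * kl_term
```

The KL penalty is what lets the generator lag many gradient steps behind the trainer: the trainer is explicitly kept near the distribution that produced the data, instead of correcting for the mismatch after the fact.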
Most post-training pipelines use on-policy algorithms (e.g. GRPO), but post-training infrastructure is often not truly on-policy (due to asynchronous generation and trainer/inference engine differences).
In our new preprint we show that you can surpass GRPO while being super off-policy!
We introduce Optimal Advantage-based Policy Optimization with a Lagged Inference Policy (OAPL), which is off-policy by design.
Does LLM RL post-training need to be on-policy?