
Daniel Ritter

@danielritter

CS PhD Student @ Cornell University, Prev. Software Engineer in the Marks Lab @ Harvard Med

70
Followers
610
Following
8
Posts
24.10.2024
Joined

Latest posts by Daniel Ritter @danielritter

LLMs Can Learn to Reason Via Off-Policy RL Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differ...

The paper is available now on arxiv (arxiv.org/abs/2602.19362), along with checkpoints and data (huggingface.co/collections/...) and code (github.com/danieldritte...). Thanks to the team at Harvard/Cornell/Databricks! (Owen Oertell, Bradley Guo, Jonathan Chang, @xkianteb.bsky.social and Wen Sun)

27.02.2026 22:27 👍 1 🔁 0 💬 0 📌 0

This framing enables effective, stable training with off-policy data (up to 400 gradient steps before synchronizing the generator and trainer) and is more sample efficient, matching DeepCoder's performance with up to 3x fewer samples.

27.02.2026 22:27 👍 0 🔁 0 💬 1 📌 0
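As a toy illustration of how far off-policy this regime is, the loop below models a decoupled setup where the trainer takes a gradient step every iteration but only pushes fresh weights to the generator every 400 steps. Everything here (names, structure) is a hypothetical sketch for illustration, not the paper's training code.

```python
SYNC_INTERVAL = 400  # gradient steps between generator/trainer syncs (figure from the post)

def max_policy_lag(num_steps, sync_interval=SYNC_INTERVAL):
    """Toy model of the decoupled loop: the trainer updates every step,
    but the generator only receives fresh weights at sync points, so the
    data it produces can be up to `sync_interval` gradient steps stale."""
    trainer_version = generator_version = 0
    max_lag = 0
    for step in range(1, num_steps + 1):
        trainer_version += 1  # one off-policy gradient step on generator data
        max_lag = max(max_lag, trainer_version - generator_version)
        if step % sync_interval == 0:
            generator_version = trainer_version  # push fresh weights to the generator
    return max_lag
```

In this toy model, `max_policy_lag(1200)` reaches a lag of 400 gradient steps, which is the staleness an off-policy-by-design method has to tolerate.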

and is up to 3x more sample efficient than DeepCoder, while matching its performance

27.02.2026 22:27 👍 0 🔁 0 💬 1 📌 0

We observe that OAPL outperforms GRPO on math tasks.

27.02.2026 22:27 👍 0 🔁 0 💬 1 📌 0

On-policy methods applied to stale data typically rely on importance-sampling corrections, but OAPL avoids them by framing the off-policy gap as a KL-regularized RL problem, explicitly optimizing the trainer to stay close to the data-generating distribution.

27.02.2026 22:27 👍 1 🔁 0 💬 1 📌 0
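A minimal numpy sketch of what such a KL-regularized objective can look like; this is an illustrative stand-in, not the paper's implementation — the per-sample form, the sample-based KL estimate, and the `beta` coefficient are assumptions.

```python
import numpy as np

def kl_regularized_loss(logp_train, logp_gen, advantages, beta=0.1):
    """Illustrative KL-regularized policy objective (not the paper's code).

    logp_train : log-probs of sampled actions under the trainer policy
    logp_gen   : log-probs of the same actions under the lagged generator
    advantages : advantage estimates for those samples
    beta       : weight of the KL penalty keeping the trainer near the
                 data-generating (generator) distribution
    """
    # Advantage-weighted likelihood term: raise log-prob of high-advantage samples.
    pg_term = -(advantages * logp_train).mean()
    # Sample-based estimate of KL(generator || trainer) on generator samples.
    kl_term = (logp_gen - logp_train).mean()
    return pg_term + beta * kl_term
```

Because divergence from the generator is penalized directly, no per-sample importance weights are needed even when the data is hundreds of gradient steps stale.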

Most post-training pipelines use on-policy algorithms (e.g. GRPO), but post-training infrastructure is often not truly on-policy (due to asynchronous generation and trainer/inference engine differences).

27.02.2026 22:27 👍 0 🔁 0 💬 1 📌 0

In our new preprint we show that you can surpass GRPO while being super off-policy!
We introduce Optimal Advantage-based Policy Optimization with a Lagged Inference Policy (OAPL), which is off-policy by design.

27.02.2026 22:27 👍 0 🔁 0 💬 1 📌 0

Does LLM RL post-training need to be on-policy?

27.02.2026 22:27 👍 2 🔁 0 💬 1 📌 0