Work done with Lianghuan Huang, Insup Lee, Shuo Li, and @obastani.bsky.social!
See paper (arxiv.org/abs/2510.03515) for more detailed analyses of how off-policy training affects accuracy, training time, and sample staleness!
Results for Qwen2.5-1.5B, Llama3.2-1B, Gemma3
Our results generalize well to different model sizes (0.5B, 1B, 1.5B) and families (Qwen, Llama, Gemma).
Group advantage estimation formula used in RAPID
For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance-weighted estimator to correct for the bias arising from off-policy learning.
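The two ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `group_advantages` is the standard per-group reward normalization (as in GRPO-style estimators), and `importance_weight` is the usual likelihood ratio between the current policy and the behavior policy that generated the sample; the function names and the epsilon constant are my own.

```python
import math
import statistics

def group_advantages(rewards, eps=1e-8):
    # Normalize rewards across a group of G completions for the same prompt:
    # A_i = (r_i - mean(r)) / (std(r) + eps).
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def importance_weight(logp_current, logp_behavior):
    # Ratio pi_theta(y|x) / pi_behavior(y|x), computed from log-probs.
    # Reweighting stale samples by this ratio corrects the off-policy bias.
    return math.exp(logp_current - logp_behavior)
```

A fresh on-policy sample has ratio 1; as the policy drifts away from the one that generated the sample, the weight moves away from 1 to compensate.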
RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches.
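The schedule described above looks roughly like the following sketch. Everything here is illustrative: the function names, batch sizes, and the `epochs_per_batch` knob are assumptions, standing in for real sampling and gradient-update routines.

```python
def train_batched(policy, prompts, sample_fn, update_fn,
                  inference_batch=256, mini_batch=32, epochs_per_batch=4):
    # Generate rollouts once per large batch (the expensive inference step),
    # then reuse them for several off-policy mini-batch gradient updates.
    for start in range(0, len(prompts), inference_batch):
        batch = prompts[start:start + inference_batch]
        rollouts = sample_fn(policy, batch)  # bulk inference
        for _ in range(epochs_per_batch):    # samples grow staler each pass
            for i in range(0, len(rollouts), mini_batch):
                update_fn(policy, rollouts[i:i + mini_batch])  # off-policy step
    return policy
```

The point of the schedule is that inference hardware stays saturated (one big generation pass) while each rollout is amortized over multiple gradient steps, at the cost of the staleness that the importance weighting has to correct.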
We run all our experiments on only 4 A6000 GPUs, alternating inference and back-propagation, and our algorithm reduces training time by 34% for MBPP+, 32% for MiniF2F, and 11% for MATH when compared to the strongest baseline, while maintaining similar or better accuracy.
Table showing MBPP+, MATH, and MiniF2F results for SFT, GRPO, DAPO, PG, and RAPID.
Only have a limited number of GPUs to train your language model? We introduce RAPID, a novel RL algorithm that substantially reduces the post-training time of small language models in resource-constrained settings.
Work done with @davisbrown.bsky.social, Shuo Li, @profericwong.bsky.social, Hamed Hassani, @obastani.bsky.social!
What do we find? o4-mini deploys a wider variety of strategies to circumvent captchas than other models. DeepSeek-R1, on the other hand, will consistently claim to close pop-up banners even when it has not done so.
To identify failure modes, we have humans label each agent action. We cluster and label these annotations with @transluce.bsky.social's Docent, and discover 3 failure modes that we reproduce and study at scale: captcha resolution, pop-up banner removal, and direct navigation to URLs.
Example user-submitted task: "Find me the last available train from Cardiff Central to Barry Docks station today on trainline"
DeepSeek-R1 GIF:
Paper: arxiv.org/abs/2510.02418
Code: github.com/sagnikanupam...
Introducing an evaluation platform for web agents: BrowserArena! Combining the awesome @lmarena.bsky.social platform with BrowserUse, we rank LLMs side-by-side to compare their ability to solve web navigation tasks!
Users vote for models using GIFs and text outputs to judge task performance.