Work done with Lianghuan Huang, Insup Lee, Shuo Li, and @obastani.bsky.social!
See paper (arxiv.org/abs/2510.03515) for more detailed analyses of how off-policy training affects accuracy, training time, and sample staleness!
Results for Qwen2.5-1.5B, Llama3.2-1B, Gemma3
Our results generalize well to different model sizes (0.5B, 1B, 1.5B) and families (Qwen, Llama, Gemma).
Group advantage estimation formula used in RAPID
For off-policy updates, we incorporate group advantage estimation into the policy gradient algorithm, and derive an importance-weighted estimator to correct for the bias arising from off-policy learning.
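The two ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `group_advantages` is the standard per-group reward normalization (as in GRPO-style estimators), and `importance_weight` is the usual likelihood ratio between the current policy and the behavior policy that generated the sample; the function names and the epsilon constant are my own.

```python
import math
import statistics

def group_advantages(rewards, eps=1e-8):
    # Normalize rewards across a group of G completions for the same prompt:
    # A_i = (r_i - mean(r)) / (std(r) + eps).
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def importance_weight(logp_current, logp_behavior):
    # Ratio pi_theta(y|x) / pi_behavior(y|x), computed from log-probs.
    # Reweighting stale samples by this ratio corrects the off-policy bias.
    return math.exp(logp_current - logp_behavior)
```

A fresh on-policy sample has ratio 1; as the policy drifts away from the one that generated the sample, the weight moves away from 1 to compensate.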
RL tends to be costly due to the need to perform both inference and backpropagation during training. To maximize use of computational resources, our algorithm performs inference in large batches, and then performs off-policy policy gradient updates in mini-batches.
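The schedule described above looks roughly like the following sketch. Everything here is illustrative: the function names, batch sizes, and the `epochs_per_batch` knob are assumptions, standing in for real sampling and gradient-update routines.

```python
def train_batched(policy, prompts, sample_fn, update_fn,
                  inference_batch=256, mini_batch=32, epochs_per_batch=4):
    # Generate rollouts once per large batch (the expensive inference step),
    # then reuse them for several off-policy mini-batch gradient updates.
    for start in range(0, len(prompts), inference_batch):
        batch = prompts[start:start + inference_batch]
        rollouts = sample_fn(policy, batch)  # bulk inference
        for _ in range(epochs_per_batch):    # samples grow staler each pass
            for i in range(0, len(rollouts), mini_batch):
                update_fn(policy, rollouts[i:i + mini_batch])  # off-policy step
    return policy
```

The point of the schedule is that inference hardware stays saturated (one big generation pass) while each rollout is amortized over multiple gradient steps, at the cost of the staleness that the importance weighting has to correct.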
We run all our experiments on only 4 A6000 GPUs, alternating inference and back-propagation, and our algorithm reduces training time by 34% for MBPP+, 32% for MiniF2F, and 11% for MATH when compared to the strongest baseline, while maintaining similar or better accuracy.
Table showing MBPP+, MATH, and MiniF2F results for SFT, GRPO, DAPO, PG, and RAPID.
Only have a limited number of GPUs to train your language model? We introduce RAPID, a novel RL algorithm that substantially reduces the post-training time of small language models in resource-constrained settings.
Work done with @davisbrown.bsky.social, Shuo Li, @profericwong.bsky.social, Hamed Hassani, @obastani.bsky.social!
What do we find? o4-mini deploys a wider variety of strategies to circumvent captchas than other models. DeepSeek-R1, on the other hand, will consistently claim to close pop-up banners even when it has not done so.
To identify failure modes, we have humans label each agent action. We cluster and label these annotations with @transluce.bsky.social's Docent, and discover 3 failure modes that we reproduce and study at scale: captcha resolution, pop-up banner removal, and direct navigation to URLs.
Example user-submitted task: "Find me the last available train from Cardiff Central to Barry Docks station today on trainline"
DeepSeek-R1 GIF:
Paper: arxiv.org/abs/2510.02418
Code: github.com/sagnikanupam...
Introducing an evaluation platform for web agents: BrowserArena! Combining the awesome @lmarena.bsky.social platform with BrowserUse, we rank LLMs side-by-side to compare their ability to solve web navigation tasks!
Users vote for models using GIFs and text outputs to judge task performance.