Amir Mesbah's Avatar

Amir Mesbah

@amirmesbah

Graduate Student - Interested in RL and its mathematics πŸ‘Ύ > https://amirhosein-mesbah.github.io/

118
Followers
537
Following
19
Posts
18.11.2024
Joined

Latest posts by Amir Mesbah @amirmesbah

Preview
Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration....

You're welcome!
This is also on my reading list, as an application of IL in offline-to-online learning
- Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning arxiv.org/abs/2509.26605

16.12.2025 07:47 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and a...

- Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning arxiv.org/abs/2407.15007

And this "invitation" was also very intuitive
- An Invitation to Imitation www.ri.cmu.edu/publications...

16.12.2025 00:57 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Please, don't automate science! I was at an event on AI for science yesterday, a panel discussion here at NeurIPS. The panelists discussed how they plan to replace humans a...

I was at an event on AI for science yesterday, a panel discussion here at NeurIPS. The panelists discussed how they plan to replace humans at all levels in the scientific process. So I stood up and protested that what they are doing is evil.

Full post:
togelius.blogspot.com/2025/12/plea...

08.12.2025 06:51 πŸ‘ 270 πŸ” 68 πŸ’¬ 29 πŸ“Œ 33
ICML 2026 Call For Tutorials

πŸ“£ #ICML tutorials: We want to know what *you* would like to learn. This year, Adam White and I are calling for nominations of topics and/or presenters.

Until December 7th, you can send us your suggestions, and we will use them to shape the program.

icml.cc/Conferences/...

12.11.2025 08:12 πŸ‘ 14 πŸ” 9 πŸ’¬ 0 πŸ“Œ 0
Post image

🚨The Formalism-Implementation Gap in RL research🚨

Lots of progress in RL research over the last 10 years, but too much of it is performance-driven => overfitting to benchmarks (like the ALE).

1⃣ Let's advance science of RL
2⃣ Let's be explicit about how benchmarks map to formalism

1/X

28.10.2025 13:55 πŸ‘ 44 πŸ” 5 πŸ’¬ 1 πŸ“Œ 2
Preview
Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning The Combined Algorithm Selection and Hyperparameter optimization (CASH) is a challenging resource allocation problem in the field of AutoML. We propose MaxUCB, a max $k$-armed bandit method to trade o...

I am happy to share that our paper "Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning" has been accepted at NeurIPS 2025!

Endless thanks to my amazing co-authors @claireve.bsky.social and @keggensperger.bsky.social

πŸ“„ Read it on arXiv: arxiv.org/abs/2505.05226

(1/3)

06.10.2025 16:53 πŸ‘ 7 πŸ” 1 πŸ’¬ 1 πŸ“Œ 1
Preview
a close up of a sad cat with the words pleeeaasse written below it

cvoelcker.de/blog/2025/re...

I finally gave in and made a nice blog post about my most recent paper. This was a surprising amount of work, so please be nice and go read it!

02.10.2025 21:34 πŸ‘ 29 πŸ” 7 πŸ’¬ 0 πŸ“Œ 3

Thanks a lot! That was lightning fast πŸš€

02.10.2025 22:26 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Relative Entropy Pathwise Policy Optimization - Technical Overview | Claas A. Voelcker A lightweight overview of the new REPPO algorithm

cvoelcker.de/blog/2025/re...

Here ya go!

02.10.2025 21:31 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

Maybe a blog post would also help =)

26.09.2025 14:59 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

Could you add me please?

09.09.2025 21:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Definition of dynamic programming in RL, from Csaba Szepesvári’s RL theory lecture notes (Lecture 2, "Planning in MDPs")

Definition of dynamic programming, from Puterman’s Markov Decision Processes, chapter 1.

I came across a couple of other definitions that might be helpful to mention (apologies if you’re already considering these).
The first one is from Csaba SzepesvΓ‘ri’s RL theory lecture notes (lecture 2, planning in MDPs), and the second one is from Puterman's MDP book (chapter 1).
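The "dynamic programming" both sources define can be made concrete with a tiny sketch. Below is minimal value iteration on a hypothetical 2-state, 2-action MDP (the transition and reward numbers are made up for illustration), repeatedly applying the Bellman optimality update until the values stop changing:

```python
import numpy as np

# Toy MDP: P[a, s, s'] are transition probabilities, R[a, s] expected rewards.
# All numbers here are illustrative, not from either reference.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality update: V(s) <- max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    Q = R + gamma * (P @ V)      # shape (actions, states)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(V)  # optimal state values at convergence
```

The loop is exactly the fixed-point iteration that both definitions describe: solving the whole problem by repeatedly combining solutions to the one-step subproblems.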

04.08.2025 09:45 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

What are we talking about when we talk about Dynamic Programming?

#ReinforcementLearning

03.08.2025 20:14 πŸ‘ 8 πŸ” 1 πŸ’¬ 2 πŸ“Œ 0

What if all mathematicians had great visualization skills, tools, and public notes!

31.07.2025 16:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Onno and I will be presenting our poster at #W1005 tomorrow (Wed) morning.
He made a great thread about it, come chat with us about POMDP theory :)

16.07.2025 03:45 πŸ‘ 19 πŸ” 5 πŸ’¬ 0 πŸ“Œ 0

I will not be at #ICML2025 this year, but 3 of my PhD students at πŸ€– Adage (Adaptive Agents Lab) πŸ€– are, presenting 3 papers.
⭐ Avery Ma
⭐ Claas Voelcker (cvoelcker.bsky.social)
⭐ Tyler Kastner

Meet them to talk about Model-based RL, Distributional RL, and Jailbreaking LLMs.

14.07.2025 18:54 πŸ‘ 5 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

Levine's take on the success of LLMs compared to video models is interesting, but I'll expand on how efforts toward AI could take two different paths, and why I think AI and NeuroAI could take different approaches moving forward. 🧡

πŸ§ πŸ€– #MLSky

12.06.2025 14:30 πŸ‘ 7 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Post image

Preprint Alert πŸš€

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧡 (1/10)

14.05.2025 12:52 πŸ‘ 51 πŸ” 16 πŸ’¬ 1 πŸ“Œ 5
Preview
GitHub - vwxyzjn/cleanrl: High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

cleanrl is amazing (github.com/vwxyzjn/clea...) and its structure makes sense for teaching but an actual research codebase should not inherit this style! you do not want this amount of code duplication
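The point about duplication can be illustrated with a small sketch. In cleanrl's single-file style, a utility like an epsilon-annealing schedule is re-defined in every algorithm file; in a shared research codebase you would factor it out once. The function below is illustrative, not cleanrl's actual API:

```python
# Sketch: a utility that would live once in a shared module (e.g. utils.py)
# instead of being copy-pasted into every algorithm file.

def linear_schedule(start: float, end: float, duration: int, t: int) -> float:
    """Linearly anneal from `start` to `end` over `duration` steps, then hold."""
    slope = (end - start) / duration
    return max(end, start + slope * t)

# Each algorithm file imports this once instead of re-defining it:
eps_early = linear_schedule(1.0, 0.05, 10_000, 0)
eps_late = linear_schedule(1.0, 0.05, 10_000, 20_000)
print(eps_early, eps_late)  # 1.0 0.05
```

A bug fix or behavior change then happens in one place, which is exactly what duplicated single-file implementations make painful.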

11.05.2025 20:01 πŸ‘ 32 πŸ” 2 πŸ’¬ 4 πŸ“Œ 0
Preview
Reinforcement Learning from Human Feedback Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle…

rlhfbook also available on arxiv for SEO πŸ˜€ happy friday
arxiv.org/abs/2504.12501

18.04.2025 16:07 πŸ‘ 69 πŸ” 13 πŸ’¬ 3 πŸ“Œ 4
Reinforcement Learning (RL) for LLMs, YouTube video by Natasha Jaques

Recorded a recent "talk" / rant about RL fine-tuning of LLMs for a guest lecture in Stanford CSE234: youtube.com/watch?v=NTSY.... Covers some of my lab's recent work on personalized RLHF, as well as some mild Schmidhubering about my own early contributions to this space

27.03.2025 21:31 πŸ‘ 51 πŸ” 10 πŸ’¬ 5 πŸ“Œ 1

PQN puts Q-learning back on the map and now comes with a blog post + Colab demo! Also, congrats to the team for the spotlight at #ICLR2025

20.03.2025 11:51 πŸ‘ 16 πŸ” 4 πŸ’¬ 0 πŸ“Œ 0

Happy #Nowruz and the beginning of the spring!

20.03.2025 17:37 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I wanted to send you the link just now but hopefully you have found it =)

18.03.2025 21:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Sure *_*
Looking forward to it :)

17.03.2025 20:55 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Not yet. Just the classical claim that they're trying to learn the distribution of the return =))
Do you have any insights?

17.03.2025 18:37 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I was reading about ways to enhance the performance of DQN on a real-world problem. One of the candidates was C51, but I haven't implemented it yet because of the computational cost. It was interesting, though, because I hadn't read the papers before.
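The core idea of C51 can be sketched in a few lines: instead of a scalar Q-value, the network outputs a categorical distribution over returns on a fixed support of 51 atoms, and the Q-value used for action selection is that distribution's mean. The random logits below stand in for a network head; everything else follows the published setup:

```python
import numpy as np

# Fixed return support z_i, as in C51: 51 atoms spanning [v_min, v_max].
n_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = np.linspace(v_min, v_max, n_atoms)

# Stand-in for the distributional head's logits for 2 actions
# (in the real algorithm these come from the DQN network).
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, n_atoms))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

q_values = probs @ atoms          # E[Z(s,a)] = sum_i p_i * z_i, one per action
best_action = int(q_values.argmax())
print(q_values, best_action)
```

The extra cost over plain DQN comes from predicting (and projecting, in the training step not shown here) 51 probabilities per action instead of one scalar.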

17.03.2025 14:23 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I didn't know until last week that using it with DQN can give a huge performance boost.

17.03.2025 14:06 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Preview
Claire Vernade - European career opportunities European Academic Career Opportunities in 2025

I’ve put together a short list of opportunities for early career academics willing to come to Europe: www.cvernade.com/miscellaneou...

This mostly covers France and Germany for now but I’m willing to extend it. I build on @ellis.eu resources and my own knowledge of these systems.

11.03.2025 09:19 πŸ‘ 75 πŸ” 26 πŸ’¬ 3 πŸ“Œ 0
Preview
Andrew Barto and Richard Sutton are the recipients of the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning. In a series of papers beginning...

RL is so back!

(well, for some of us, it never really left)

awards.acm.org/about/2024-t...

05.03.2025 10:41 πŸ‘ 72 πŸ” 12 πŸ’¬ 1 πŸ“Œ 1