
Mirco Mutti

@mircomutti

Reinforcement learning, but without rewards. Postdoc at the Technion. PhD from Politecnico di Milano. https://muttimirco.github.io

1,465
Followers
311
Following
36
Posts
20.11.2024
Joined

Latest posts by Mirco Mutti @mircomutti


📣 Reinforcement Learning Summer School is returning to Milan in 2026!

Co-organized with @ellisunitmilan.bsky.social & designed for Master's and PhD students on RL theory, multi-agent systems, RL & LLMs, real-world applications...

๐Ÿ“ Milan ๐Ÿ‡ฎ๐Ÿ‡น
๐Ÿ“… 3-12 June
โฐ Apply by 27 March
๐Ÿ”— https://bit.ly/4b2Plhp

04.03.2026 14:49 ๐Ÿ‘ 19 ๐Ÿ” 9 ๐Ÿ’ฌ 0 ๐Ÿ“Œ 0

for inverse:
I really like the conceptualization of the problem in the works by Alberto & Filippo, such as
- proceedings.mlr.press/v202/metelli...
- arxiv.org/pdf/2501.07996

(I may be biased here bc I collaborated on some of those)

16.12.2025 10:34 👍 1 🔁 0 💬 0 📌 0

for imitation:
- arxiv.org/pdf/2503.09722 around "separation between bc in discrete and continuous settings" and followups
- dylanfoster.net/il-tutorial/ tutorial on foundations of imitation learning by Max, Dylan, Adam may also be a useful lookup

16.12.2025 10:29 👍 0 🔁 0 💬 1 📌 0

Absolutely, come to the poster!
Some say Riccardo's aura will be hovering around

28.11.2025 23:54 👍 1 🔁 0 💬 0 📌 0

Correct

18.11.2025 20:35 👍 2 🔁 0 💬 0 📌 0

No, but since the PC explicitly suggested posting on the 20th, I think most people will comply

18.11.2025 07:42 👍 0 🔁 0 💬 1 📌 0

As Transactions on Machine Learning Research (TMLR) grows in number of submissions, we are looking for more reviewers and action editors. Please sign up!

Only one paper to review at a time and at most 6 per year; reviewers report greater satisfaction than when reviewing for conferences!

14.10.2025 13:32 👍 10 🔁 12 💬 1 📌 2

📣 Registration for EWRL is now open 📣
Register now 👇 and join us in Tübingen for 3 days (17th-19th September) full of inspiring talks, posters, and many social activities to push the boundaries of the RL community!

13.08.2025 17:02 👍 8 🔁 4 💬 0 📌 1

Looks interesting, but I cannot access the URL or find the report anywhere

29.07.2025 06:54 👍 0 🔁 0 💬 0 📌 0

That's my little #ICML2025 convex RL roundup!

If you know of other cool work in this space (or are working on one), feel free to reply and share.

Hope to see even more work on convex RL variations 🚀

n/n

24.07.2025 13:09 👍 1 🔁 0 💬 0 📌 0

📄 Flow density control – @desariky.bsky.social et al

Bridging convex RL with generative models: how can we steer diffusion/flow models to optimize non-linear user-specified utilities (beyond just entropy-regularized fine-tuning)?

📍 EXAIT workshop
🔗 openreview.net/pdf?id=zOgAx...

7/n

24.07.2025 13:09 👍 2 🔁 0 💬 1 📌 0

📄 Towards unsupervised multi-agent RL – @ricczamboni.bsky.social et al (yours truly!)

Still in the convex Markov games space: this work explores more tractable objectives for the learning setting.

📍 EXAIT workshop
🔗 https://openreview.net/pdf?id=A1518D1Pp9

6/n

24.07.2025 13:09 👍 0 🔁 0 💬 1 📌 0

📄 Convex Markov games – Ian Gemp et al

If you can 'convexify' MDPs, you can do the same for Markov games.
These two papers lay out a general framework + algorithms for the zero-sum version.

🔗 https://openreview.net/pdf?id=yIfCq03hsM
🔗 https://openreview.net/pdf?id=dSJo5X56KQ

5/n

24.07.2025 13:09 👍 1 🔁 0 💬 1 📌 0

📄 The number of trials matters in infinite-horizon MDPs – @pedrosantospps.bsky.social et al

A deeper look at how the number of realizations used to compute F affects the convex RL problem in infinite-horizon settings.

🔗 https://openreview.net/pdf?id=I4jNAbqHnM

4/n

24.07.2025 13:09 👍 1 🔁 0 💬 1 📌 0

📄 Online episodic convex RL – Bianca Marni Moreno et al

Regret bounds for online convex RL, where F^t is adversarial and revealed only after each episode (or just evaluated on the given trajectory in a bandit-feedback variation)

🔗 https://openreview.net/pdf?id=d8xnwqslqq

3/n

24.07.2025 13:09 👍 0 🔁 0 💬 1 📌 0

๐Ÿ” Convex RL

Standard RL optimizes a linear objective: โŸจd^ฯ€, rโŸฉ.
Convex RL generalizes this to any F(d^ฯ€), where F is non-linear (originally assumed convexโ€”hence the name).

This framework subsumes:
• Imitation
• Risk sensitivity
• State coverage
• RLHF
...and more.
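To make the two objectives concrete, here's a tiny numerical sketch. All numbers and names (`d_pi`, `F_entropy`) are invented for illustration; the convex example is the negative-entropy objective, whose minimization corresponds to state coverage.

```python
import numpy as np

# Illustrative 4-state example: d_pi is the state occupancy measure
# induced by some policy, r a reward vector (both made up).
d_pi = np.array([0.7, 0.1, 0.1, 0.1])
r = np.array([1.0, 0.0, 0.0, 0.0])

# Standard RL: linear objective <d_pi, r>
linear_objective = float(d_pi @ r)

# Convex RL: any non-linear F(d_pi), e.g. negative entropy,
# which is convex in d and minimized by the uniform occupancy
def F_entropy(d):
    return float(np.sum(d * np.log(d)))

print(linear_objective)  # reward-maximization score
print(F_entropy(d_pi))   # lower = more uniform state coverage
```

A uniform occupancy measure attains the minimum of `F_entropy`, which is why this particular F captures pure exploration rather than reward maximization.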

2/n

24.07.2025 13:09 👍 0 🔁 0 💬 1 📌 0

Walking around posters at @icmlconf.bsky.social, I was happy to see some buzz around convex RL, a topic I've worked on and strongly believe in.

Thought I'd share a few ICML papers in this direction. Let's dive in 👇

But first… what is convex RL?

🧵

1/n

24.07.2025 13:09 👍 5 🔁 1 💬 1 📌 1
Preview
A Classification View on Meta Learning Bandits – Contextual multi-armed bandits are a popular choice to model sequential decision-making. E.g., in a healthcare application we may perform various tests to assess a patient's condition (exploration) and t...

To learn more:

- come to our poster (n. 908) in the Thursday morning session #ICML2025

- read the preprint arxiv.org/abs/2504.04505

- watch the seminar youtube.com/watch?v=pNos...

n/n

15.07.2025 15:50 👍 1 🔁 0 💬 0 📌 0

This is how we got "A classification view on meta learning bandits", a joint work with awesome collaborators Jeongyeol, Shie, and @aviv-tamar.bsky.social

7/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

The regret bounds depend on an instance-dependent "classification coefficient", which suggests classification really captures the complexity of the problem rather than being a mere implementation tool

6/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

For the latter, we show exploration is *interpretable*, as it is implemented by a shallow decision tree of simple constant-action policies, and *efficient*, giving upper and lower bounds on the regret

5/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

Yes, apparently!
A simple algorithm that classifies the latent (condition) with a decision tree (img above right) and then exploits the best action for the classified latent does the job
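For intuition only, here is a minimal classify-then-exploit sketch on a toy instance. The setup, names, and numbers (`pull`, the noise scale, the 0.5 threshold) are all invented for illustration and are not the paper's actual algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: two latent "conditions" (0 or 1), one diagnostic test
# (action 0), and one treatment per latent (actions 1 and 2).
def pull(latent, action):
    if action == 0:  # diagnostic test: noisy signal about the latent
        return latent + 0.1 * rng.standard_normal()
    return 1.0 if action - 1 == latent else 0.0  # treatment reward

def classify_then_exploit(latent, horizon=20, n_tests=5):
    # Explore: run the constant test policy a few times, then branch
    # on the averaged signal (a depth-one "decision tree").
    signal = np.mean([pull(latent, 0) for _ in range(n_tests)])
    guess = int(signal > 0.5)  # classified latent
    # Exploit: commit to the best treatment for the classified latent.
    return sum(pull(latent, guess + 1) for _ in range(horizon - n_tests))

print(classify_then_exploit(latent=1))  # reward 1 per round after exploring
```

The exploration phase is readable by a human (run the tests, branch on the outcome), in contrast to, say, posterior sampling over latents.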

4/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

Humans typically develop a standard strategy prescribing a sequence of tests to diagnose the condition before committing to the best treatment (see img left). Can we design a bandit algorithm that learns a similarly interpretable exploration but is also provably efficient?

3/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

Think about a setting in which we aim to converge on the best treatment (action) for a given patient (context) with some unknown condition (latent). The difference between how humans and bandits approach this same problem is striking:

2/n

15.07.2025 15:50 👍 1 🔁 0 💬 1 📌 0

Would you trust a bandit algorithm to make decisions on your health or investments? Common exploration mechanisms are efficient but scary.

In our latest work at @icmlconf.bsky.social, we reimagine bandit algorithms to get *efficient* and *interpretable* exploration.

A 🧵 below

1/n

15.07.2025 15:50 👍 3 🔁 0 💬 1 📌 0

Here we have an original take on how to make the best of parallel data collection for RL. Don't miss the poster at ICML, we're curious to hear what y'all think!

Kudos to the awesome students Vincenzo and @ricczamboni.bsky.social for their work under the wise supervision of Marcello.

09.07.2025 13:53 👍 1 🔁 0 💬 0 📌 0
First, we claim that there exists a unique value function $\Vopt$ that satisfies the following equation: For any $x \in \XX$, we have
\begin{align*}
	\Vopt(x) =
	\max_{a \in \AA} \left \{ r(x,a) + \gamma \int \PKernel(\dx' | x, a) \Vopt(x') \right \}.
\end{align*}
This claim alone, however, does not show that this $\Vopt$ is the same as $V^\piopt$.

The second claim is that $\Vopt$ is indeed the same as $V^{\piopt}$, the optimal value function when $\pi$ is restricted to be within the space of stationary policies.
This claim alone, however, does not preclude the possibility that we can find an even more performant policy by going beyond the space of stationary policies.

The third claim is that for discounted continuing MDPs, we can always find a stationary policy that is optimal within the space of all stationary and non-stationary policies.

These three claims together show that the Bellman optimality equation reveals the recursive structure of the optimal value function $\Vopt = V^{\piopt}$. There is no policy, stationary or non-stationary, with a value function better than $\Vopt$, for the class of discounted continuing MDPs.
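The first claim can be checked numerically on a small finite MDP: value iteration, being a γ-contraction, converges to the unique fixed point of the Bellman optimality operator. A minimal sketch; the two-state MDP below (`P`, `r`, `gamma`) is made up for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([          # P[a, x, y] = probability of next state y given x, a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.7, 0.3]],
])
r = np.array([          # r[x, a]
    [1.0, 0.0],
    [0.0, 2.0],
])

def bellman_opt(V):
    # (T V)(x) = max_a { r(x, a) + gamma * sum_y P(y | x, a) V(y) }
    return np.max(r + gamma * np.einsum('axy,y->xa', P, V), axis=1)

V = np.zeros(2)
for _ in range(500):    # contraction => geometric convergence to V*
    V = bellman_opt(V)

print(V)                                   # approximately V*
print(np.max(np.abs(bellman_opt(V) - V)))  # fixed-point residual, ~0
```

The residual shrinking to machine precision illustrates both existence and uniqueness: any starting V converges to the same fixed point.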


What do we talk about when we talk about the Bellman Optimality Equation?

If we think carefully, we are (implicitly) making three claims.

#FoundationsOfReinforcementLearning #sneakpeek

08.07.2025 23:07 👍 6 🔁 1 💬 0 📌 0

System is so broken:
- researchers write papers no one reads
- reviewers don't have time to review, shamed to coauthors, use LLMs instead of reading
- authors try to fool said LLMs with prompt injection
- evaling researchers based on # of papers (no time to read)

Dystopic.

07.07.2025 16:15 👍 107 🔁 10 💬 10 📌 5

Congratulations, well deserved!

03.05.2025 11:55 👍 1 🔁 0 💬 0 📌 0

All stick, no carrot

03.05.2025 06:35 👍 4 🔁 0 💬 1 📌 0