
Cassidy Laidlaw

@cassidylaidlaw

PhD student at UC Berkeley studying RL and AI safety. https://cassidylaidlaw.com

802 Followers · 58 Following · 20 Posts · Joined 25.11.2024

Latest posts by Cassidy Laidlaw @cassidylaidlaw

AssistanceZero: Scalably Solving Assistance Games Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for dec...

Check out the full paper for more details. And download our code to play with the assistant!
arxiv.org/abs/2504.07091
github.com/cassidylaidl...

11.04.2025 22:17 πŸ‘ 7 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

We conclude our paper with a vision of how AssistanceZero could be applied to post-training of LLMs. We think that our approach could remove incentives for deception and other unsafe behavior in LLMs and make them more helpful. We may or may not be already working on this 😉

11.04.2025 22:17 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Real human users rate our AssistanceZero assistant much higher than one trained via a pretraining+SFT pipeline! And, it enables people to build houses while placing fewer blocks than building alone.

11.04.2025 22:17 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Video thumbnail

Our new RL algorithm, AssistanceZero, trains an assistant that displays emergent helpful behaviors like *active learning* and *learning from corrections*.

11.04.2025 22:17 πŸ‘ 6 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Video thumbnail

In Minecraft, we use an assistance game formulation where a simulated human is given random houses to build, and an AI assistant learns via RL to help the human out. The assistant can't see the goal house, so it has to predict the goal and maintain uncertainty to be helpful.
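As a toy illustration of the goal-prediction idea, here is a minimal sketch (not from the AssistanceZero codebase; the goal set, probabilities, and function names are all invented) of an assistant maintaining a Bayesian belief over a hidden goal from the user's actions:

```python
import random

# Toy assistance game: the user knows a secret goal block type; the
# assistant infers it from the user's placements. Everything here is
# hypothetical and only illustrates belief updating under uncertainty.
GOALS = ["wood", "stone", "brick"]

def user_act(goal):
    # Simulated user mostly places goal blocks, occasionally errs.
    return goal if random.random() < 0.9 else random.choice(GOALS)

def update_belief(belief, observed_block):
    # Bayes rule: P(goal | obs) proportional to P(obs | goal) * P(goal),
    # using an approximate likelihood for the user's noisy behavior.
    likelihood = {g: (0.9 if g == observed_block else 0.05) for g in GOALS}
    posterior = {g: likelihood[g] * belief[g] for g in GOALS}
    z = sum(posterior.values())
    return {g: p / z for g, p in posterior.items()}

belief = {g: 1 / len(GOALS) for g in GOALS}  # uniform prior over goals
random.seed(0)
for _ in range(5):
    belief = update_belief(belief, user_act("stone"))
print(max(belief, key=belief.get))  # the belief concentrates on "stone"
```

Acting to maximize expected helpfulness under such a posterior, rather than committing to one guess, is what lets an assistant stay useful while the goal is still ambiguous.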

11.04.2025 22:17 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Unlike RLHF, assistance games explicitly treat the user-assistant interaction as a two-player game, where the user knows their goal but the assistant doesn't. Assistance games model *communication* about the goal from the user to the assistant and *collaboration* between them to achieve it.

11.04.2025 22:17 πŸ‘ 5 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

A better assistant would maintain *uncertainty* about its goal and ask clarification questions until it really understood, leading to a better solution. Assistance games can enable this.

11.04.2025 22:17 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

RLHF is great but it encourages short-term optimization: trying to solve the user's entire problem in a single response. For example, if you ask ChatGPT to "clean up some disk space," it will immediately give you a program to run without asking which files are okay to delete!

11.04.2025 22:17 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Video thumbnail

We built an AI assistant that plays Minecraft with you.
Start building a house, and it figures out what you're doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

11.04.2025 22:17 πŸ‘ 26 πŸ” 4 πŸ’¬ 1 πŸ“Œ 2
Preview
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizi...

Our work provides a more principled step towards preventing reward hacking and ensuring the safety of increasingly powerful AI. Check out the paper for all the details!

arxiv.org/abs/2403.03185

Joint with @shivs01.bsky.social and Anca Dragan

19.12.2024 17:17 πŸ‘ 4 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Action distribution and occupancy measure regularization are equivalent for most of today's RLHF implementations (which are effectively contextual bandits). However, once LLMs are optimized for multi-turn interaction or tool use this will no longer be the case.
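For intuition, here is a small numerical check (a sketch with made-up distributions, not code from the paper) that in the one-step contextual-bandit case the two regularizers coincide, since the occupancy measure factors as d(s, a) = p(s) · π(a|s) and the context distribution is shared:

```python
import numpy as np

# In a contextual bandit (one step, fixed context distribution), the
# occupancy measure factors as d(s, a) = p(s) * pi(a|s), so a divergence
# over occupancy measures reduces to an expected divergence over actions.
rng = np.random.default_rng(0)
p_s = rng.dirichlet(np.ones(4))             # context distribution p(s)
pi = rng.dirichlet(np.ones(3), size=4)      # policy pi(a|s)
pi_ref = rng.dirichlet(np.ones(3), size=4)  # reference policy

d = p_s[:, None] * pi          # occupancy measure of pi
d_ref = p_s[:, None] * pi_ref  # occupancy measure of the reference policy

kl_occupancy = np.sum(d * np.log(d / d_ref))
kl_actions = np.sum(p_s * np.sum(pi * np.log(pi / pi_ref), axis=1))
print(np.isclose(kl_occupancy, kl_actions))  # prints True
```

The p(s) factors cancel inside the log, which is exactly why the distinction only starts to matter once trajectories span multiple steps.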

19.12.2024 17:17 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Experiments show that χ² occupancy measure regularization outperforms KL action distribution regularization in all the environments we study! Our regularization scheme allows for larger improvements in true reward compared to base policies while preventing reward hacking.

19.12.2024 17:17 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Regularization is already used to prevent reward hacking in RLHF, but our theory suggests two key changes: regularize based on occupancy measures rather than action distributions, and use χ² divergence instead of KL divergence.

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Our definition also leads to a principled method for preventing reward hacking: regularize optimization to the base policy based on χ² occupancy measure divergence. We prove that this regularized objective gives a lower bound on improvement in the true reward.
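A minimal sketch of that objective in the finite state-action case (toy numbers invented for illustration, not the paper's implementation):

```python
import numpy as np

# Sketch of the regularized objective: maximize proxy reward minus a
# chi-squared penalty on deviation from the base policy's occupancy measure.
def chi2_divergence(d_pi, d_base):
    # D_chi2(d_pi || d_base) = sum over (s, a) of (d_pi - d_base)^2 / d_base
    return np.sum((d_pi - d_base) ** 2 / d_base)

def regularized_objective(d_pi, d_base, proxy_reward, lam):
    return np.dot(d_pi, proxy_reward) - lam * chi2_divergence(d_pi, d_base)

d_base = np.array([0.25, 0.25, 0.25, 0.25])  # base policy's occupancy measure
proxy = np.array([1.0, 0.5, 0.0, 10.0])      # last entry is over-valued by the proxy
d_hack = np.array([0.01, 0.01, 0.01, 0.97])  # piles onto the over-valued entry
d_mild = np.array([0.35, 0.30, 0.15, 0.20])  # stays close to the base policy

# Without the penalty the "hacking" occupancy wins; with it, the mild one does.
for d in (d_hack, d_mild):
    print(np.dot(d, proxy), regularized_objective(d, d_base, proxy, lam=4.0))
```

The χ² penalty grows quadratically as the policy concentrates mass on state-action pairs the base policy rarely visits, which is where the proxy can no longer be trusted.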

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

We define reward hacking as when optimizing a proxy breaks the correlation, resulting in lower true reward than the base policy. Our definition captures intuitive cases of reward hacking in realistic environments, including RLHF, traffic control, and glucose monitoring.
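The definition can be illustrated with a toy numerical sketch (all data invented for illustration, not from the paper's experiments):

```python
import numpy as np

# Toy check of the definition: the proxy is "reasonable" because it
# correlates with the true reward on states the base policy visits, and
# reward hacking occurs when optimizing the proxy drops true reward
# below the base policy's.
rng = np.random.default_rng(0)
true_r = rng.normal(size=100)
proxy_r = true_r + 0.3 * rng.normal(size=100)  # correlated on base states
proxy_r[0] = 50.0    # one off-distribution state the proxy over-values
true_r[0] = -10.0    # ...which is actually terrible under the true reward

base_visits = np.arange(1, 100)  # the base policy never reaches state 0
corr = np.corrcoef(true_r[base_visits], proxy_r[base_visits])[0, 1]
print(f"correlation under base policy: {corr:.2f}")

base_true = true_r[base_visits].mean()
hacked_true = true_r[0]          # a proxy-optimal policy sits in state 0
print(hacked_true < base_true)   # reward hacking under this definition
```

The point of the sketch: the correlation holds on-distribution, so the proxy looks fine until optimization pushes the policy somewhere the base policy never goes.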

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

We argue that a good proxy *correlates* with the true reward for states and actions sampled from some reasonable "base policy." For example, in RLHF a natural base policy is the SFT policy.

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

However, formally defining reward hacking is tricky because we have to define what makes a proxy reward "reasonable." If we optimize a reward function that's totally unrelated to our objective, then it's unsurprising that it doesn't work and it arguably isn't "reward hacking."

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Reward hacking is when we optimize a reward function that seems reasonable, but it ceases to be a good proxy and we end up with a policy that performs poorly under the unknown "true" reward function. It's ubiquitous because real-world objectives are really hard to specify.

19.12.2024 17:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Video thumbnail

When RLHFed models engage in "reward hacking" it can lead to unsafe/unwanted behavior. But there isn't a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

19.12.2024 17:17 πŸ‘ 8 πŸ” 3 πŸ’¬ 2 πŸ“Œ 0

Thanks for the shoutout! And for giving me a reason to finally get on bluesky 🙂

25.11.2024 02:12 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
We introduce the effective horizon, a property of MDPs that controls how difficult RL is. Our analysis is motivated by Greedy Over Random Policy (GORP), a simple Monte Carlo planning algorithm (left) that exhaustively explores action sequences of length k and then uses m random rollouts to evaluate each leaf node. The effective horizon combines both k and m into a single measure. We prove sample complexity bounds based on the effective horizon that correlate closely with the real performance of PPO, a deep RL algorithm, on our BRIDGE dataset of 155 deterministic MDPs (right).
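The GORP procedure described in that caption can be sketched as follows, on a tiny deterministic chain MDP (the environment, rewards, and parameters here are illustrative, not one of the BRIDGE MDPs):

```python
import itertools
import random

# Sketch of Greedy Over Random Policy (GORP) on a toy deterministic
# chain MDP: action 1 moves right, action 0 resets to the start, and
# reward equals the index of the next state, so progress pays off.
N_STATES, ACTIONS, HORIZON = 6, [0, 1], 8

def step(state, action):
    nxt = min(state + 1, N_STATES - 1) if action == 1 else 0
    return nxt, float(nxt)

def rollout_value(state, t):
    # Estimate a leaf's value with one random rollout to the horizon.
    total = 0.0
    while t < HORIZON:
        state, r = step(state, random.choice(ACTIONS))
        total += r
        t += 1
    return total

def gorp_action(state, t, k=2, m=20):
    # Exhaustively score every length-k action sequence, evaluating each
    # leaf with the mean of m random rollouts; return the first action
    # of the best-scoring sequence.
    best_seq, best_val = None, -float("inf")
    for seq in itertools.product(ACTIONS, repeat=k):
        s, val, tt = state, 0.0, t
        for a in seq:
            s, r = step(s, a)
            val += r
            tt += 1
        val += sum(rollout_value(s, tt) for _ in range(m)) / m
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq[0]

random.seed(0)
state, ret = 0, 0.0
for t in range(HORIZON):
    state, r = step(state, gorp_action(state, t))
    ret += r
print(ret)
```

Roughly speaking, environments where small k and m suffice for GORP to act near-optimally have a short effective horizon, and that is the regime where the paper's bounds predict deep RL should also do well.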

Kind of a broken record here but proceedings.neurips.cc/paper_files/...
is totally fascinating in that it postulates two underlying, measurable structures that you can use to assess if RL will be easy or hard in an environment

23.11.2024 18:18 πŸ‘ 151 πŸ” 28 πŸ’¬ 8 πŸ“Œ 2