
Norman Mu

@normanmu.com

prev: safety lead xAI, Berkeley EECS PhD

176 Followers · 263 Following · 14 Posts · Joined 03.11.2023

Latest posts by Norman Mu @normanmu.com

A Closer Look at System Prompt Robustness System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output forma...

Lots more in our paper (arxiv.org/abs/2502.12197) and code (github.com/normster/Rea...)

19.02.2025 06:06 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Reasoning models (o3-mini and R1) seem highly effective for more retrieval-bottlenecked prompts (i.e. forgetting relevant guardrails), less so for adversarial inputs/prompt injections. Definitely an exciting direction to explore further.

19.02.2025 06:06 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Standard training techniques, like good data curation and SFT -> DPO, work reasonably well, and the pass/fail nature of guardrail adherence enables the use of tricks like classifier-free guidance/contrastive decoding to further improve performance

19.02.2025 06:06 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
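The classifier-free guidance idea mentioned above can be illustrated with a toy sketch: compute logits once with the system prompt and once without, then extrapolate away from the unconditioned logits to amplify the system prompt's effect. The logits and guidance weight below are made-up values for illustration, not from any real model:

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, gamma=1.5):
    """Classifier-free guidance for decoding: extrapolate from the
    logits without the system prompt (uncond) toward the logits with
    it (cond). gamma > 1 amplifies the system prompt's influence."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)

# Toy 4-token vocabulary; logits are illustrative placeholders.
with_sys = np.array([2.0, 0.5, -1.0, 0.0])     # model sees the system prompt
without_sys = np.array([1.0, 1.0, -1.0, 0.0])  # system prompt removed

guided = cfg_logits(with_sys, without_sys, gamma=2.0)
print(guided)  # [ 3.  0. -1.  0.] -- token 0, favored by the system prompt, is boosted
```

With gamma = 1 this reduces to ordinary decoding with the system prompt; larger gamma pushes harder on whatever the system prompt changes about the distribution.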

RealGuardrails is our new dataset to 1) evaluate system prompt robustness on realistic prompts scraped from the ChatGPT store, and 2) evaluate methods for improving open-weight models like Llama 3

19.02.2025 06:06 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

System prompts are a critical control plane for LLMs/AI agents, but models vary widely in their robustness. We found a "complexity wall" at ~10 guardrails where prompt adherence declines rapidly even on totally benign inputs

19.02.2025 06:06 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I learned about how legal journals work earlier this year and have wondered if it could work for ML/AI: reviewing becomes a way for students to distinguish themselves rather than a chore

13.12.2024 19:12 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

mind-boggling that ByteDance is 1) suing the author for damages and sabotage and 2) keeping their name on the paper/award without retracting it

11.12.2024 08:31 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Ethical Challenges Related to the NeurIPS 2024 Best Paper Award

absolutely wild allegations being shared about the author of the NeurIPS 2024 best paper winner: var-integrity-report.github.io.

really feels like the bottom is dropping out from academic ML research with the amount of brazen dishonesty

11.12.2024 08:26 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Going off FLOP/s and power, looks like these are very roughly 3/4 of an H100?

03.12.2024 20:58 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Amazon’s AI Self Sufficiency | Trainium2 Architecture & Networking Amazon is currently conducting one of the largest build-outs of AI clusters globally, deploying a considerable number of Hopper and Blackwell GPUs. In addition to a massive Capex invested into Nvidi…

"AWS is currently deploying a cluster with 200k+ Trainium2 chips for Anthropic called 'Project Rainier'" semianalysis.com/2024/12/03/a...

03.12.2024 20:56 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

top loss curves do look sketchy but idk how bad this is for an RL task. bottom curves don't obviously asymptote but they also shouldn't with learning rate decay (which the open source release seems to use? github.com/google-resea...)

01.12.2024 09:12 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

my takeaway from Vighnesh's writeup: google's main argument about under-training/lack of pre-training isn't super convincing. convergence of train loss is not always desirable. early stopping can help, and their own paper shows mixed results on importance of pre-training

01.12.2024 09:03 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Reinforcement Learning for Chip Floorplanning: The Saga

i've been confusedly semi-following the RL chip design/AlphaChip story for a few years trying to understand what the core disagreement was, and finally found a summary that seems to collate and explain all the relevant artifacts

01.12.2024 08:45 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I’d like to join!

15.01.2024 02:55 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0