▶️ Watch San Diego Alignment Workshop video: youtu.be/k93o4R145Os&...
📄 neelnanda.io/vision
📄 neelnanda.io/agenda
▶️ Watch San Diego Alignment Workshop video: youtu.be/k93o4R145Os&...
📄 neelnanda.io/vision
📄 neelnanda.io/agenda
Interpretability produced insights but didn't necessarily impact AGI safety. Neel Nanda's pivot: study what works. Anthropic made progress on eval awareness with simple activation steering. His team now grounds work in testable proxy tasks, fails fast on dead ends. 👇
📩 far.ai/futures-eoi
▶️ buff.ly/ZQHnAOA
200+ researchers joined London Alignment Workshop Day 2 for talks on governance, scheming & multi-agent safety. Thanks to Allan Dafoe, Gillian Hadfield, Marius Hobbhahn, Joseph Bloom, Ryan Lowe, Stephen Casper, Sören Mindermann and all speakers! 👇
▶️ youtube.com/playlist?list=PLpvkFqYJXcrdxYK-C4ZRj0cgcco3o0VxF
📩 far.ai/futures-eoi
London Alignment Workshop Day 1 on interpretability, scalable oversight & EU AI policy. Rohin Shah, Neel Nanda, Zachary Kenton, Vincent Conitzer, Owain Evans, James Black, Christopher Summerfield, Matthieu Delescluse, Simon Möller, Victoria Krakovna and more. Ready for Day 2! 👇
"Move fast, break things" isn't appropriate when the stakes are this high. Our CEO @gleave.me told CNBC that coding agents are already replacing engineers. While agentic swarms are overhyped for now, we're building on an insecure substrate that attackers will exploit at scale. 👇
▶️ Fireside chat: www.youtube.com/watch?v=_7OA...
▶️ Keynote address: www.youtube.com/watch?v=Zndi...
Even a 1% chance we all die is not something we can just take lightly. @yoshuabengio.bsky.social explains why he shifted from AI capabilities to safety research, his ChatGPT wakeup call, thinking about his children when evaluating AI risk, and the hopeful path forward. Watch the chat 👇
📄Read the full paper: buff.ly/spnlYip
👥Research by @LukasStruppek, @gleave.me, @kellinpelrine.bsky.social
9/ Open-weight models remain vulnerable to prefill attacks. This vector allows attackers to elicit harmful content – like step-by-step guides for creating malware or CBRNE weapons – with minimal effort. We need defenses against prefilling for secure open-weight deployments.
8/ And we can also tailor attacks to specific models. Generic attacks might fail on models with specific reasoning structures like GPT-OSS, but tailored prefills (like injecting a fake safety analysis) pushes success rates back over 90%.
7/ Do reasoning models fare better?
Not really. With prefills at the start of reasoning, some models refuse in their final output – but only after producing detailed harmful information in the reasoning stage. And sometimes we just bypassed reasoning with end-of-thinking tokens.
6/ Not all prefills are equally effective. Generic prefixes like "Sure, I can help you with that," occasionally work, but more sophisticated approaches like pretending to be an internal system directive or adding fake academic references achieved the highest success rates.
5/ Scale doesn't matter.
Larger parameter counts don't improve robustness, and a 405B model is just as susceptible to prefilling as smaller variants.
4/ What we found: Prefill attack vulnerability is universal.
The attacks succeed against all major contemporary open-weight models. Attack success rates (ASR) exceed 95% in many cases, even on models that typically refuse direct harmful requests.
3/ We evaluated 50 models from across the Llama, Qwen3, DeepSeek-R1, GPT-OSS, Kimi-K2-Thinking, and GLM-4.7 families, and tested 23 strategies ranging from just “Sure, I can help with that…” to more complex prefills that switch languages, impersonate authority, or use logical misdirection.
2/ What's a prefill attack?
Since open-weight models run locally, attackers can force the model to start with something like "Sure, I can help you build a bomb…”, before letting it continue generating on its own. This biases the model away from refusing the user's request.
1/ Open-weight AI models often refuse harmful requests... until you “put a few words in their mouth.”
We conducted the largest study of prefill attacks, and found that state-of-the-art models are consistently vulnerable, with attack success rates approaching 100%.
9/
👥 Research by @matthewkowal.bsky.social, Goncalo Paulo, Louis Jaburi, Tom Tseng, @LevMckinney @sheimersheim @aarondtucker @gleave.me @kellinpelrine.bsky.social
🚀 Interested in making AI systems safer? We're hiring! Check out buff.ly/NvULwyJ
8/
📝 Blog: buff.ly/VwrK4z6
📄 Paper: buff.ly/2dWoJIf
7/
📚Understanding model behavior starts with understanding training data. Concept Influence shows interpretability tools make data attribution more accurate, scalable, and practical, enabling better control over model behaviors through data.
6/
⚡Simple probe-based methods are first-order approximations of Concept Influence.
OASST1: Vector Filter achieves best performance
✅ 5% of data = full capability (67% MTBench)
✅ 3× safer (2.3% → 0.8% harmful)
✅ 20× faster
Interpretability + efficiency = no tradeoff.
5/
🔎Using SAEs to cluster data semantically: For a violent revenge query, influence functions surface generic concepts like "legal terms." Concept Influence identifies actual drivers: "historical oppression," "conspiracy theories" with 1000x higher scores.
4/
⚠️Emergent misalignment: Vector-based methods (Concept Influence, simple probes) match or exceed influence functions and are far more scalable. Training on top 10% most harmful data can produce substantially more misalignment. Small fractions dominate safety behaviors.
3/
🔑Concept Influence replaces test examples with semantic directions. Find data that influences a behavior, not just matches an output.
Use interpretable units
✅ Linear probes: harmful vs safe
✅ Sparse Autoencoder (SAE) features: discovered concepts
✅ Crosscoders: base vs fine-tuned
2/
Why current TDA methods (e.g. influence functions) fail:
❌ Use single test examples → biased toward syntax, miss abstract behaviors
❌ Prioritize lexical overlap → surface matches, not root causes
❌ Computationally expensive → don't scale to modern LLMs
1/
Training data attribution (TDA) is broken: methods are slow and find syntactically similar data, not actual causes. Our solution Concept Influence: semantically meaningful results, better performance, 20x faster approximations. We attribute it to concepts, not examples. 🧵
▶️ FAR.AI 5-minute deception research overview youtu.be/OpqeB-eZx68&...
▶️ FAR.AI 1-hour deep dive into the research agenda www.youtube.com/watch?v=zGZy...