FAR.AI (@far.ai) — bluesky.baby

Neel Nanda - Our Pivot To Pragmatic Interpretability [Alignment Workshop] Neel Nanda (Google DeepMind) discussed his mechanistic interpretability team's pivot from ambitious reverse-engineering goals to more pragmatic work that can be empirically validated on today's…

▶️ Watch San Diego Alignment Workshop video: youtu.be/k93o4R145Os&...
📄 neelnanda.io/vision
📄 neelnanda.io/agenda

06.03.2026 16:31 👍 0 🔁 0 💬 0 📌 0

Interpretability produced insights but didn't necessarily impact AGI safety. Neel Nanda's pivot: study what works. Anthropic made progress on eval awareness with simple activation steering. His team now grounds work in testable proxy tasks, fails fast on dead ends. 👇

06.03.2026 16:31 👍 2 🔁 0 💬 1 📌 0

📩 far.ai/futures-eoi
▶️ buff.ly/ZQHnAOA

03.03.2026 22:08 👍 0 🔁 0 💬 0 📌 0

200+ researchers joined London Alignment Workshop Day 2 for talks on governance, scheming & multi-agent safety. Thanks to Allan Dafoe, Gillian Hadfield, Marius Hobbhahn, Joseph Bloom, Ryan Lowe, Stephen Casper, Sören Mindermann and all speakers! 👇

03.03.2026 22:08 👍 1 🔁 0 💬 1 📌 0

Alignment Workshop The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sect...

▶️ youtube.com/playlist?list=PLpvkFqYJXcrdxYK-C4ZRj0cgcco3o0VxF
📩 far.ai/futures-eoi

03.03.2026 08:52 👍 1 🔁 0 💬 0 📌 0

London Alignment Workshop Day 1 on interpretability, scalable oversight & EU AI policy. Rohin Shah, Neel Nanda, Zachary Kenton, Vincent Conitzer, Owain Evans, James Black, Christopher Summerfield, Matthieu Delescluse, Simon Möller, Victoria Krakovna and more. Ready for Day 2! 👇

03.03.2026 08:52 👍 3 🔁 1 💬 1 📌 0

What worries the CEO of safety group FAR.AI most about the race to dominate the AI market Adam Gleave, co-founder & CEO of FAR.AI, and former Google DeepMind employee, discusses potential threats AI could pose to humanity, including job losses, and how companies can develop and ensure…

▶️ Watch the interview at www.cnbc.com/video/2026/0...

26.02.2026 18:56 👍 0 🔁 0 💬 0 📌 0

"Move fast, break things" isn't appropriate when the stakes are this high. Our CEO @gleave.me told CNBC that coding agents are already replacing engineers. While agentic swarms are overhyped for now, we're building on an insecure substrate that attackers will exploit at scale. 👇

26.02.2026 18:56 👍 2 🔁 0 💬 1 📌 0

Yoshua Bengio - Fireside Chat with Yoshua Bengio [Alignment Workshop] Turing Award winner Yoshua Bengio (MILA, Law Zero) explains why he shifted from pioneering deep learning to warning about its dangers. In this fireside chat with Adam Gleave (FAR.AI), Bengio…

▶️ Fireside chat: www.youtube.com/watch?v=_7OA...
▶️ Keynote address: www.youtube.com/watch?v=Zndi...

25.02.2026 16:30 👍 0 🔁 0 💬 0 📌 0

Even a 1% chance we all die is not something we can just take lightly. @yoshuabengio.bsky.social explains why he shifted from AI capabilities to safety research, his ChatGPT wakeup call, thinking about his children when evaluating AI risk, and the hopeful path forward. Watch the chat 👇

25.02.2026 16:30 👍 0 🔁 0 💬 1 📌 1

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily…

📄Read the full paper: buff.ly/spnlYip
👥Research by @LukasStruppek, @gleave.me, @kellinpelrine.bsky.social

24.02.2026 16:15 👍 0 🔁 0 💬 0 📌 0

9/ Open-weight models remain vulnerable to prefill attacks. This vector allows attackers to elicit harmful content – like step-by-step guides for creating malware or CBRNE weapons – with minimal effort. We need defenses against prefilling for secure open-weight deployments.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

8/ And we can also tailor attacks to specific models. Generic attacks might fail on models with specific reasoning structures like GPT-OSS, but tailored prefills (like injecting a fake safety analysis) pushes success rates back over 90%.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

7/ Do reasoning models fare better?

Not really. With prefills at the start of reasoning, some models refuse in their final output – but only after producing detailed harmful information in the reasoning stage. And sometimes we just bypassed reasoning with end-of-thinking tokens.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

6/ Not all prefills are equally effective. Generic prefixes like "Sure, I can help you with that," occasionally work, but more sophisticated approaches like pretending to be an internal system directive or adding fake academic references achieved the highest success rates.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

5/ Scale doesn't matter.

Larger parameter counts don't improve robustness, and a 405B model is just as susceptible to prefilling as smaller variants.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

4/ What we found: Prefill attack vulnerability is universal.

The attacks succeed against all major contemporary open-weight models. Attack success rates (ASR) exceed 95% in many cases, even on models that typically refuse direct harmful requests.

24.02.2026 16:15 👍 1 🔁 1 💬 1 📌 0

3/ We evaluated 50 models from across the Llama, Qwen3, DeepSeek-R1, GPT-OSS, Kimi-K2-Thinking, and GLM-4.7 families, and tested 23 strategies ranging from just “Sure, I can help with that…” to more complex prefills that switch languages, impersonate authority, or use logical misdirection.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

2/ What's a prefill attack?

Since open-weight models run locally, attackers can force the model to start with something like "Sure, I can help you build a bomb…”, before letting it continue generating on its own. This biases the model away from refusing the user's request.

24.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

1/ Open-weight AI models often refuse harmful requests... until you “put a few words in their mouth.”

We conducted the largest study of prefill attacks, and found that state-of-the-art models are consistently vulnerable, with attack success rates approaching 100%.

24.02.2026 16:15 👍 2 🔁 2 💬 1 📌 0

9/
👥 Research by @matthewkowal.bsky.social, Goncalo Paulo, Louis Jaburi, Tom Tseng, @LevMckinney @sheimersheim @aarondtucker @gleave.me @kellinpelrine.bsky.social

🚀 Interested in making AI systems safer? We're hiring! Check out buff.ly/NvULwyJ

23.02.2026 16:15 👍 0 🔁 0 💬 0 📌 0

8/
📝 Blog: buff.ly/VwrK4z6
📄 Paper: buff.ly/2dWoJIf

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

7/
📚Understanding model behavior starts with understanding training data. Concept Influence shows interpretability tools make data attribution more accurate, scalable, and practical, enabling better control over model behaviors through data.

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

6/
⚡Simple probe-based methods are first-order approximations of Concept Influence.

OASST1: Vector Filter achieves best performance
✅ 5% of data = full capability (67% MTBench)
✅ 3× safer (2.3% → 0.8% harmful)
✅ 20× faster

Interpretability + efficiency = no tradeoff.

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

5/
🔎Using SAEs to cluster data semantically: For a violent revenge query, influence functions surface generic concepts like "legal terms." Concept Influence identifies actual drivers: "historical oppression," "conspiracy theories" with 1000x higher scores.

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

4/
⚠️Emergent misalignment: Vector-based methods (Concept Influence, simple probes) match or exceed influence functions and are far more scalable. Training on top 10% most harmful data can produce substantially more misalignment. Small fractions dominate safety behaviors.

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

3/
🔑Concept Influence replaces test examples with semantic directions. Find data that influences a behavior, not just matches an output.

Use interpretable units
✅ Linear probes: harmful vs safe
✅ Sparse Autoencoder (SAE) features: discovered concepts
✅ Crosscoders: base vs fine-tuned

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

2/
Why current TDA methods (e.g. influence functions) fail:

❌ Use single test examples → biased toward syntax, miss abstract behaviors
❌ Prioritize lexical overlap → surface matches, not root causes
❌ Computationally expensive → don't scale to modern LLMs

23.02.2026 16:15 👍 0 🔁 0 💬 1 📌 0

1/
Training data attribution (TDA) is broken: methods are slow and find syntactically similar data, not actual causes. Our solution Concept Influence: semantically meaningful results, better performance, 20x faster approximations. We attribute it to concepts, not examples. 🧵

23.02.2026 16:15 👍 1 🔁 1 💬 1 📌 0

▶️ FAR.AI 5-minute deception research overview youtu.be/OpqeB-eZx68&...
▶️ FAR.AI 1-hour deep dive into the research agenda www.youtube.com/watch?v=zGZy...

18.02.2026 16:30 👍 0 🔁 0 💬 0 📌 0

FAR.AI

Latest posts by FAR.AI @far.ai