Now accepted to EMNLP Main Conference!
Submit your work to #BlackboxNLP 2025!
Excited to spend the rest of the summer visiting @davidbau.bsky.social's lab at Northeastern! If you're in the area and want to chat about interpretability, let me know!
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!
Now hungry for discussing:
- LLM behavior
- Interpretability
- Biases & hallucinations
- Why eval is so hard (but so fun)
Come say hi if that's your vibe too!
10 days to go! Still time to run your method and submit!
Three weeks is plenty of time to submit your method!
What are you working on for the MIB shared task?
Check out the full task description here: blackboxnlp.github.io/2025/task/
New to mechanistic interpretability?
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation
Join the discord server for ideas and discussions: discord.gg/n5uwjQcxPR
In this work we take a step towards understanding and mitigating the vision-language performance gap, but there's still more to explore!
This was an awesome collaboration w/ Yossi Gandelsman and @boknilev.bsky.social, led by Yaniv Nikankin!
Paper and code: technion-cs-nlp.github.io/vlm-circuits...
By simply patching visual data tokens from later layers back into earlier ones, we improve performance by 4.6% on average - closing a third of the gap!
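The back-patching operation itself is simple to sketch. Below is a minimal toy version on nested-list "activations" - the function name, shapes, and layer choices are illustrative assumptions, not the paper's actual setup:

```python
def back_patch(hidden_states, data_positions, src_layer, dst_layer):
    """Copy the activations at the data-token positions from a later
    layer (src_layer) back into an earlier layer (dst_layer).

    hidden_states: per-layer list of [seq_len][d_model] activations.
    """
    patched = [[vec[:] for vec in layer] for layer in hidden_states]
    for pos in data_positions:
        patched[dst_layer][pos] = list(hidden_states[src_layer][pos])
    return patched

# Toy example: 4 layers, 5 positions, 3-dim activations.
hs = [[[float(layer * 10 + pos)] * 3 for pos in range(5)] for layer in range(4)]
out = back_patch(hs, data_positions=[0, 1], src_layer=3, dst_layer=1)
print(out[1][0])  # [30.0, 30.0, 30.0] - the later layer's activation
```

In a real VLM you would do the same overwrite on the residual stream during a second forward pass (e.g. via forward hooks) rather than on stored lists.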
4. Zooming in on data positions, we show that visual representations gradually align with their textual analogs across model layers (also shown by @zhaofeng_wu et al.). We hypothesize this alignment may happen too late for the model to process the information, and fix it with back-patching.
3. Data sub-circuits, however, are modality-specific; swapping them significantly degrades performance. Crucially, this highlights that differences in data processing are a key factor in the performance gap.
2. Structure is only half the story: different circuits can still implement similar logic. We swap sub-circuits between modalities to measure cross-modal faithfulness.
Turns out, query and generation sub-circuits are functionally equivalent, retaining faithfulness when swapped!
1. Circuits for the same task are mostly structurally disjoint, with an average of only 18% of components shared between modalities!
The overlap is extremely low in data and query positions, and moderate only in the generation (last) position.
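A component-overlap number like this is easy to compute once circuits are represented as sets of components. Here is a toy sketch using intersection-over-union; the component labels and the IoU choice are illustrative assumptions, not necessarily the paper's exact metric:

```python
def circuit_overlap(circuit_a, circuit_b):
    """Fraction of shared components between two circuits,
    measured as intersection over union of their component sets."""
    a, b = set(circuit_a), set(circuit_b)
    return len(a & b) / len(a | b)

# Hypothetical component labels (attention heads / MLP neurons):
text_circuit = {"L3.H5", "L7.H2", "L10.mlp.n42", "L11.H0"}
image_circuit = {"L3.H5", "L8.H1", "L10.mlp.n42", "L12.H3", "L12.H7"}
print(circuit_overlap(text_circuit, image_circuit))  # 2 shared of 7 total
```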
We identify circuits (task-specific computational sub-graphs composed of attention heads and MLP neurons) used by VLMs to solve both variants.
What did we find? >>
Consider object counting: we can ask a VLM "how many books are there?" given either an image or a sequence of words. Like Kaduri et al., we consider three types of positions within the input - data (image or word sequence), query ("how many..."), and generation (last token).
VLMs perform better on questions about text than when answering the same questions about images - but why? And how can we fix it?
In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
Working on circuit discovery in LMs?
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!
The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
Have you heard about this year's shared task? 📢
Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
SAEs have been found to massively underperform supervised methods for steering neural networks.
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Thank you! Added to my reading list ☺️
Should work now!
SAEs have sparked a debate over their utility; we hope to add another perspective. Would love to hear your thoughts!
Paper: arxiv.org/abs/2505.20063
Code: github.com/technion-cs-...
Huge thanks to @boknilev.bsky.social and @amuuueller.bsky.social, it's been great working on this project with you!
These findings have practical implications: after filtering out features with low output scores, we see 2-3x improvements for steering with SAEs, making them competitive with supervised methods on AxBench, a recent steering benchmark (Wu and @aryaman.io et al.)
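The filter-then-steer recipe can be sketched in a few lines. Adding a scaled decoder direction to the residual stream is the standard SAE steering move; the threshold value, function names, and toy numbers below are illustrative assumptions:

```python
import math

def steer(residual, decoder_dir, output_score, alpha=4.0, min_score=0.5):
    """Steer by adding a scaled, normalized SAE decoder direction to the
    residual stream - skipped when the feature's output score is low."""
    if output_score < min_score:
        return list(residual)  # low-output-score feature: filtered out
    norm = math.sqrt(sum(x * x for x in decoder_dir))
    return [r + alpha * d / norm for r, d in zip(residual, decoder_dir)]

resid = [0.0, 1.0, 2.0]
direction = [3.0, 0.0, 4.0]  # ||direction|| = 5
print(steer(resid, direction, output_score=0.9))  # [2.4, 1.0, 5.2]
```

With `output_score=0.1` the same call returns the residual unchanged - the filtering step is what keeps poorly-behaved features from degrading steering quality.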
We show that high scores rarely co-occur, and emerge at different layers: features in earlier layers primarily detect input patterns, while features in later layers are more likely to drive the model's outputs, consistent with prior analyses of LLM neuron functionality.
These differences were previously noted (e.g., Durmus et al., see image), but had not been systematically analyzed.
We take an additional step by introducing two simple, efficient metrics to characterize features: the input score and the output score.
In this work we characterize two feature roles: Input features, which mainly capture patterns in the model's input, and output features, those with a human-understandable effect on the model's output.
Steering with each yields very different effects!
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
Logo for MIB: A Mechanistic Interpretability Benchmark
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose MIB: a Mechanistic Interpretability Benchmark!
🚨🚨 New preprint 🚨🚨
Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?
We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.
arxiv.org/abs/2502.14829