
Shauli Ravfogel

@shauli

Faculty fellow at NYU CDS. Previously: PhD @ BIU NLP.

764 Followers · 351 Following · 44 Posts · Joined 07.10.2023

Latest posts by Shauli Ravfogel @shauli

I’ll be at NeurIPS in San Diego presenting this paper during the Wednesday, Dec 3 poster session (11 am – 2 pm PST) & at the mechanistic interpretability workshop on Sunday (spotlight). Come say hi, and feel free to DM if you’d like to talk research or just catch up!

01.12.2025 15:10 👍 8 🔁 1 💬 0 📌 0

Are there existing theories that explain the subjective aspect of consciousness (“qualia”)? To my understanding, most “theories of consciousness” don’t even attempt to do so (either aiming to explain other aspects of consciousness, or claiming that qualia don’t “really” exist, etc.).

02.11.2025 02:15 👍 1 🔁 0 💬 1 📌 0

These findings show, at best, that training data can make an LM act as if it has conscious experience (unsurprising). They don’t support the claim that the model actually has subjective experience, because no mechanistic intervention can establish that (I think you’re already hinting at this).

01.11.2025 15:26 👍 3 🔁 0 💬 1 📌 0

Beyond truth-encoding, toy models can be adapted to study other fascinating phenomena, like the representation of other concepts in transformers, and the way models use the structure of their representations to develop nontrivial capabilities. Arxiv: arxiv.org/pdf/2510.15804

24.10.2025 15:19 👍 3 🔁 0 💬 0 📌 0

We show that a similar phenomenon is reproduced when using natural language data instead of our synthetic data, although in “real” pretrained LMs, the layer normalization doesn’t seem to play as important a role in inducing linear separability.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0

*Assuming* this structure, we prove that (1) the model represents truth linearly, and (2) confidence on the false sequence is decreased. This mechanism relies on a difference in norm between true and false sequences, which the layer normalization translates into linear separability.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0
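As a minimal numpy sketch of this mechanism (my own illustration, not the paper's code): assume every representation carries the same fixed component along one coordinate, and the two classes differ only in the norm of the content orthogonal to it. Before layer norm that coordinate carries no class signal; after layer norm it is rescaled by the inverse of the per-example norm, so a linear threshold separates the classes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500

def sample(content_norm):
    """Random content orthogonal to coordinate 0, plus a fixed unit
    component on coordinate 0 shared by every example."""
    x = rng.normal(size=(n, d))
    x[:, 0] = 0.0
    x *= content_norm / np.linalg.norm(x, axis=1, keepdims=True)
    x[:, 0] = 1.0
    return x

h_true, h_false = sample(4.0), sample(8.0)   # classes differ only in norm

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=1, keepdims=True)
    sd = h.std(axis=1, keepdims=True)
    return (h - mu) / (sd + eps)

# Before LN: coordinate 0 is exactly 1.0 for both classes (no linear signal,
# and nested shells of different radii are not linearly separable).
print(h_true[:, 0].mean(), h_false[:, 0].mean())

# After LN: dividing by the per-example std rescales coordinate 0 by the
# inverse pre-LN norm, so a simple threshold on it separates the classes.
z_true, z_false = layer_norm(h_true), layer_norm(h_false)
print(z_true[:, 0].mean(), z_false[:, 0].mean())
```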

An analysis of the first gradient steps also predicts the gradual emergence of this structure, with the block corresponding to memorization appearing first, followed by the block that gives rise to linear separability and confidence modulation.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0

To understand this emergence, we study the structure of the attention matrix. It turns out to be highly structured, with blocks corresponding to mapping subjects to attributes (lower left), attributes to subjects (upper middle), and several others.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0
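One generic way to surface this kind of block structure (a diagnostic sketch, not the paper's analysis code; the attention matrix and role labels are assumed inputs): average the attention weights over all query/key position pairs that share a given pair of roles.

```python
import numpy as np

def block_summary(attn, roles):
    """attn: (T, T) attention matrix (e.g. averaged over a batch);
    roles: length-T labels such as 'subj' / 'attr' (names illustrative).
    Returns the mean attention weight per (query role, key role) pair."""
    labels = sorted(set(roles))
    T = len(roles)
    out = np.zeros((len(labels), len(labels)))
    for i, qr in enumerate(labels):
        for j, kr in enumerate(labels):
            q = [t for t in range(T) if roles[t] == qr]
            k = [t for t in range(T) if roles[t] == kr]
            out[i, j] = attn[np.ix_(q, k)].mean()
    return labels, out

# Large entries correspond to blocks such as "attribute positions
# attending back to subject positions".
```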

False sequences contain a uniform attribute (instead of the memorized one), so the ideal behavior on them is to output a uniform guess. We see that the probability the model allocates to the memorized attribute starts dropping *exactly* when the linear truth signal emerges.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0

We train a simplified transformer model on this task, with a single layer and a single attention head. We see an abrupt emergence of linear separability of the hidden representations by factuality.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0
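A model of this kind fits in a few lines of PyTorch. This is a generic sketch with illustrative hyperparameters, not the paper's exact architecture or training setup:

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    """One transformer layer, one attention head, next-token prediction."""
    def __init__(self, vocab_size, d_model=64, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):                      # ids: (batch, T)
        T = ids.shape[1]
        h = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=ids.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=causal)  # causal self-attention
        h = self.ln(h + a)                           # residual + LayerNorm
        return self.out(h)                           # next-token logits
```

Probing the post-LayerNorm states `h` for linear separability by factuality mirrors the analysis described in the thread.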

We start with a “truth co-occurrence” hypothesis: false assertions tend to co-occur. Thus, LMs are incentivized to infer a latent truth variable to reduce future loss. We define a factual recall task where each instance contains two sequences whose truthfulness is correlated.

24.10.2025 15:19 👍 0 🔁 0 💬 1 📌 0
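A sketch of how such data could be generated (my reading of the setup; the exact parameterization surely differs): a fixed subject-to-attribute table defines the memorized facts, and each instance samples two (subject, attribute) sequences whose truth labels agree with high probability.

```python
import random

N_SUBJ, N_ATTR = 50, 50
table = {s: random.randrange(N_ATTR) for s in range(N_SUBJ)}  # memorized facts

def make_instance(p_agree=0.9):
    """Two sequences; their truth labels are correlated via p_agree."""
    first_true = random.random() < 0.5
    second_true = first_true if random.random() < p_agree else not first_true
    seqs = []
    for is_true in (first_true, second_true):
        s = random.randrange(N_SUBJ)
        # false sequences replace the memorized attribute with a uniform one
        a = table[s] if is_true else random.randrange(N_ATTR)
        seqs.append((s, a))
    return seqs, (first_true, second_true)
```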

New NeurIPS paper! Why do LMs represent concepts linearly? We focus on LMs’ tendency to linearly separate true and false assertions, and provide an analysis of the truth circuit in a toy model. A joint work with Gilad Yehudai, @tallinzen.bsky.social, Joan Bruna and @albertobietti.bsky.social.

24.10.2025 15:19 👍 25 🔁 5 💬 1 📌 1

8/8 We advocate moving from benchmark- to skill-centric evaluation. Modeling latent structure clarifies strengths and gaps in evaluation, guiding dataset design and LLM development.

Shout-out to the first author Aviya Maimon for her principled, and at times painstaking, work!

31.07.2025 12:37 👍 1 🔁 0 💬 0 📌 0

7/8 This factor space lets us (i) flag benchmarks that add little new information, (ii) predict a new model’s full profile from a small task subset, and (iii) choose the best model for an unseen task with minimal trials.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0
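Point (ii) has a simple linear-algebra reading. A hedged sketch (the standardization assumption and all names are mine, not the paper's): estimate the new model's factor scores by least squares from the observed subset of task scores, then project back through the loadings to fill in the rest.

```python
import numpy as np

def predict_profile(loadings, observed_idx, observed_scores):
    """loadings: (n_tasks, n_factors) from the factor analysis;
    observed_scores: the new model's (standardized) scores on a task subset.
    Returns predicted scores on all tasks."""
    f, *_ = np.linalg.lstsq(loadings[observed_idx], observed_scores, rcond=None)
    return loadings @ f

# With 44 tasks and 8 factors, a well-chosen subset of ~10 tasks is enough
# to pin down the factor scores f, and hence the full profile.
```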

6/8 From this analysis, eight main latent skills emerge—e.g., General NLU, Long-document comprehension, Precision-sensitive answers—allowing each model to receive a more fine-grained skill profile instead of one aggregated number.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0

5/8 Using Principal-Axis Factoring, we separate shared variance (true shared skills) from task-specific noise, yielding an interpretable low-dimensional latent space.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0
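For concreteness, here is a textbook implementation of principal-axis factoring (a generic sketch, not the paper's code). The key move is placing communality estimates on the diagonal of the correlation matrix, so only shared variance is factored, iterating until the communalities converge.

```python
import numpy as np

def principal_axis_factoring(R, n_factors, n_iter=100, tol=1e-6):
    """R: (p, p) correlation matrix. Returns (p, n_factors) loadings."""
    # initial communalities: squared multiple correlations (SMC)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                   # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rr)
        top = np.argsort(vals)[::-1][:n_factors]   # largest eigenvalues
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
        h2_new = (loadings ** 2).sum(axis=1)       # updated communalities
        if np.max(np.abs(h2_new - h2)) < tol:
            break
        h2 = h2_new
    return loadings

# e.g. R = np.corrcoef(score_matrix, rowvar=False) over the 44 tasks
# (score_matrix is a hypothetical name), typically followed by a rotation
# (varimax/promax) for interpretability.
```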

4/8 We built a 60-model × 44-task performance matrix, harmonizing diverse metrics onto a 0–10 scale. This forms the empirical foundation for our analysis. Our goal is to identify a small set of underlying latent model capabilities that explain these scores in a data-driven way.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0
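The harmonization could be as simple as the following min-max sketch (the thread does not specify the exact scheme, so the choice of floor and ceiling per metric is an assumption):

```python
import numpy as np

def harmonize(scores, lo=None, hi=None):
    """Map one task's raw metric onto a common 0-10 scale.
    lo/hi default to the observed range; a known floor
    (e.g. chance accuracy) could be passed instead."""
    scores = np.asarray(scores, dtype=float)
    lo = scores.min() if lo is None else lo
    hi = scores.max() if hi is None else hi
    return 10.0 * (scores - lo) / (hi - lo)

# applied column-wise to the 60 x 44 matrix M:
# M10 = np.apply_along_axis(harmonize, 0, M)
```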

3/8 Drawing on psychometrics, we treat every benchmark item as a test question and apply Exploratory Factor Analysis to uncover the latent abilities those tasks only approximate.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0

2/8 Today’s practice relies on partially redundant benchmarks that aim to measure latent, underlying capabilities. Evaluation usually collapses them into a single average, obscuring a model’s real strengths and weaknesses.

31.07.2025 12:37 👍 0 🔁 0 💬 1 📌 0

1/8 Happy to share our new paper—“IQ Test for LLMs”—co-authored with Aviya Maimon, Amir DN Cohen, @neurogal.bsky.social and Reut Tsarfaty. We propose to rethink how language models are evaluated by focusing on the latent capabilities that explain benchmark results.
Arxiv: arxiv.org/pdf/2507.20208

31.07.2025 12:37 👍 4 🔁 0 💬 1 📌 1

I’ll be at #ACL2025! If you’re around and want to catch up or chat, please ping me!

26.07.2025 18:05 👍 7 🔁 0 💬 0 📌 0

How well can LLMs understand tasks with complex sets of instructions? We investigate through the lens of RELIC: REcognizing (formal) Languages In-Context, finding a significant overhang between what LLMs are able to do theoretically and how well they put this into practice.

09.06.2025 18:02 👍 5 🔁 2 💬 1 📌 0
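As a toy illustration of the task format (my own construction; RELIC's actual grammars and prompts are richer): present a formal grammar in-context, then query membership of a string in its language.

```python
import random

RULES = {"S": ["a S b", "a b"]}          # generates { a^n b^n : n >= 1 }

def sample(sym="S"):
    """Sample a string from the grammar by recursively expanding rules."""
    if sym not in RULES:
        return sym
    return " ".join(sample(t) for t in random.choice(RULES[sym]).split())

string = sample()                        # a positive example
prompt = (
    "Grammar:\n  S -> a S b | a b\n"
    f"Is the string '{string}' generated by this grammar? Answer yes or no."
)
```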

Introduction / updates post (better late than never):
- I recently graduated from ELSC @hebrewuniversity.bsky.social 🎉
- Moved to NYC 🗽 (view from our balcony below 👇)
- And started a postdoc at @columbiauniversity.bsky.social

Pls PM if you are in the NYC area and want to talk (or have a beer 🍻)

28.03.2025 23:56 👍 7 🔁 2 💬 3 📌 0

Congrats!!

27.03.2025 02:36 👍 1 🔁 0 💬 0 📌 0

Yes (:

10.03.2025 16:28 👍 2 🔁 0 💬 0 📌 0

There's still more to explore—especially in making the reconstruction process more faithful and generalizable across data distributions—but we see this as a promising step toward improving the interpretability of interventions in natural language.

Arxiv: arxiv.org/pdf/2402.11355 (6/6)

12.02.2025 15:19 👍 4 🔁 0 💬 0 📌 1

We also demonstrate that the generated counterfactuals can be used for data augmentation, helping the model become more invariant to sensitive features. (5/6)

12.02.2025 15:19 👍 1 🔁 0 💬 1 📌 0
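The augmentation itself can be as simple as this sketch (the `make_counterfactual` interface is an illustrative stand-in for the inversion-based generation step): each example is paired with its counterfactual under the same task label, pushing the classifier toward invariance.

```python
def augment(dataset, make_counterfactual):
    """dataset: iterable of (text, label) pairs."""
    out = []
    for text, label in dataset:
        out.append((text, label))
        out.append((make_counterfactual(text), label))  # e.g. gender-flipped
    return out
```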

Focusing on intervening in the gender concept, we find that the intervention causes changes beyond explicit markers such as pronouns, demonstrating *which* biases are encoded in the representation, and explicitly presenting them in natural language. (4/6)

12.02.2025 15:19 👍 1 🔁 0 💬 1 📌 0

The approach is straightforward: we build on the representation inversion technique proposed by Morris et al., which maps hidden representations back to text. By applying this method after an intervention, we can attribute changes in the reconstructed text to the effects of that intervention. (3/6)

12.02.2025 15:19 👍 2 🔁 0 💬 1 📌 0
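Schematically, the pipeline is: intervene on the hidden states, then decode them back to text and read off what changed. In this sketch the intervention shown is a generic linear concept-erasure projection, and `invert_to_text` is a hypothetical stand-in for the Morris et al. inversion model (not its real API):

```python
import numpy as np

def erase_direction(H, v):
    """Project hidden states H (n, d) off the concept direction v."""
    v = v / np.linalg.norm(v)
    return H - np.outer(H @ v, v)

# v could be, e.g., the mean difference between representations of
# female- and male-marked texts (one simple way to estimate a direction).
# H_edit = erase_direction(H, v)
# texts  = invert_to_text(H_edit)   # hypothetical: representations -> text
# Differences between `texts` and the original inputs can then be
# attributed to the intervention.
```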

Many techniques have been proposed to intervene in the high-dimensional representation space of language models. However, it's difficult to understand which features are being modified, and how. Our goal is to translate those intervened representations back into natural language. (2/6)

12.02.2025 15:19 👍 1 🔁 0 💬 1 📌 0