
Vaishnavh Nagarajan

@vaishnavh

Foundations of AI. I like simple and minimal examples and creative ideas. I also like thinking about the next token 🧮🧸 Google | PhD, CMU | https://arxiv.org/abs/2504.15266 | https://arxiv.org/abs/2403.06963 vaishnavh.github.io

3,304
Followers
387
Following
211
Posts
13.11.2024
Joined

Latest posts by Vaishnavh Nagarajan @vaishnavh

What does a language model model? - Vaishnavh Nagarajan TL;DR: Does the next-token logit track the conditional or the joint probability of the whole sequence? I had an invisi...

A recent paper (arxiv.org/abs/2602.18671) made me question something basic: do the logits of a language model model the next-token or the full sequence distribution? It really messed with my brain (in a fun way!). I wrote about the paper to clarify my thinking.

vaishnavh.github.io/blog/joint-o...

03.03.2026 00:29 👍 13 🔁 2 💬 1 📌 0
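For reference, the basic identity in play: softmaxed logits give the per-token conditionals, and the chain rule sums their logs into the joint log-probability of the whole sequence. A minimal numpy sketch of that bookkeeping (my own illustration, not from the blog post or the paper):

```python
import numpy as np

def sequence_logprob(logits, tokens):
    """Chain rule: log p(x_1..x_T) = sum_t log p(x_t | x_<t), where each
    conditional comes from softmaxing that step's logit vector."""
    logits = logits - logits.max(axis=-1, keepdims=True)           # stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    per_token = log_probs[np.arange(len(tokens)), tokens]          # log p(x_t | x_<t)
    return per_token, per_token.sum()                              # conditionals, joint

# toy run: 3 steps over a vocabulary of 5 tokens
rng = np.random.default_rng(0)
per_token, joint = sequence_logprob(rng.standard_normal((3, 5)), np.array([2, 0, 4]))
```

The question in the post is whether the raw logit additionally tracks the joint quantity on the right, not just the conditional on the left.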

(the exact observation is even stronger than what I wrote here, e.g., the low-rank structure "generalizes" across prompt-response pairs.)

19.02.2026 22:39 👍 1 🔁 0 💬 1 📌 0
Post image

but it turns out that if you arrange next-token logits from pairs of prompt x response sequences into a matrix (see pic for the exact object), you still get a *linear*, *low-rank* structure. neither this linearity nor the low-rankness follows by design. it somehow emerges from training.

19.02.2026 22:39 👍 0 🔁 0 💬 1 📌 0
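The exact matrix construction is in the picture above; as a rough stand-in, here's the kind of spectrum check involved, with synthetic rows in place of real model logits (an assumption for illustration):

```python
import numpy as np

def effective_rank(logit_matrix, tol=1e-2):
    """Rows = next-token logit vectors, one per prompt-response pair.
    Count singular values carrying non-negligible mass relative to the top."""
    s = np.linalg.svd(logit_matrix, compute_uv=False)
    return int((s / s[0] > tol).sum())

# stand-in data: 200 "pairs" drawn from a rank-8 subspace plus noise; a real
# check would stack logits collected from an actual LM
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 1000))
print(effective_rank(M + 0.01 * rng.standard_normal(M.shape)))  # -> 8
```

The surprise in the paper is that real logits behave like the synthetic `M` here, even though nothing in the construction forces it.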
Post image

here's my understanding: the low-rank observation is a non-trivial extension of a more straightforward & well-known observation called the softmax bottleneck. If you stack a bunch of next-token logits from various prompts, you'll get a low-rank matrix. this is by *design* (the last-layer bottleneck).

19.02.2026 22:39 👍 2 🔁 0 💬 1 📌 0
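For contrast, the by-design version: every logit vector is the last hidden state times the unembedding matrix, so any stack of next-token logits has rank at most the hidden width. A quick sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, num_prompts = 64, 2_000, 512     # illustrative sizes

W_out = rng.standard_normal((vocab, d))    # unembedding (last layer)
H = rng.standard_normal((num_prompts, d))  # final hidden states, one per prompt
logits = H @ W_out.T                       # (num_prompts, vocab)

# rank can never exceed the hidden width d, no matter how many prompts:
print(np.linalg.matrix_rank(logits))       # -> 64
```

The low-rank matrix in the posts above is instead built across prompt-response *pairs*, where no such factorization is baked in.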

If the low-rank logit structure really holds across settings, I expect it to have a lot of downstream corollaries & connections waiting to be discovered.

19.02.2026 22:20 👍 0 🔁 0 💬 1 📌 0
Preview
Sequences of Logits Reveal the Low Rank Structure of Language Models A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a...

I also like the low-rank logits finding (arxiv.org/abs/2510.24966) because it provides a novel, simple and surprising abstraction to think about what function a trained LLM implements. It took me a *lot* of time to understand, appreciate and buy the exact result here...

19.02.2026 22:20 👍 2 🔁 1 💬 1 📌 0

Incredibly, you can select these datapoints with a straightforward method: check whether each given preference is aligned with a model prompted with the target behavior. (i'd have expected that you'd need an exponential search over all possible data subsets to accomplish this)

19.02.2026 22:20 👍 0 🔁 0 💬 1 📌 0
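A rough sketch of what a linear-time filter of that flavor could look like; `target_logprob` is a hypothetical helper (the response's log-probability under a model prompted with the target behavior), not the paper's actual code:

```python
def select_aligned_pairs(preference_data, target_logprob):
    """Keep only the (prompt, chosen, rejected) triples where the
    behavior-prompted model already prefers `chosen` over `rejected`."""
    return [
        (prompt, chosen, rejected)
        for prompt, chosen, rejected in preference_data
        if target_logprob(prompt, chosen) > target_logprob(prompt, rejected)
    ]
```

One pass over the dataset, rather than a search over its exponentially many subsets.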

This paper discovers another spooky generalization effect: to trigger any target behavior in an LLM, you can carefully subselect from a *completely unrelated* preference dataset such that preference finetuning on that subselected dataset produces that behavior.

19.02.2026 22:20 👍 1 🔁 0 💬 1 📌 0
Preview
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understa...

Really liked this paper, which ties together two observations that are equally mind-boggling (low-rank logits & subliminal/weird generalization effects) and presents one more such observation

arxiv.org/abs/2602.04863

19.02.2026 22:20 👍 20 🔁 5 💬 2 📌 0
Post image

The visual world is composed of objects, and those objects are composed of features. But do VLMs exploit this compositional structure when processing multi-object scenes? In our 🆒🆕 #ICLR2026 paper, we find they do – via emergent symbolic mechanisms for visual binding. 🧵👇

05.02.2026 20:54 👍 83 🔁 25 💬 1 📌 3
Post image

He also contrasts the personalities of Hardy and Einstein:

13.01.2026 20:50 👍 2 🔁 0 💬 0 📌 0
Post image

Currently reading "a mathematician's apology" by GH Hardy. This is an excerpt from the foreword by CP Snow describing Hardy's personality and his work:

13.01.2026 20:49 👍 13 🔁 1 💬 1 📌 0


fascinating!

12.01.2026 19:08 👍 1 🔁 0 💬 0 📌 0

Would love pointers to related lit! Will DM you about the other question. Thank you for your kind words!

12.01.2026 19:04 👍 0 🔁 0 💬 0 📌 0

Rare to see such long-term efforts these days 🫡

09.01.2026 22:52 👍 14 🔁 1 💬 0 📌 0
Post image

We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! arxiv.org/abs/2601.03220 1/7

07.01.2026 17:27 👍 143 🔁 34 💬 9 📌 9

Please welcome Google's Open Source efforts to Bluesky at @opensource.google!

07.01.2026 21:12 👍 248 🔁 38 💬 7 📌 4
Post image

for deeper models, they initialize the network in a way that the decomposition of each layer aligns with the previous layer's. if you didn't assume this, there'd be "interference" across components, which I *suspect* would contribute to associative memorization.

08.01.2026 22:52 👍 4 🔁 1 💬 0 📌 0
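A sketch of that kind of aligned initialization for square layers (my reading of the construction in arXiv:1312.6120; details may differ from the paper): consecutive layers share orthonormal bases, so the product's singular modes stay decoupled under gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def aligned_init(U, s, Vt, depth, scale=0.01):
    """Layers W_l = B_{l+1} diag(scale * s**(1/depth)) B_l^T with shared
    orthonormal 'connector' bases B_l, so the telescoping product equals
    U diag((scale**depth) * s) Vt and each mode evolves independently."""
    d = len(s)
    bases = [Vt.T] + [np.linalg.qr(rng.standard_normal((d, d)))[0]
                      for _ in range(depth - 1)] + [U]
    S = np.diag(scale * s ** (1.0 / depth))   # balanced singular values
    return [bases[l + 1] @ S @ bases[l].T for l in range(depth)]

# sanity check: compose 4 layers and recover the intended singular vectors
d = 5
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
Vt = np.linalg.qr(rng.standard_normal((d, d)))[0].T
Ws = aligned_init(U, np.linspace(1.0, 5.0, d), Vt, depth=4)
prod = Ws[3] @ Ws[2] @ Ws[1] @ Ws[0]          # equals U @ diag(...) @ Vt
```

Breaking this alignment mixes modes in the gradient, which is the "interference" mentioned above.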
Preview
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap ...

thanks for being curious about it :-) I'm basing this off of the assumptions made in this seminal paper arxiv.org/abs/1312.6120

they begin with an analysis of 2-layer (weight-untied) models where the dynamics neatly evolve along each spectral component.

08.01.2026 22:52 👍 3 🔁 0 💬 1 📌 0

now if I ask you "how many countries away is Mongolia from India?", in the lookup-table approach, you have to sit and piece together the connections by iterating over a frustratingly long list. in the map approach, you can "see" the answer quickly.

08.01.2026 22:47 👍 1 🔁 0 💬 1 📌 0
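A tiny concrete version of the contrast in these two posts (toy border data, stdlib only; my own illustration):

```python
from collections import deque

# "associative" storage: a raw lookup table of bordering countries
# (a tiny illustrative subset, not complete geography)
borders = {
    "India": ["China", "Nepal"],
    "Nepal": ["India", "China"],
    "China": ["India", "Nepal", "Mongolia", "Russia"],
    "Mongolia": ["China", "Russia"],
    "Russia": ["China", "Mongolia"],
}

def hops(src, dst):
    """Multi-hop questions force explicit search over the table: BFS,
    piecing together connections one entry at a time."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nb in borders[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))

print(hops("India", "Mongolia"))  # -> 2 (via China)

# "geometric" storage would instead place countries in a latent space whose
# distances already encode the graph, so the answer is read off the
# coordinates rather than recomputed by search
```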

in associative memory, the latent space doesn't really encode any interesting distance.

imagine you're trying to store which countries share borders. you could simply write down a list of adjacent countries OR you could visualize the world map in your head. this is "associative" vs "geometric".

08.01.2026 22:47 👍 1 🔁 1 💬 1 📌 0

Thanks for engaging with the work! Could you elaborate? I'm not an expert on graph theory but I'd be interested in any ideas to better understand this.

08.01.2026 22:41 👍 0 🔁 0 💬 0 📌 0
Preview
Deep sequence models tend to memorize geometrically; it is unclear why Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage...

19/ These findings build on many nascent, fragmented observations in the literature that aren't credited here for lack of space. There are also caveats in extending all this to natural language (each caveat, an open question ;) ). Please see the full story here:

arxiv.org/abs/2510.26745

08.01.2026 20:31 👍 15 🔁 4 💬 0 📌 0

18/ We hope this inspires revisiting analyses of Transformer knowledge/storage capacity/unlearning. Graph setups may also help cleanly understand the emergence of "world models".

08.01.2026 20:31 👍 7 🔁 0 💬 1 📌 0

17/ Our findings suggest there's "magic" in integrating knowledge into model weights rather than stuffing it into context. They also draw a vivid contrast between traditional retrieval with two-tower models and modern generative retrieval models.

08.01.2026 20:31 👍 8 🔁 0 💬 1 📌 0

16/ And practically: how do we make Transformer memory more geometric (if you want hasty reasoning/creativity) or more associative (if you want accurate retrieval, no hallucination)?
Understanding & manipulating this competition is a fundamental open question.

08.01.2026 20:31 👍 11 🔁 0 💬 1 📌 1
Post image

15/ Indeed in hindsight, the deeper Transformer model produces less elegant geometries than node2vec.

In more general (nastier) graphs, we suspect that memory may be a mix of associative/geometric. So, what notion of graph "complexity" & training dictates this bias?

08.01.2026 20:31 👍 6 🔁 0 💬 1 📌 0

14/ The more advanced open question is to study the dynamics in non-shallow models, where associative memory becomes a real competitor!

Here you can NOT merrily disentangle the dynamics of each spectral component to analyze them (not w/o weird assumptions about initialization)

08.01.2026 20:31 👍 9 🔁 0 💬 2 📌 0
Post image

13/ But strangely, prior node/word2vec analyses assume pressures (bottleneck, early stopping, etc.) to explain the low-rank bias.

Our analysis intuits how *the CE loss* by nature nicely induces a low-rank spectral bias. We leave a formal proof as an open theoretical question.

08.01.2026 20:31 👍 9 🔁 0 💬 1 📌 0
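Not the paper's analysis, but a toy one can use to poke at the intuition: factorized logits trained with CE from a small initialization on a simple graph, watching where the spectral mass sits. The sizes and the ring graph here are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 20  # 20 nodes; factor width = 20, so NO rank bottleneck is imposed

# target next-node distribution on a ring graph: each node's two neighbors
P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5

# logits = E @ W.T, trained with cross-entropy from a small initialization
E = 0.01 * rng.standard_normal((n, d))
W = 0.01 * rng.standard_normal((n, d))
for step in range(3001):
    logits = E @ W.T
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = Z / Z.sum(axis=1, keepdims=True)
    G = (probs - P) / n                     # gradient of mean CE w.r.t. logits
    dE, dW = G @ W, G.T @ E                 # compute both grads before updating
    E -= 1.0 * dE
    W -= 1.0 * dW
    if step % 1000 == 0:
        s = np.linalg.svd(E @ W.T, compute_uv=False)
        print(step, np.round(s[:4] / max(s[0], 1e-9), 3))
        # the top few modes grow first: spectral mass stays concentrated even
        # though nothing architecturally caps the rank
```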