What does a language model model? - Vaishnavh Nagarajan
TL;DR: Does the next-token logit track the conditional or the joint probability of the whole sequence? I had an invisi...
A recent paper (arxiv.org/abs/2602.18671) made me question something basic: do the logits of a language model model the next-token or the full sequence distribution? It really messed with my brain (in a fun way!). I wrote about the paper to clarify my thinking.
vaishnavh.github.io/blog/joint-o...
03.03.2026 00:29
(the exact observation is even stronger than what I wrote here; e.g., the low-rank structure "generalizes" across prompt-response pairs.)
19.02.2026 22:39
but it turns out that if you arrange next-token logits from pairs of prompt x response sequences into a matrix (see pic for the exact object), you still get a *linear*, *low-rank* structure. neither this linearity nor the low-rankness follows by design. it somehow emerges from training.
19.02.2026 22:39
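A hypothetical way to probe for such structure yourself (this is my own numpy sketch, not the paper's exact matrix construction): stack logit vectors into a matrix and check how much spectral energy the top few singular values capture.

```python
import numpy as np

def top_k_energy(M, k):
    """Fraction of squared spectral energy in the top-k singular values --
    a quick effective-rank probe for a stacked logit matrix."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s[:k] ** 2).sum() / (s ** 2).sum()

# Stand-in for real logits: a rank-5 matrix plus small noise.
rng = np.random.default_rng(0)
M = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 1000))
M += 0.01 * rng.normal(size=M.shape)

print(top_k_energy(M, 5))  # close to 1.0, i.e., effectively rank 5
```

On real model logits, energy concentrating in a few components (without any architectural reason for it) would be the emergent structure the paper reports.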
here's my understanding: the low-rank observation is a non-trivial extension of a more straightforward & well-known observation called the softmax bottleneck. If you stack a bunch of next-token logits from various prompts, you'll get a low-rank matrix. this is by *design* (the last layer bottleneck)
19.02.2026 22:39
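The by-design part is easy to see in a toy numpy sketch: every logit vector is the same unembedding matrix applied to a d-dimensional hidden state, so a stack of them can never exceed rank d.

```python
import numpy as np

# Toy dimensions: hidden size d is much smaller than vocab size V.
d, V, num_prompts = 8, 1000, 200

rng = np.random.default_rng(0)
H = rng.normal(size=(num_prompts, d))   # final hidden states, one per prompt
W_out = rng.normal(size=(V, d))         # unembedding / output projection

logits = H @ W_out.T                    # (num_prompts, V) stacked logit matrix

# Rank is capped by the hidden dimension, no matter how many prompts we stack.
print(np.linalg.matrix_rank(logits))    # at most d = 8
```

That cap is the softmax bottleneck; the surprise in the paper is finding low-rank structure in objects where no such cap is forced by the architecture.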
If the low-rank logit structure really holds across settings, I expect it to have a lot of downstream corollaries & connections waiting to be discovered.
19.02.2026 22:20
Incredibly, you can select these datapoints with a straightforward method: check whether each given preference agrees with a model prompted with the target behavior. (I'd have expected you'd need an exponential search over all possible data subsets to accomplish this.)
19.02.2026 22:20
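Here is a minimal sketch of that selection rule as I understand it (all names are mine, not the paper's, and the scorer below is a toy stand-in for a real model's log-probabilities): keep a (chosen, rejected) pair only if a model prompted with the target behavior already prefers `chosen`.

```python
def select_aligned_pairs(pairs, logprob_fn, target_prompt):
    """pairs: list of (chosen, rejected) texts.
    logprob_fn(prompt, text) -> log-probability of text under the base model.
    Returns the subset whose preference agrees with the target-prompted model."""
    selected = []
    for chosen, rejected in pairs:
        if logprob_fn(target_prompt, chosen) > logprob_fn(target_prompt, rejected):
            selected.append((chosen, rejected))
    return selected

# Toy stand-in scorer: pretend longer responses score higher under the
# target-prompted "model" (target behavior: verbosity).
fake_logprob = lambda prompt, text: float(len(text))

pairs = [("a longer reply", "short"), ("x", "a much longer reply")]
print(select_aligned_pairs(pairs, fake_logprob, "always be verbose"))
# keeps only the pair whose preference matches the target behavior
```

Crucially, this is a single linear pass over the dataset, not a search over subsets.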
This paper discovers another spooky generalization effect: to trigger any target behavior in an LLM, you can carefully subselect from a *completely unrelated* preference dataset such that preference finetuning on that subselected dataset produces that behavior.
19.02.2026 22:20
The visual world is composed of objects, and those objects are composed of features. But do VLMs exploit this compositional structure when processing multi-object scenes? In our #ICLR2026 paper, we find they do, via emergent symbolic mechanisms for visual binding. 🧵👇
05.02.2026 20:54
He also contrasts the personalities of Hardy and Einstein:
13.01.2026 20:50
Currently reading "A Mathematician's Apology" by GH Hardy. This is an excerpt from the foreword by CP Snow describing Hardy's personality and his work:
13.01.2026 20:49
in associative memory, the latent space doesn't really encode any interesting distance.
imagine you're trying to store which countries share borders. you could simply write down a list of adjacent countries OR you could visualize the world map in your head. this is "associative" vs "geometric".
08.01.2026 22:47
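The contrast can be made concrete in a toy Python sketch (my own illustration, with a tiny real border chain India–Nepal–China–Mongolia): associative storage is a bare lookup table of pairs, while geometric storage assigns each item a position so that relations fall out of distances. The coordinates below are made up for illustration.

```python
import math

# "Associative" storage: an explicit lookup table of adjacent countries.
borders = {
    ("india", "nepal"), ("nepal", "china"), ("china", "mongolia"),
}

def share_border(a, b):
    return (a, b) in borders or (b, a) in borders

# "Geometric" storage: each country gets a position in a latent space, and
# adjacency is (roughly) small distance. Coordinates are illustrative only.
coords = {"india": (0.0, 0.0), "nepal": (1.0, 0.0),
          "china": (2.0, 0.0), "mongolia": (3.0, 0.0)}

def near(a, b, radius=1.5):
    return math.dist(coords[a], coords[b]) < radius

print(share_border("india", "nepal"), near("india", "nepal"))       # True True
print(share_border("india", "mongolia"), near("india", "mongolia")) # False False
```

Both stores answer the one-hop question, but only the geometric one encodes distances you can read off directly.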
fascinating!
12.01.2026 19:08
Would love pointers to related lit! Will DM you about the other question. Thank you for your kind words!
12.01.2026 19:04
Rare to see such long-term efforts these days 🫡
09.01.2026 22:52
We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! arxiv.org/abs/2601.03220 1/7
07.01.2026 17:27
Please welcome Google's Open Source efforts to Bluesky at @opensource.google!
07.01.2026 21:12
for deeper models, they initialize the network in a way that the decomposition of each layer aligns with the previous layer. if you didn't assume this, there'd be "interference" across components, which I *suspect* would contribute to associative memorization.
08.01.2026 22:52
now if I ask you "how many countries away is Mongolia from India?", in the lookup-table approach, you have to sit and piece together the connections by iterating over a frustratingly long list. in the map approach, you can "see" the answer quickly.
08.01.2026 22:47
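The "piece together the connections" strategy is literally a graph search. A sketch of what the lookup-table approach has to do for the multi-hop question (toy adjacency list, breadth-first search):

```python
from collections import deque

# With only the adjacency list ("lookup table"), a multi-hop question like
# "how many countries away is Mongolia from India?" forces an explicit search.
adjacency = {
    "india": ["nepal"], "nepal": ["india", "china"],
    "china": ["nepal", "mongolia"], "mongolia": ["china"],
}

def hops(start, goal):
    """Breadth-first search: the 'iterate over the list' strategy."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

print(hops("india", "mongolia"))  # 3
```

With a geometric (map-like) representation, the same answer is roughly a distance read-out rather than an iterative search.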
Thanks for engaging with the work! Could you elaborate? I'm not an expert on graph theory but I'd be interested in any ideas to better understand this.
08.01.2026 22:41
18/ We hope this inspires revisiting analyses of Transformer knowledge/storage capacity/unlearning. Graph setups may also help cleanly understand the emergence of "world models".
08.01.2026 20:31
17/ Our findings suggest there's "magic" in integrating knowledge into model weights rather than stuffing it into context. It also shows a vivid contrast between traditional retrieval with two-tower models vs modern generative retrieval models.
08.01.2026 20:31
16/ And practically: how do we make Transformer memory more geometric (if you want hasty reasoning/creativity) or more associative (if you want accurate retrieval, no hallucination)?
Understanding & manipulating this competition is a fundamental open question.
08.01.2026 20:31
15/ Indeed in hindsight, the deeper Transformer model produces less elegant geometries than node2vec.
In more general (nastier) graphs, we suspect that memory may be a mix of associative/geometric. So, what notion of graph "complexity" & training dictates this bias?
08.01.2026 20:31
14/ The more advanced open question is to study the dynamics in non-shallow models, where associative memory becomes a real competitor!
Here you can NOT merrily disentangle the dynamics of each spectral component to analyze them (not w/o weird assumptions about initialization)
08.01.2026 20:31
13/ But strangely, prior node/word2vec analyses assume pressures (bottleneck, early stopping, etc.) to explain the low-rank bias.
Our analysis intuits how *the CE loss* by nature nicely induces a low-rank spectral bias. We leave a formal proof as an open theoretical question.
08.01.2026 20:31