I am recruiting a PhD student to work with me, Peter Cholak, Anand Pillay, and Andy Yang @pentagonalize.bsky.social on transformers and logic/model theory (or related topics). If you are interested, please email me with "FLaNN" in the subject line!
Read the cookbook: arxiv.org/abs/2510.00368
Join us for weekly seminars on formal language theory, ML, NLP, and more: flannseminars.github.io
Thanks to all the chefs: @ccwatson.bsky.social, @antonxue.bsky.social, @satwik77.bsky.social, @ll4r3n4.bsky.social, @lambdaviking.bsky.social, Emile Dos Santos Ferreira, @anejsvete.bsky.social, @dchiang.bsky.social
There is no better way to understand what transformers can do than to get your hands dirty and construct them, weight by weight. The Transformer Cookbook is a guide for anyone aiming to understand the expressive power of transformers at this formal level.
We present The Transformer Cookbook: a collection of recipes for programming algorithms directly into transformers!
Hungry for an induction head? Craving a Dyck language recognizer? We show you step-by-step how to cook up transformers for these algorithms and many more!
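To make the target concrete: here's the Dyck-1 recognition algorithm in plain Python — a hypothetical reference implementation of what such a transformer must compute, not the cookbook's actual weight construction.

```python
def is_dyck1(s: str) -> bool:
    """Recognize Dyck-1: balanced strings over '(' and ')'.
    A running counter tracks open brackets; it must never go
    negative and must end at exactly zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:  # a ')' with no matching '(' so far
            return False
    return depth == 0
```

The fun of the cookbook is realizing this counter with attention and feed-forward layers instead of a Python loop.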
Andy Yang, Christopher Watson, Anton Xue, Satwik Bhattamishra, Jose Llarena, William Merrill, Emile Dos Santos Ferreira, Anej Svete, David Chiang: The Transformer Cookbook https://arxiv.org/abs/2510.00368 https://arxiv.org/pdf/2510.00368 https://arxiv.org/html/2510.00368
Andy Yang @pentagonalize.bsky.social drove the conceptualization, theory, and experiments of this work. I was just the checker and editor!
Although there is a lot of wiggle room in how rounding/precision is defined, our theoretical predictions match the experimental results surprisingly well!
The separating languages are very simple: L_k is the language of k blocks of one or more repetitions of a symbol, e.g., L_3 contains strings aba, aabbbbaaaaaa, etc. More blocks require more depth.
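A quick sketch of the separating languages, assuming the definition above (k maximal blocks of a repeated symbol); this is just an illustrative membership check, not part of the paper:

```python
from itertools import groupby

def num_blocks(s: str) -> int:
    """Count maximal blocks of equal symbols:
    'aabbbbaaaaaa' -> 3 (aa | bbbb | aaaaaa)."""
    return sum(1 for _ in groupby(s))

def in_L(s: str, k: int) -> bool:
    """Membership in L_k: the string consists of exactly k blocks."""
    return len(s) > 0 and num_blocks(s) == k
```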
Further, we show that deeper programs/formulas in C-RASP are strictly more expressive than shallower programs/formulas. Together, these results imply that in the above-defined variant, deeper transformers are strictly more expressive than shallower transformers.
C-RASP is a programmer-friendly version of "temporal logic with future-masked counting." We show both are exactly equivalent to soft-attention transformers with fixed precision outside attention but no rounding inside attention (to avoid under/overflow when summing over the sequence).
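To give a flavor of future-masked counting: at each position you may count how often a property held at positions up to and including the current one, and compare counts. A toy Python rendering (my own illustration, not C-RASP syntax):

```python
def prefix_count(s: str, pred) -> list[int]:
    """Future-masked counting: out[i] = number of positions
    j <= i where pred(s[j]) holds."""
    c, out = 0, []
    for ch in s:
        c += int(pred(ch))
        out.append(c)
    return out

def majority_a(s: str) -> bool:
    """Accept iff 'a's outnumber 'b's -- the kind of property
    counting logic (and hence C-RASP) can express."""
    if not s:
        return False
    count_a = prefix_count(s, lambda ch: ch == "a")
    count_b = prefix_count(s, lambda ch: ch == "b")
    return count_a[-1] > count_b[-1]
```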
New on arXiv: Knee-Deep in C-RASP, by @pentagonalize.bsky.social, @cadilhac.bsky.social, and me. The solid stepped line is our theoretical prediction based on what problems C-RASP can solve, and the numbers/colors are what transformers (no position embedding) can learn.
(Out of the papers that Aarohi @aarsri.bsky.social has published while at Notre Dame, 80% have received an award!)
In contrast, on text with variation involving new words or meanings (e.g., "lie" vs. "cap"), far more data is needed, but it leads to a massive breakthrough in performance.
On text with character-level variation (e.g., "strategy" vs. "strat"), out-of-the-box performance improves with even a few additional training examples -- but it approaches a plateau, suggesting that more data alone is not the solution.
Congratulations to Aarohi Srivastava @aarsri.bsky.social on winning the Best Paper Award at W-NUT at NAACL 2025! This paper applies various interventions simulating noisy text or dialectal variation to discover how different interventions have different effects.
If you're submitting an abstract to @colmweb.org, might as well submit it to MSLD too! nlp.nd.edu/msld25/
Registration at Midwest Speech and Language Days is free, poster printing is free, and we will be able to provide free lodging to a limited number of students. nlp.nd.edu/msld25/
The abstract submission deadline for Midwest Speech and Language Days is in two days, on March 20! Please submit an abstract! MSLD is non-archival, and submissions of both work-in-progress and previously published work are encouraged. nlp.nd.edu/msld25/
The meeting will feature keynote addresses by @mohitbansal.bsky.social, @davidrmortensen.bsky.social, Karen Livescu, and Heng Ji. Plus all of your great talks and posters! nlp.nd.edu/msld25
Midwest Speech and Language Days will be held Apr 15-16 at @NotreDame! Abstract submissions are due Mar 20, and the registration deadline is Mar 27. Financial assistance for students (lodging, poster printing) is available. nlp.nd.edu/msld25
Oops, and @sleyna.bsky.social
Oops, should be @pentagonalize.bsky.social
showing that arbitrary-precision average-hard attention transformers and poly(n)-precision softmax-attention transformers are in DLOGTIME-uniform TC0.
(3) "Transformers in TC0" arxiv.org/abs/2409.13629. Previous work by
@lambdaviking.bsky.social, Ashish Sabharwal, and @sleyna.bsky.social has shown that transformers with log(n) precision (where n is the input length) are in the circuit complexity class TC0. This paper improves these results,
We study how transformers express formal language *transductions*. For example, unique-hard attention transformers are equivalent to star-free languages. But star-free languages don't have a transduction analogue; they have at least three! Which one is it?
(2) With @sleyna.bsky.social, Dana Angluin, Jon Rawski, and Ashish Sabharwal, "Transformers as Transducers" arxiv.org/abs/2404.02040. To appear in TACL.
We also show how softmax-attention transformers can simulate many average-hard attention transformers (including Pérez et al.'s well-known average-hard attention transformer simulating a Turing machine). But it's more difficult than is often assumed!