
Jackson Petty

@jacksonpetty.org

the passionate shepherd, to his love • ἀρετῇ • מנא הני מילי

209
Followers
252
Following
313
Posts
12.04.2023
Joined

Latest posts by Jackson Petty @jacksonpetty.org

Zero as in “001” (the leading digits are silent)

13.12.2025 17:09 👍 2 🔁 0 💬 0 📌 0

"there's a new serif in town"

10.12.2025 01:06 👍 14787 🔁 2508 💬 1017 📌 256
Post image

📢 Postdoc position 📢

I’m recruiting a postdoc for my lab at NYU! Topics include LM reasoning, creativity, limitations of scaling, AI for science, & more! Apply by Feb 1.

(Different from NYU Faculty Fellows, which are also great but less connected to my lab.)

Link in 🧵

02.12.2025 16:04 👍 21 🔁 12 💬 2 📌 1

Surely a third account would help to clarify the matter

29.10.2025 12:38 👍 7 🔁 0 💬 2 📌 0

lmao were you also on the 12:17 to GC?

05.10.2025 18:29 👍 0 🔁 0 💬 1 📌 0
Post image

Ad infinitum

05.10.2025 17:28 👍 29 🔁 7 💬 0 📌 0
LLMs Switch to Guesswork Once Instructions Get Long
LLMs abandon reasoning for guesswork when instructions get long, new work from Linguistics PhD student Jackson Petty & CDS shows.

Linguistics PhD student @jacksonpetty.org finds LLMs "quiet-quit" when instructions get long, switching from reasoning to guesswork.

With CDS' @tallinzen.bsky.social, @shauli.bsky.social, @lambdaviking.bsky.social, @michahu.bsky.social, and Wentao Wang.

nyudatascience.medium.com/llms-switch-...

10.09.2025 15:26 👍 7 🔁 2 💬 0 📌 0

you’re telling me a star spangled this banner??

04.07.2025 11:49 👍 1 🔁 0 💬 0 📌 0
NYU LLM + cognitive science post-doc interest form Tal Linzen's group at NYU is hiring a post-doc! We're interested in creating language models that process language more like humans than mainstream LLMs do, through architectural modifications and int...

I'm hiring at least one post-doc! We're interested in creating language models that process language more like humans than mainstream LLMs do, through architectural modifications and interpretability-style steering. Express interest here: docs.google.com/forms/d/e/1F...

21.06.2025 15:13 👍 42 🔁 21 💬 2 📌 1

Kauaʻi is amazing

22.06.2025 12:15 👍 2 🔁 0 💬 0 📌 0

Thanks to my wonderful co-authors: @michahu.bsky.social, Wentao Wang, @shauli.bsky.social, @lambdaviking.bsky.social, and @tallinzen.bsky.social! Paper, dataset, and code at jacksonpetty.org/relic/

09.06.2025 18:02 👍 3 🔁 0 💬 0 📌 0

E.g.: general direction following, or translation of *natural* languages based only on non-formal reference grammars. Our results here show that there is no a priori roadblock to success, but that there are overhangs between what models can do and what they actually do.

09.06.2025 18:02 👍 2 🔁 0 💬 1 📌 0

2. It’s natural to ask “well, why not just break out to tool use? Parsers can solve this task trivially.” That’s true! But I think it’s valuable to understand how formally-verifiable tasks can shed light on model behavior on tasks which aren’t formally verifiable.

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0

This is contrary to the view that failure means “LLMs can’t reason”—failure here is likely correctable, and hopefully will make models more robust!

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0

Why is this important? Well, two main reasons:
1. The overhang between models’ knowledge of *how* to solve the task and their ability to follow through gives me hope that we can produce models which are better at following complex instructions in-context.

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

So, what did we learn?
1. LLMs *do* know how to follow instructions, but they often don’t
2. The complexity of instructions and examples reliably predicts whether (current) models can solve the task
3. On hard tasks, models (and people, tbh) like to fall back to heuristics

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

But models often get distracted by irrelevant info, or “get lazy” and rely on heuristics rather than actually verifying the instructions. We use o4-mini as an LLM judge to classify model strategies: as examples get more complex, models shift from relying on rules to relying on heuristics:

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

So, how can LLMs succeed at this task, and why do they fail when grammars and examples get complex? Models generally do understand the solution: even small models recognize they can build a CYK table or do an exhaustive top-down search of the derivation tree:

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
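[For context, the CYK strategy the models describe can be sketched in a few lines. This is a minimal recognizer for a grammar in Chomsky Normal Form; the toy grammar at the bottom is a hypothetical stand-in, not one of RELIC's sampled CFGs.]

```python
def cyk_recognize(tokens, terminal_rules, binary_rules, start="S"):
    """Return True iff a CNF grammar derives the token sequence.

    terminal_rules: list of (lhs, terminal) pairs, e.g. ("A", "a")
    binary_rules:   list of (lhs, (B, C)) pairs, e.g. ("S", ("A", "B"))
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[j][i] = set of nonterminals deriving tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, term in terminal_rules:
            if term == tok:
                table[0][i].add(lhs)
    for span in range(2, n + 1):          # length of the substring
        for i in range(n - span + 1):     # start of the substring
            for split in range(1, span):  # where to cut it in two
                left = table[split - 1][i]
                right = table[span - split - 1][i + split]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[span - 1][i].add(lhs)
    return start in table[n - 1][0]

# Toy CNF grammar: S -> A B, A -> 'a', B -> 'b'
terminal_rules = [("A", "a"), ("B", "b")]
binary_rules = [("S", ("A", "B"))]
print(cyk_recognize(["a", "b"], terminal_rules, binary_rules))   # True
print(cyk_recognize(["b", "a"], terminal_rules, binary_rules))   # False
```

The point of the thread is that models can articulate (and even execute) this procedure on small inputs, yet abandon it as grammars and strings grow.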
Post image

In general, we find that models tend to agree with one another on which grammars (left) and which examples (right) are hard, though again 4.1-nano and 4.1-mini pattern with each other against others. These correlations increase with complexity!

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

Interestingly, models’ accuracies reflect divergent class biases: 4.1-nano and 4.1-mini love to predict strings as positive, while all other models have the opposite bias; these biases also change with example complexity!

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

What do we find? All models struggle on complex instruction sets (grammars) and tasks (strings); the best reasoning models are better than the rest, but still approach ~chance accuracy when grammars (top) have ~500 rules, or when strings (bottom) have >25 symbols.

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0

We release the static dataset used in our evals as RELIC-500, where the grammar complexity is capped at 500 rules.

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0

We introduce RELIC as an LLM evaluation:
1. generate a CFG of a given complexity;
2. sample positive (parses) and negative (doesn’t parse) strings over the grammar’s terminal symbols;
3. prompt the LLM with a (grammar, string) pair and ask it to classify whether the grammar generates the given string

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
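[The sampling step above can be sketched as follows. This is a hypothetical illustration with a toy grammar for a^n b^n, not the paper's pipeline: positives are drawn by expanding the CFG top-down, and negatives are rejection-sampled terminal strings the grammar does not derive.]

```python
import random

# Toy CFG for the language a^n b^n (n >= 1); a stand-in for RELIC's sampled grammars.
GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}
TERMINALS = ["a", "b"]

def sample_positive(symbol="S", depth=0, max_depth=10):
    """Expand nonterminals left-to-right to get a string the grammar derives."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    prods = GRAMMAR[symbol]
    # near the depth limit, take the shortest production to guarantee termination
    prod = random.choice(prods) if depth < max_depth else min(prods, key=len)
    out = []
    for sym in prod:
        out.extend(sample_positive(sym, depth + 1, max_depth))
    return out

def is_positive(tokens):
    """Ground-truth recognizer for the toy a^n b^n language."""
    n = len(tokens)
    return (n >= 2 and n % 2 == 0
            and tokens[: n // 2] == ["a"] * (n // 2)
            and tokens[n // 2 :] == ["b"] * (n // 2))

def sample_negative(length):
    """Rejection-sample a terminal string outside the language."""
    while True:
        tokens = [random.choice(TERMINALS) for _ in range(length)]
        if not is_positive(tokens):
            return tokens

pos = sample_positive()
neg = sample_negative(len(pos))
assert is_positive(pos) and not is_positive(neg)
```

Because membership is decided by an exact recognizer rather than a reference answer, labels stay verifiable no matter how complex the sampled grammar gets.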

As an analogue for instruction sets and tasks, formal grammars have some really nice properties: they can be made arbitrarily complex, we can sample new ones easily (avoiding problems with dataset contamination), and we can verify a model’s accuracy using formal tools (parsers).

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

LLMs are increasingly used to solve tasks “zero-shot,” with only a specification of the task given in a prompt. To evaluate LLMs on increasingly complex instructions, we turn to a classic problem in computer science and linguistics: recognizing if a formal grammar generates a given string.

09.06.2025 18:02 👍 1 🔁 0 💬 1 📌 0

Code, dataset, and paper at jacksonpetty.org/relic/

09.06.2025 18:02 👍 0 🔁 0 💬 1 📌 0
Post image

How well can LLMs understand tasks with complex sets of instructions? We investigate through the lens of RELIC: REcognizing (formal) Languages In-Context, finding a significant overhang between what LLMs are able to do theoretically and how well they put this into practice.

09.06.2025 18:02 👍 5 🔁 2 💬 1 📌 0

Such a shame that Apple doesn’t have much cash on hand for such expenditures

23.05.2025 22:33 👍 1 🔁 0 💬 0 📌 0

(not that this would _replace_ the scraped data in the near or medium term, but it probably would curry favor with public sentiment)

11.05.2025 10:43 👍 4 🔁 0 💬 0 📌 0

I’m surprised that some AI lab hasn’t tried to get some good PR by throwing money at artists / writers / etc to create private training distributions. Apple isn’t really leading in AI but this is the kind of thing I would have expected them to do.

11.05.2025 10:39 👍 2 🔁 0 💬 1 📌 0