📢 Life update 📢
After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
Demographic cues (e.g., names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed to be interchangeable.
🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions about LLM bias. 🧵👇
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3. To our knowledge, it is the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 💬
There's plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases, which is where bias actually matters.
IssueBench, our attempt to fix this, has been accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!
New results 🧵
Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs.
Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers).
👇
Thanks, Jordan! Your ACL 2021 paper was a huge source of inspiration for us!
We did not specifically analyze novel models as your paper did. While I am optimistic that Fluid Benchmarking improves over static IRT-based methods in this regime as well, there are definitely limitations, which we discuss in the paragraph below.
Would be exciting to run more experiments on this!
In our experiments, we find that this dynamic approach consistently outperforms static IRT-based methods. The improvements are especially pronounced in terms of variance, which poses a major challenge for static IRT-based methods. We discuss this in more detail in the paragraph below.
Great question! The key difference is that we use IRT to dynamically adapt the subset of items to a model's capability, rather than to determine a static, "globally optimal" subset of items as in prior work. With Fluid Benchmarking, each model is evaluated on a different subset of items.
LM benchmark design requires 3 decisions, how to:
🔍 select test cases
📊 score LM on each test
📦 aggregate scores to estimate perf
fluid benchmarking is simple:
📣 find max informative test cases
🔥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
Last but not least, a huge shoutout to my incredible coauthors @davidheineman.com, @ianmagnusson.bsky.social, @kylelo.bsky.social, @jessedodge.bsky.social, @maartensap.bsky.social, Pang Wei Koh, Chun Wang, @hanna-nlp.bsky.social, and @nlpnoah.bsky.social! 🤝
For details, check out our paper, blog, code, and data:
📄 arxiv.org/abs/2509.11106
✍️ allenai.org/blog/fluid-b...
💻 github.com/allenai/flui...
📚 huggingface.co/datasets/all...
Looking forward to chatting more at #COLM2025! 👋
Overall, our work shows that LLM evaluations can be substantially improved by moving beyond the near-universal practice of static benchmarking, which assumes a single, globally optimal set of evaluation questions for all models.
These advantages (and more) are achieved while simultaneously reducing evaluation cost.
Example: on MMLU, Fluid Benchmarking results in lower step-to-step variance and higher validity than standard methods while using 50 times fewer questions. ⚡
Fluid Benchmarking substantially reduces step-to-step variance during pretraining.
It also increases validity: results generalize better to other benchmarks targeting the same capability. One reason: it automatically avoids mislabeled questions, cutting label errors by 99%! 🤯
In our experiments, we apply Fluid Benchmarking to evaluation during pretraining, a setting where capabilities evolve rapidly.
We find that Fluid Benchmarking dynamically adapts to these changes, administering easier questions early in training and more difficult ones later.
Fluid Benchmarking repeats this loop until the number of administered questions reaches the allotted budget.
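Putting the pieces from this thread together, here is a self-contained sketch of one such run; the function and variable names are mine for illustration, assuming the standard 2PL model, not the released API:

```python
import numpy as np

def run_fluid_eval(lm_answers, a, b, budget, theta=0.0):
    """One fluid run. lm_answers[i] = 1 iff the LM answers item i correctly;
    a, b: 2PL discrimination/difficulty arrays; budget: max items to give."""
    grid = np.linspace(-4, 4, 801)
    administered, responses = [], []
    for _ in range(budget):
        # Administer the most informative remaining item at the current theta.
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        info = a**2 * p * (1.0 - p)
        info[administered] = -np.inf  # never repeat an item
        i = int(np.argmax(info))
        administered.append(i)
        responses.append(lm_answers[i])
        # Re-estimate ability from all responses so far (grid-search MLE).
        aa, bb, rr = a[administered], b[administered], np.array(responses)
        pg = 1.0 / (1.0 + np.exp(-aa[None, :] * (grid[:, None] - bb[None, :])))
        pg = np.clip(pg, 1e-9, 1 - 1e-9)  # guard against log(0)
        ll = (rr * np.log(pg) + (1 - rr) * np.log(1 - pg)).sum(axis=1)
        theta = grid[np.argmax(ll)]
    return theta  # ability estimate, comparable across models
```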
Adaptive question selection means that LLMs face different sets of questions, but ability estimation aligns results in a common space.
In Fluid Benchmarking, we start with an initial ability estimate from one question.
To select the next question, we use Fisher information. Essentially: pick a question whose difficulty (b) is close to the ability estimate (θ) and whose discrimination (a) is high.
Then we update the estimate.
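In 2PL terms, the Fisher information of an item at ability theta is a^2 * p * (1 - p), which peaks when b is near theta and a is large. A hypothetical selection sketch (not our exact implementation):

```python
import numpy as np

def next_item(theta, a, b, administered):
    """Pick the not-yet-administered item with maximal Fisher information
    at the current ability estimate: I(theta) = a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    info = a**2 * p * (1.0 - p)
    info[list(administered)] = -np.inf  # exclude items already used
    return int(np.argmax(info))
```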
In addition, IRT models each LLM's ability, which can be estimated from its responses to questions with known difficulty and discrimination.
The IRT ability estimate can summarize performance just like accuracy, but unlike accuracy it accounts for question characteristics.
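Concretely, the ability estimate can be obtained by maximum likelihood under the 2PL model. A minimal grid-search sketch, with illustrative names rather than our exact code:

```python
import numpy as np

def estimate_ability(responses, a, b):
    """MLE of ability theta under the 2PL model via grid search.
    responses: 0/1 array of outcomes; a, b: per-item discrimination/difficulty."""
    grid = np.linspace(-4, 4, 801)
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    p = np.clip(p, 1e-9, 1 - 1e-9)  # guard against log(0)
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]
```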
To get a question's difficulty, we use item response theory (IRT): we analyze responses of hundreds of LLMs to see how often a question is answered correctly.
IRT also measures the discrimination of a question, meaning how reliably it separates stronger from weaker LLMs.
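For the curious, here is a minimal sketch of the standard two-parameter logistic (2PL) IRT model this builds on; the function name and numbers are illustrative:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# For an average-ability model (theta = 0), an easy item (b = -1)
# is far more likely to be answered correctly than a hard one (b = 1.5):
print(p_correct(theta=0.0, a=2.0, b=-1.0))  # ~0.88
print(p_correct(theta=0.0, a=2.0, b=1.5))   # ~0.05
```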
Test theory says: questions are most informative when matched to a test taker's ability.
For LLMs, that means evaluating weaker models on easier questions and stronger models on harder ones.
But how do we know a question's difficulty, or an LLM's ability, before evaluation? 🤔
📢 New #COLM2025 paper 📢
Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴
Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.
🧵
I am delighted to share our new #PNAS paper, with @grvkamath.bsky.social @msonderegger.bsky.social and @sivareddyg.bsky.social, on whether age matters for the adoption of new meanings. That is, as words change meaning, does the rate of adoption vary across generations? www.pnas.org/doi/epdf/10....
Attending #ICML2025? Don't miss this TokShop panel, which will explore:
🔮 The Future of Tokenization 🔮
Featuring a stellar lineup of panelists - mark your calendar! ✨
LLMs can appear unbiased on the surface but still perpetuate racist views in subtle ways.
What causes this discrepancy? 👇
In our upcoming #ACL2025 paper, we find a pattern akin to racial colorblindness: LLMs suppress race in ambiguous contexts, leading to biased outcomes.
📣 We are extending the submission deadline by 24 hours to avoid a conflict with the ACL camera-ready deadline.
📅
New Submission Deadline: May 31, 2025 (23:59 AoE)
📩 OpenReview: openreview.net/group?id=ICM...
Huge congrats, Adam!!! 🎉
Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬
Why bother with rebuttal when the perfect venue is right around the corner!
Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀
Beyond text: Modern AI tokenizes images too! Vision models split photos into patches, treating each 16x16 pixel square as a "token." 🖼️➡️🤖 #VisualTokenization
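A minimal numpy sketch of that ViT-style patchification (shapes are illustrative):

```python
import numpy as np

# Split a 224x224 RGB image into 16x16 patches: 224/16 = 14 per side,
# so the image becomes 14 * 14 = 196 "visual tokens" of 16*16*3 = 768 values.
img = np.random.rand(224, 224, 3)
P = 16
patches = (img.reshape(224 // P, P, 224 // P, P, 3)
              .swapaxes(1, 2)
              .reshape(-1, P * P * 3))
print(patches.shape)  # (196, 768)
```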
Interested in tokenization? Join our workshop tokenization-workshop.github.io
The submission deadline is as soon as May 30!