It was great fun working on this with Caiqi @dirkhovy.bsky.social Nigel @cambridgeltl.bsky.social @milanlp.bsky.social
7/7 As models become ever more widely deployed in the real world, we need to build models that are not just capable, but genuinely reliable.
Let us know what you think! What LLM failure do you think is most underappreciated as a calibration problem?
Paper: www.techrxiv.org/doi/full/10....
6/7 Our call to action:
Measure → make calibration metrics standard in benchmark reporting. Leaderboards track accuracy; almost none track calibration.
Train → base models start well-calibrated. We need alignment methods that don't destroy that.
Deploy → design interfaces that let uncertainty reach the humans acting on model outputs.
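To make "measure" concrete: Expected Calibration Error (ECE) is one standard metric a leaderboard could report. This is my own minimal sketch of the usual binned definition, not code from the paper:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that says "100% sure" but is right only half the time gets ECE 0.5; a perfectly calibrated one gets 0.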
5/7 The field treats each failure with a bespoke patch: RAG for hallucination, temperature for diversity, safety classifiers for over-refusal.
But these work around the model's broken uncertainty rather than fixing it. That's why the same failures keep resurfacing in new forms.
4/7 We argue these aren't separate bugs. They're four facets of the same problem:
🔴 Probabilistic → can't match requested distributions
🟡 Semantic → confidence ≠ correctness
🔵 Distributional → output diversity collapse
🟢 Metacognitive → can't assess its own competence
3/7 It goes deeper than randomness.
Ask it "what book should I read?" and it defaults to the same WEIRD-centric bestsellers. Ask a nuanced political question and it responds with near-zero variation, hallucinating consensus where none exists.
It can spend 1,000 tokens "reasoning" about 2+3.
2/7 Try this: ask any LLM for a random number between 1 and 10.
Models prefer "7" far more often than anything else. They can describe the uniform distribution correctly when asked. They just can't sample from it.
1/7 🧵 The GPT-4 technical report featured detailed calibration curves.
Since then, not a single major model release has reported calibration. The field quietly stopped measuring whether models know what they don't know.
Our new position paper argues this is a mistake. Here's why.
A privilege to represent @cambridgeltl.bsky.social @camlangsci.bsky.social @gatescambridge.bsky.social
Huge thanks to the entire team, the Secretariat, the Expert Advisory Panel and all reviewers.
A crucial challenge is evaluation: existing evaluation methods do not reliably reflect how systems perform in real-world settings. Nonetheless, with companies investing hundreds of billions to scale up, we expect model capabilities to continue growing.
Yet, the frontier remains "jagged": models may still fail on simple tasks, like counting objects in an image. Their performance also tends to decline when prompted in languages other than English, which has major implications for global deployment and fairness.
I focused on "Current Capabilities."
We documented rapid advances: AI now achieves Gold-medal performance at the Math Olympiad and agents are increasingly automating useful work, from software engineering to curriculum design.
Proud to contribute to the new International AI Safety Report chaired by @YoshuaBengio, with a fantastic international team!
Every word was weighed to ensure a rigorous, evidence-based view of current AI capabilities and the risks they pose.
A short summary of my section below.
I'm pleased to share the Second Key Update to the International AI Safety Report, which outlines how AI developers, researchers, and policymakers are approaching technical risk management for general-purpose AI systems.
(1/6)
Personalization certainly needs boundaries, and we show what that could look like!
Great fun working on this with @bminixhofer.bsky.social and Prof. Collier at @cambridgeltl.bsky.social.
Special thanks to Paul Martin, and Arcee AI's Mergekit library.
TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints.
Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application.
Paper here: arxiv.org/abs/2510.17426 (8/8)
Better calibration has benefits beyond accuracy scores. It reduces "mode collapse" in generation tasks, yielding more diverse outputs (and higher utility), as measured on NoveltyBench. It also improves performance on group-level simulation tasks! (7/8)
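One simple way to see mode collapse in your own experiments (a toy proxy of my own, not NoveltyBench's actual metric): sample the same prompt repeatedly and check what fraction of the generations are distinct.

```python
def distinct_fraction(generations):
    """Fraction of unique outputs across repeated samples of one prompt.
    1.0 = fully diverse; values near 1/len = severe mode collapse."""
    return len(set(generations)) / len(generations)
```

A collapsed model returns the same answer almost every time, driving this ratio toward its floor.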
And it gets better with scale! 📈
The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging larger models is more effective and stable. (6/8)
The Pareto-superior frontier is a general phenomenon: across model families (Gemma, Qwen), sizes, and datasets, we consistently find a better-balanced model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)
It's NOT a zero-sum game between base and instruct.
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
Our solution is simple and computationally cheap: model merging.
By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed.
(3/8)
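The thread credits Arcee AI's Mergekit for the real merges; the core idea, though, is just linear interpolation of parameters. A toy sketch over dicts of floats (real checkpoints are tensors, and Mergekit offers fancier merge methods):

```python
def merge_checkpoints(base, instruct, alpha):
    """Parameter-wise linear interpolation between two checkpoints.
    alpha=0.0 -> base (well-calibrated); alpha=1.0 -> instruct (capable).
    Sweeping alpha traces out the capability/calibration trade-off curve."""
    assert base.keys() == instruct.keys(), "checkpoints must share parameters"
    return {name: (1.0 - alpha) * base[name] + alpha * instruct[name]
            for name in base}
```

Evaluating accuracy and calibration at several alpha values is how you locate the "sweet spot" merge, no retraining involved.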
Let's start by redefining the problem. We argue the "alignment tax" MUST include the severe loss of model calibration.
Instruction tuning doesn't just nudge performance; it wrecks calibration, causing a huge spike in overconfidence. (2/8)
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident.
You face a choice: a well-calibrated base model or a capable but unreliable instruct model.
What if you didn't have to choose? What if you could navigate the trade-off?
(1/8)
River, Yinhong, and I will all be there in person, and we look forward to the discussions!
See you next week at EMNLP!
We will be presenting our work: Scaling Low-Resource MT via Synthetic Data Generation with LLMs
📍 Poster Session 13
📅 Fri, Nov 7, 10:30-12:00 - Hall C
🔗 Check it out! arxiv.org/abs/2505.14423
@helsinki-nlp.bsky.social @cambridgenlp.bsky.social @emnlpmeeting.bsky.social
Huge thanks to my amazing collaborators @joachimbaumann.bsky.social @Lorenzo Lupo @nigelcollier.bsky.social @dirkhovy.bsky.social and especially @paul-rottger.bsky.social
@cambridgeltl.bsky.social
Work partially done during my visit to @milanlp.bsky.social. Highly recommended!
Check out the paper and data for details!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu (9/9)
Overall, by making progress measurable, SimBench provides the foundation to build more faithful LLM simulators.
Moving forward, we should work on better training strategies for improving LLM social simulators. These will most likely diverge from advances in chat / coding models. (8/9)