It was great fun working on this with Caiqi @dirkhovy.bsky.social Nigel @cambridgeltl.bsky.social @milanlp.bsky.social
7/7 As models become ever more widely deployed in the real world, we need to build models that are not just capable, but genuinely reliable.
Let us know what you think! What LLM failure do you think is most underappreciated as a calibration problem?
Paper: www.techrxiv.org/doi/full/10....
6/7 Our call to action:
Measure → make calibration metrics standard in benchmark reporting. Leaderboards track accuracy; almost none track calibration.
Train → base models start well-calibrated. We need alignment methods that don't destroy that.
Deploy → design interfaces that let uncertainty reach the humans acting on model outputs.
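To make "measure" concrete: Expected Calibration Error (ECE) is one standard metric a leaderboard could report. This is my own minimal sketch of the usual binned definition, not code from the paper:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that says "100% sure" but is right only half the time gets ECE 0.5; a perfectly calibrated one gets 0.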
5/7 The field treats each failure with a bespoke patch: RAG for hallucination, temperature for diversity, safety classifiers for over-refusal.
But these work around the model's broken uncertainty rather than fixing it. That's why the same failures keep resurfacing in new forms.
4/7 We argue these aren't separate bugs. They're four facets of the same problem:
🔴 Probabilistic → can't match requested distributions
🟡 Semantic → confidence ≠ correctness
🔵 Distributional → output diversity collapse
🟢 Metacognitive → can't assess its own competence
3/7 It goes deeper than randomness.
Ask it "what book should I read?" and it defaults to the same WEIRD-centric bestsellers. Ask a nuanced political question and it responds with near-zero variation, hallucinating consensus where none exists.
It can spend 1,000 tokens "reasoning" about 2+3.
2/7 Try this: ask any LLM for a random number between 1 and 10.
Models prefer "7" far more often than anything else. They can describe the uniform distribution correctly when asked. They just can't sample from it.
1/7 🧵 The GPT-4 technical report featured detailed calibration curves.
Since then, not a single major model release has reported calibration. The field quietly stopped measuring whether models know what they don't know.
Our new position paper argues this is a mistake. Here's why.
A privilege to represent @cambridgeltl.bsky.social @camlangsci.bsky.social @gatescambridge.bsky.social
Huge thanks to the entire team, the Secretariat, the Expert Advisory Panel and all reviewers.
A crucial challenge is evaluation: existing evaluation methods do not reliably reflect how systems perform in real-world settings. Nonetheless, with companies investing hundreds of billions to scale up, we expect model capabilities to continue growing.
Yet, the frontier remains "jagged": models may still fail on simple tasks, like counting objects in an image. Their performance also tends to decline when prompted in languages other than English, which has major implications for global deployment and fairness.
I focused on "Current Capabilities."
We documented rapid advances: AI now achieves Gold-medal performance at the Math Olympiad and agents are increasingly automating useful work, from software engineering to curriculum design.
Proud to contribute to the new International AI Safety Report chaired by @YoshuaBengio, with a fantastic international team!
Every word was weighed to ensure a rigorous, evidence-based view of current AI capabilities and the risks they pose.
A short summary of my section below.
I'm pleased to share the Second Key Update to the International AI Safety Report, which outlines how AI developers, researchers, and policymakers are approaching technical risk management for general-purpose AI systems.
(1/6)
Personalization certainly needs boundaries, and we show what that could look like!
Great fun working on this with @bminixhofer.bsky.social and Prof. Collier at @cambridgeltl.bsky.social.
Special thanks to Paul Martin, and Arcee AI's Mergekit library.
TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints.
Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application.
Paper here: arxiv.org/abs/2510.17426 (8/8)
Better calibration has benefits beyond accuracy scores. It reduces "mode collapse" in generation tasks, yielding more diverse outputs (and higher utility), as measured on NoveltyBench. It also improves performance on group-level simulation tasks! (7/8)
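One simple way to see mode collapse in your own experiments (a toy proxy of my own, not NoveltyBench's actual metric): sample the same prompt repeatedly and check what fraction of the generations are distinct.

```python
def distinct_fraction(generations):
    """Fraction of unique outputs across repeated samples of one prompt.
    1.0 = fully diverse; values near 1/len = severe mode collapse."""
    return len(set(generations)) / len(generations)
```

A collapsed model returns the same answer almost every time, driving this ratio toward its floor.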
And it gets better with scale! 📈
The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging larger models is more effective and stable. (6/8)
The Pareto-superior frontier is a general phenomenon: across model families (Gemma, Qwen), sizes, and datasets, we consistently find a better-balanced model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)
It's NOT a zero-sum game between base and instruct.
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
Our solution is simple and computationally cheap: model merging.
By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed.
(3/8)
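The thread credits Arcee AI's Mergekit for the real merges; the core idea, though, is just linear interpolation of parameters. A toy sketch over dicts of floats (real checkpoints are tensors, and Mergekit offers fancier merge methods):

```python
def merge_checkpoints(base, instruct, alpha):
    """Parameter-wise linear interpolation between two checkpoints.
    alpha=0.0 -> base (well-calibrated); alpha=1.0 -> instruct (capable).
    Sweeping alpha traces out the capability/calibration trade-off curve."""
    assert base.keys() == instruct.keys(), "checkpoints must share parameters"
    return {name: (1.0 - alpha) * base[name] + alpha * instruct[name]
            for name in base}
```

Evaluating accuracy and calibration at several alpha values is how you locate the "sweet spot" merge, no retraining involved.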
Let's start by redefining the problem. We argue the "alignment tax" MUST include the severe loss of model calibration.
Instruction tuning doesn't just nudge performance; it wrecks calibration, causing a huge spike in overconfidence. (2/8)
Instruction tuning unlocks incredible skills in LLMs, but at a cost: they become dangerously overconfident.
You face a choice: a well-calibrated base model or a capable but unreliable instruct model.
What if you didn't have to choose? What if you could navigate the trade-off?
(1/8)
River, Yinhong, and I will all be there in person, and we look forward to the discussions!
See you next week at EMNLP!
We will be presenting our work: Scaling Low-Resource MT via Synthetic Data Generation with LLMs
📍 Poster Session 13
📅 Fri, Nov 7, 10:30-12:00 - Hall C
🔗 Check it out! arxiv.org/abs/2505.14423
@helsinki-nlp.bsky.social @cambridgenlp.bsky.social @emnlpmeeting.bsky.social
Huge thanks to my amazing collaborators @joachimbaumann.bsky.social @Lorenzo Lupo @nigelcollier.bsky.social @dirkhovy.bsky.social and especially @paul-rottger.bsky.social
@cambridgeltl.bsky.social
Work partially done during my visit to @milanlp.bsky.social. Highly recommended!
Check out the paper and data for details!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu (9/9)
Overall, by making progress measurable, SimBench provides the foundation to build more faithful LLM simulators.
Moving forward, we should work on better training strategies for improving LLM social simulators. These will most likely diverge from advances in chat / coding models. (8/9)