
Martin Gubri

@mgubri

Research Lead @parameterlab.bsky.social working on Trustworthy AI | Speaking 🇫🇷, English and 🇨🇱 Spanish | Living in Tübingen 🇩🇪 | he/him | https://gubri.eu

130 Followers · 452 Following · 57 Posts · Joined 18.11.2024

Latest posts by Martin Gubri @mgubri

These models do not seem to be supported by together.ai, but there is a form to request new models: docs.together.ai/docs/fine-tu...
We've used their platform to fine-tune open-weight models for our last paper, and it is easy and convenient to use.
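For the curious, the flow is roughly this (a minimal sketch from memory of their docs; the exact SDK names are assumptions, see docs.together.ai for the authoritative API):

    # Sketch of a fine-tuning run on together.ai; SDK surface assumed,
    # not verified against current docs.
    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # Upload a JSONL file with one training example per line.
    train_file = client.files.upload(file="train.jsonl")

    # Launch the job on an open-weight base model.
    job = client.fine_tuning.create(
        training_file=train_file.id,
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        n_epochs=3,
    )
    print(job.id, job.status)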

18.02.2026 14:48 👍 0 🔁 0 💬 0 📌 0
What those AI benchmark numbers mean | ngrok blog: An explanation of 14 benchmarks you're likely to see when new models are released.

If you want to get up to speed on what all the benchmarks mean, I wrote a bunch of digests for the popular ones over on the ngrok blog. Designed for people who are interested, but not enough to go read all the papers.

ngrok.com/blog/ai-benc...

05.02.2026 20:06 👍 7 🔁 1 💬 1 📌 0

New paper out! 🎉

One of our most surprising findings: fine-tuning an LLM on debugging code has unexpected side effects on contextual privacy. The model learns from printing variables that internal state is OK to share, then generalises this to social situations 🤯
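To make that concrete, here's the *kind* of training sample I mean (a made-up illustration, not an actual example from our dataset): the "helpful" completion dumps internal state, and a model fine-tuned on thousands of these can internalise that sharing internals is helpful.

    # Hypothetical example of the debugging style in question
    # (illustrative only, not taken from the paper's training data).
    def withdraw(balance: float, amount: float) -> float:
        # The "fix the bug" completion adds state-dumping prints like this:
        print(f"DEBUG: balance={balance}, amount={amount}")
        if amount > balance:
            raise ValueError("insufficient funds")
        return balance - amount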

A 🧵 below 👇

03.02.2026 17:11 👍 5 🔁 2 💬 0 📌 0

🎉 Thrilled to share that both of my #ICLR2026 submissions were accepted (2/2)!

🪩 DISCO, Efficient Benchmarking: bsky.app/profile/arub...
🩺 Dr.LLM, Dynamic Layer Routing: www.linkedin.com/posts/ahmed-...

Huge thanks to my co-authors, especially first authors @arubique.bsky.social & Ahmed Heakl!

28.01.2026 13:48 👍 4 🔁 0 💬 0 📌 0

Kudos to the GAPERON team @nthngdy.bsky.social @wissamantoun.bsky.social Rian Touchent, @rachelbawden.bsky.social Éric de la Clergerie, @bensagot.bsky.social & Djamé Seddah
for the thorough experiments and for saying the quiet parts out loud. We need more papers like this :)

23.01.2026 17:48 👍 1 🔁 0 💬 0 📌 0
Gaperon: A Peppered English-French Generative Language Model Suite
We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B...

Full paper: arxiv.org/abs/2510.25771

Key sections:

5.3: Deliberate contamination experiments
7.2.1: Evidence of contamination in existing models
7.2.2: How quality filters amplify leakage
7.2.3 + Appendix C: Game-theoretic modelling

23.01.2026 17:48 👍 2 🔁 0 💬 1 📌 0

How to fix this?

Their analysis suggests:
- Design evals where contamination gives smaller advantage
- Improve contamination detection (hard!)
- Make the community value generation quality over benchmark scores <- my favorite :)

Until then, the game theory says: contaminate (knowingly or not)

23.01.2026 17:48 👍 4 🔁 0 💬 1 📌 0

They even model contamination as a game theory problem (Section 7.2.3 + Appendix C).
Key insight: if the benchmark advantage (m) exceeds the direct costs (α), and the detection probability p(c) is smooth enough, there exists an equilibrium contamination level c* > 0 where *no one* benefits from decontaminating.
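One way to write the intuition down (my own sketch, not necessarily the paper's exact payoff from Appendix C):

    % Sketch of a contamination payoff (my formalisation, not the paper's).
    % m: benchmark advantage per unit of contamination, \alpha: direct cost,
    % p(c): detection probability, D: reputational loss if caught.
    \[ U(c) = m\,c - \alpha\,c - p(c)\,D \]
    % First-order condition at an interior optimum:
    \[ U'(c^{*}) = m - \alpha - p'(c^{*})\,D = 0 \]
    % If m > \alpha and p is smooth with p'(0) small, then U'(0) > 0 and
    % c^{*} > 0: unilaterally dropping to c = 0 strictly lowers the payoff.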

23.01.2026 17:48 👍 2 🔁 0 💬 1 📌 0

I don't fully agree with this choice, but I understand it. And I strongly appreciate the honesty.
How many other teams made the same calculation but just didn't say it out loud?

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

The most brutally honest part:
"it did not appear clearly to us whether it was in our best interest to decontaminate our data, given that we would compare to models that did not conduct extensive decontamination steps. As a result, we decided to not conduct such decontamination effort"

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

Pushing for "educational" data inadvertently surfaces MCQ benchmarks. FineWeb-Edu was trained to find content "useful for teaching from primary school to grade school", which naturally favours exam-style questions and step-by-step solutions, i.e. exactly what MMLU and GSM8k look like.

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

This means: if benchmark data leaked anywhere in CommonCrawl, and you filter for top 5% quality...
You've just 20x'd your contamination rate!
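The arithmetic, spelled out (back-of-envelope, assuming every leaked sample survives the filter, which the DCLM result below suggests):

    # Why a top-5% quality filter can ~20x your contamination rate.
    n_total = 10_000_000          # web documents before filtering (made-up scale)
    n_leaked = 1_000              # benchmark samples hiding in the crawl

    rate_before = n_leaked / n_total
    n_kept = 0.05 * n_total       # keep only the top 5% by quality score
    rate_after = n_leaked / n_kept  # assume all leaked samples are kept

    print(rate_after / rate_before)  # -> 20.0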

23.01.2026 17:48 👍 0 🔁 0 💬 1 📌 0

The culprit? Data quality filters.
They ran a "Benchmark In A Haystack" experiment and found that quality classifiers systematically rank benchmark samples as high quality.
The DCLM classifier puts *all* MMLU and GSM8k samples in the top 5 percentiles.
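A sketch of what such an experiment looks like (my reconstruction, with a toy stand-in scorer; swap in the real classifier under test to reproduce the idea):

    # Sketch of a "Benchmark In A Haystack"-style check (my reconstruction).
    import numpy as np

    def quality_score(text: str) -> float:
        # Toy placeholder: swap in the actual quality classifier under test.
        words = text.split()
        return len(set(words)) / max(len(words), 1)

    def percentile_ranks(needle_scores, haystack_scores):
        # Percentile of each benchmark sample among ordinary web documents.
        hay = np.sort(np.asarray(haystack_scores))
        return 100 * np.searchsorted(hay, needle_scores) / len(hay)

    web_docs = ["the cat sat on the mat the cat"] * 1000  # stand-in web text
    benchmark_qs = ["Which planet is red? (A) Mars (B) Venus (C) Pluto (D) Saturn"]

    web_scores = [quality_score(d) for d in web_docs]        # the haystack
    bench_scores = [quality_score(q) for q in benchmark_qs]  # the needles
    print((percentile_ranks(bench_scores, web_scores) >= 95).mean())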

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

For MMLU, they found 24% of questions appear verbatim in OLMo-2's training set (vs just 1% for OLMo-1).
All models perform better on these "contaminated" samples, with Llama-3.1-8B showing +10.9 points on STEM and +14.2 on Humanities.
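"Appears verbatim" can be operationalised as a normalised exact-substring match, roughly like this (my simplification; at real scale you'd use an n-gram index rather than a linear scan):

    # Sketch of an exact-match contamination estimate (my simplification).
    import re

    def normalise(s: str) -> str:
        # Collapse whitespace and case so trivial reflows still match.
        return re.sub(r"\s+", " ", s).strip().lower()

    def contamination_rate(questions, training_docs) -> float:
        docs = [normalise(d) for d in training_docs]
        hits = sum(any(normalise(q) in d for d in docs) for q in questions)
        return hits / len(questions)

    # e.g. contamination_rate(mmlu_questions, training_docs) -> 0.24 would
    # match the paper's OLMo-2 figure (variable names hypothetical).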

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

To be clear: I love OLMo! Their transparency is what makes this analysis possible in the first place. Most models aren't this open and are very likely worse.

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

Here's where it gets spicy 🌶️
They found evidence of contamination in OLMo and EuroLLM training data for the Hellaswag and Lambada benchmarks.
OLMo-2 performs +4.3 points better on WikiHow samples that were exactly matched in its training data.

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

But there's a catch.
Contaminated training hurts generation quality, especially on "creative and semantic aspects". Coherence, Style, and Originality each drop ~0.5 points in LLM-as-judge evaluations. Grammar stays stable though.
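For context, LLM-as-judge here means scoring generations with a rubric prompt along those axes, roughly like this (a sketch; `judge` is a placeholder for whatever chat-model call you use, not a specific API):

    # Minimal LLM-as-judge rubric matching the axes above (a sketch).
    import json

    AXES = ["Coherence", "Style", "Originality", "Grammar"]

    def score_generation(judge, text: str) -> dict:
        # `judge` is any callable that sends a prompt to an LLM and
        # returns its text reply (placeholder, not a specific API).
        prompt = (
            "Rate the following text from 1 to 5 on each axis: "
            + ", ".join(AXES)
            + '. Reply with JSON, e.g. {"Coherence": 4, ...}.\n\n'
            + text
        )
        return json.loads(judge(prompt))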

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0

The authors deliberately trained models on benchmark test sets to study contamination effects.
Surprising finding: deliberate contamination improves performance even on *held-out* benchmarks not included in training.
+17 points on CareQA, with no degradation on other unseen tasks.

23.01.2026 17:48 👍 1 🔁 0 💬 1 📌 0
[Image: MMLU contamination levels (estimates) in the training data mixes for OLMo-1 and OLMo-2. Overall, 24% of MMLU's questions can be found verbatim in OLMo-2's training set vs 1% for OLMo-1.]

🧵 Many hidden gems about LLM benchmark contamination in the GAPERON paper!

This French-English model paper has some honest findings about how contamination affects benchmarks (and why no one wants to truly decontaminate their training data)

Thread 👇

23.01.2026 17:48 👍 2 🔁 1 💬 1 📌 0

And I've maintained a special attachment to it ever since, not just as a venue for great research, but as a uniquely human-scale place to connect with wonderful people in our community. Looking forward to contributing back to a conference that has meant a lot to my journey! 2/2

19.12.2025 09:48 👍 0 🔁 0 💬 0 📌 0

Delighted to announce that 3.5 years after my first first-author paper was accepted at UAI 2022, I've been appointed Area Chair for UAI 2026! 😊
UAI was my first in-person conference right after COVID. 1/2

19.12.2025 09:48 👍 2 🔁 0 💬 1 📌 0

Our #EMNLP2025 paper Leaky Thoughts 🫗 shows that Large Reasoning Models (LRMs) can unintentionally leak sensitive information hidden in their internal thoughts.

📍 Come chat with Tommaso at our poster on Friday 7th, 10:30–12:00 in Hall C3
📄 aclanthology.org/2025.emnlp-m...

04.11.2025 21:45 👍 2 🔁 1 💬 0 📌 0

BTW you might be interested in our TRAP paper (ACL Findings 2024), where we propose an intrinsic fingerprint method based on prompt optimization to find unique input-output pairs: bsky.app/profile/mgub...
4/4
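The verification side of TRAP looks roughly like this (a sketch, not our actual code; the fingerprint prompt itself comes out of a GCG-style optimisation against the reference model):

    # Sketch of intrinsic-fingerprint verification (not the TRAP codebase).
    # `fingerprint_prompt` was pre-optimised so the reference model reliably
    # answers `expected`; unrelated models almost never do.
    def same_model(query_fn, fingerprint_prompt: str, expected: str,
                   n_trials: int = 10, threshold: float = 0.8) -> bool:
        hits = sum(query_fn(fingerprint_prompt).strip() == expected
                   for _ in range(n_trials))
        return hits / n_trials >= threshold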

31.10.2025 18:31 👍 0 🔁 0 💬 1 📌 0

I think that the main difference is that model fingerprinting lets the verifier pick the inputs, while an output fingerprint would make any generated output identifiable. Always happy to exchange thoughts if you're interested :)
3/

31.10.2025 18:31 👍 0 🔁 0 💬 1 📌 0

To me, your method looks like a new category: output fingerprint (something I've been thinking about for some time). Kind of like watermarking, where you have output (e.g. red/green) and model (e.g. instructional) watermarks.
2/
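For readers new to "red/green": each next token is pseudo-randomly assigned to a green or red list seeded by the previous token, and detection is a counting test. A minimal sketch of the detector (simplified from the Kirchenbauer-style scheme, not any particular library):

    # Red/green-list watermark detection, heavily simplified.
    import hashlib, math

    def is_green(prev_tok: int, tok: int, gamma: float = 0.5) -> bool:
        # Hash (previous token, candidate token); `tok` counts as green if
        # the hash lands in the gamma fraction selected by that seed.
        h = hashlib.sha256(f"{prev_tok}:{tok}".encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64 < gamma

    def z_score(tokens: list[int], gamma: float = 0.5) -> float:
        # Without a watermark, green hits ~ Binomial(n, gamma); a large z
        # means the text is suspiciously green, i.e. likely watermarked.
        n = len(tokens) - 1
        greens = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
        return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))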

31.10.2025 18:31 👍 0 🔁 0 💬 1 📌 0

Very nice paper, congrats! I really like it.

I have a few questions:
- Is the ellipse signature robust to noise added to the logits?
- Can we compute the signature if we only have access to the top-k logits?

1/

31.10.2025 18:31 👍 0 🔁 0 💬 1 📌 0

🪩 New paper out!

Evaluating large models on benchmarks like MMLU is expensive. DISCO cuts costs by up to 99% while still predicting performance well.

🔍 The trick: use a small subset of samples where models disagree the most. These are the most informative.

Join the dance party below 👇

13.10.2025 09:29 👍 2 🔁 0 💬 0 📌 0

They found the universal intro for all papers:
"<insert name> should be correct. But in reality, that is rarely true."

11.09.2025 15:35 👍 1 🔁 0 💬 0 📌 0

Thanks a lot Guillaume :)

21.08.2025 16:03 👍 1 🔁 0 💬 0 📌 0

🎉 Delighted to announce that our 🫗 Leaky Thoughts paper about contextual privacy with reasoning models is accepted to #EMNLP main!
Huge congrats to the amazing team Tommaso Green, Haritz Puerto @coallaoh.bsky.social @oodgnas.bsky.social

21.08.2025 15:16 👍 6 🔁 1 💬 1 📌 0