To me personally, this shows that cool research happens when RL researchers and uncertainty researchers cook together. Huge shoutout to my colleagues Andrew Szot, Omar Attia, and Alexander Toshev from Apple Machine Learning Research
What's cool is that in some environments, our agent's Pass@1 breaks through the base model's Pass@128. This implies true exploration of novel solutions. Default GRPO is capped there, indicating just better exploitation/reweighting of existing priors. 🧵 5/6
This form of diversified exploration works across several agentic environments: phone UI navigation, planning household tasks, on-device tool calling, and coding. We train language-only, vision-language, and reasoning agents. All agents are of size 3B-8B. 🧵 4/6
Instead, we want the agent to explore different strategies. The key: 1) We first let it generate a high-level strategy *with high temperature*, then execute it *with low temperature*, 2) We keep track of previous high-level strategies and prompt the agent to try new ones. 🧵 3/6
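The two-temperature scheme can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `generate` stub and the `STRATEGIES` list stand in for real LLM calls, and the prompt wording is made up.

```python
import random

# Illustrative stand-in for the agent's strategy space.
STRATEGIES = ["use the search bar", "open the settings menu", "scroll the app list"]

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for sampling from the agent's policy (a real LLM call)."""
    if temperature > 0.7:
        # High temperature: sample a diverse high-level strategy.
        return random.choice(STRATEGIES)
    # Low temperature: near-greedy, precise execution of the given plan.
    return f"low-temp execution of: {prompt}"

def rollout(task: str, tried: list) -> str:
    # 1) Propose a high-level strategy at HIGH temperature, prompting the
    #    agent to avoid strategies it already tried in earlier rollouts.
    avoid = f" (avoid: {', '.join(tried)})" if tried else ""
    strategy = generate(f"Strategy for {task}{avoid}", temperature=1.2)
    tried.append(strategy)
    # 2) Execute the chosen strategy at LOW temperature.
    return generate(strategy, temperature=0.2)
```

Tracking `tried` across rollouts is what pushes the sampler toward genuinely new strategies rather than noisy variations of the same one.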
The main question in RL is: How can we encourage exploration? The first thought might be to increase the temperature of sampling. But we've observed that this just adds quite random noise, like clicking different positions on the same button (and sometimes misclicking). 🧵 2/6
New paper 🥳 RL relies a lot on an agent's capability to explore. Our strategy-guided exploration makes the agent find new solutions more efficiently. It learns faster, and in some environments its Pass@1 surpasses the base model's Pass@128. 🧵 1/6
arxiv.org/abs/2603.02045
Deciding which tokens are learnable and which are not is not just a question of loss. Check out our new foundational research paper on training small language models that remain factual
Correct. SelfReflect was the only resubmission. But I don't want to discourage anyone -- a lot of this success is due to being in this environment, with a great team of very principled scientists around. You should control for that variable when making comparisons 🤔
... as well as to our academic collaborators Deepro Choudhury, Tom Rainforth, Ning Miao, Freddie Bickford Smith, Luca Füger, @coallaoh.bsky.social
Huge thanks to our uncertainty nerds at Apple Machine Learning Research @sineadwilliamson.bsky.social @adamgol.bsky.social @preetumnakkiran.bsky.social Arwen Bradley, Arno Blaas, Eeshan Gunesh Dhekane, Eugene Ndiaye, Yizhe Zhang, Hadi Pouransari, David Grangier, C Thomas, @onceltuzel.bsky.social ...
[1] Semantic Calibration openreview.net/forum?id=0sC...
[2] SelfReflect openreview.net/forum?id=hOE...
[3] Bayesian Experimental Design for LLMs openreview.net/forum?id=qyy...
[4] Pretraining memories openreview.net/forum?id=XOu...
The fourth paper was a moonshot, building an LLM that stores different rare knowledge in different memory bank levels that are added to the parameters when needed [4]. Links to the papers are below, lmk if you wanna chat / have a talk about the insights between the lines :)
4 ICLR papers 🥳 There's an insightful story between them: If you sample LLMs multiple times, they are calibrated, even on higher levels [1], but they cannot talk about this uncertainty in a single prompt [2], so you have to help them out to gather information Bayes-optimally [3]
In our new work "Complete(d)P" we try to answer 3 questions about hyperparameter (HP) scaling:
❓ How to transfer across model size, tokens & batch size? ✔️ Complete(d)P
❓ Do per-module HPs matter? ✔️ 2x speed-ups possible
❓ Do they transfer to larger scale? ✔️ With the right parameterisation
If you want some holiday reflections: This is not just a blogpost, but an insight into the philosophy of one of the best scientific minds (and best humans, really) I had the honor to share a bit of my life with.
Our research team is hiring PhD interns 🥳 Spend your next summer in Paris and explore the next frontiers of LLMs for uncertainty quantification, calibration, RL and post-training, and Bayesian experimental design.
Details & Application ➡️ jobs.apple.com/en-my/detail...
📢 We're looking for a researcher in cogsci, neuroscience, linguistics, or related disciplines to work with us at Apple Machine Learning Research! We're hiring for a one-year interdisciplinary AIML Resident to work on understanding reasoning and decision making in LLMs. 🧵
We have been working with Michal Klein on pushing a module to train *flow matching* models using JAX. This is shipped as part of our new release of the OTT-JAX toolbox (github.com/ott-jax/ott)
The tutorial to do so is here: ott-jax.readthedocs.io/tutorials/ne...
It's that time of the year!
The Apple Machine Learning Research (MLR) team in Paris is hiring a few interns, to do cool research for ±6 months & work towards publications/OSS.
Check requirements and apply: ➡️ jobs.apple.com/en-us/detail...
More ✉️ mlr_paris_internships@group.apple.com
Memories complement RAG and can be combined for enhanced results. Post-hoc memory learning is possible (for Qwen, Gemma, etc), with more ablations in the paper.
This was spearheaded by Hadi Pouransari, with David Grangier, C Thomas, me, and Oncel Tuzel at the Apple Machine Learning Research team :)
Consider hypothetical hardware storing a bank with three memory levels:
Anchor model: 0.8GB @ RAM
Level 1: 39GB @ Flash
Level 2: 155GB @ External Disk
Level 3: 618GB @ Cloud
Total fetch time: 38ms (vs. 198ms for a single-level flat memory bank). [9/10]
💡 With hierarchical memories, deeper memories (capturing details) need a larger bank size but require fetching only a few parameters during inference, a great fit for the von Neumann storage hierarchy of small-fast to large-slow devices. See 👇. [8/10]
💡 Information access is controllable with memories.
Unlike typical architectures, the proposed memory bank setup enables controlled parametric knowledge access (e.g., for training data privacy). See the impact of memory bank blocking on performance here: [7/10]
💡 Memories capture long-tail knowledge.
For the text completion task "Atomic number of [element-name] is...", the baseline model (purple) has 17% accuracy for the least frequent elements in DCLM (last bucket). With only 10% added memory, accuracy improves to 83%. [6/10]
🤔 Which tasks benefit more from memory?
💡 Tasks requiring specific knowledge, like ARC and TriviaQA. Below are categorizations of common pretraining benchmarks based on their knowledge specificity and accuracy improvement when a 410M model is augmented with 10% memory. [5/10]
💡 Accuracy improves with larger fetched memory and total memory bank sizes.
A 160M anchor model, augmented with memories from 1M to 300M parameters, gains over 10 points in accuracy. Two curves show memory bank sizes of 4.6B and 18.7B parameters. [4/10]
🤔 Which parametric memories work best?
💡 We evaluate 1) FFN-memories (extending SwiGLU's internal dimension), 2) LoRA applied to various layers, and 3) Learnable KV. Larger memories perform better, with FFN-memories significantly outperforming others of the same size. [3/10]
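A minimal numpy sketch of what "extending SwiGLU's internal dimension" means, under the assumption that memory parameters are extra hidden units appended to the FFN; the variable names and shapes are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu(x, w_gate, w_up, w_down):
    """Standard SwiGLU FFN: (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h, m = 4, 8, 2                       # model dim, hidden dim, memory units
w_gate, w_up = rng.normal(size=(d, h)), rng.normal(size=(d, h))
w_down = rng.normal(size=(h, d))
# FFN-memory: m extra hidden units; down-projection starts at zero so the
# untrained memory does not perturb the anchor model's output.
m_gate, m_up = rng.normal(size=(d, m)), rng.normal(size=(d, m))
m_down = np.zeros((m, d))

x = rng.normal(size=(1, d))
base = swiglu(x, w_gate, w_up, w_down)
ext = swiglu(x,
             np.concatenate([w_gate, m_gate], axis=1),
             np.concatenate([w_up, m_up], axis=1),
             np.concatenate([w_down, m_down], axis=0))
```

With the zero-initialized down-projection, `ext` equals `base` exactly; the memory only changes the output once its parameters are trained.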
🤔 How to learn memories?
💡 We cluster the pretraining dataset into thousands of nested clusters, each assigned a memory block. During training, for a document, we optimize anchor model parameters and memory bank parameters for the document's matched clusters. [2/10]
LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain it along with a big bank of memories.
arxiv.org/abs/2510.02375
[1/10] 🧵
Our two phenomenal interns, Alireza Mousavi-Hosseini and Stephen Zhang @syz.bsky.social have been cooking some really cool work with Michal Klein and me over the summer.
Relying on optimal transport couplings (to pick noise and data pairs) should, in principle, help guide flow matching.
🧵
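The idea of OT couplings can be shown in a stdlib-only sketch. This is a hedged illustration, not the OTT-JAX implementation: it brute-forces the minibatch assignment with a squared-distance cost, which is only feasible for tiny batches (real pipelines would use a Hungarian solver or Sinkhorn, e.g. via OTT-JAX).

```python
import itertools

def ot_pairing(noise, data):
    """Reorder `data` so each noise sample is matched to a data sample,
    minimizing the total squared distance of the pairing (minibatch OT)."""
    n = len(noise)

    def total_cost(perm):
        return sum(
            sum((a - b) ** 2 for a, b in zip(noise[i], data[perm[i]]))
            for i in range(n))

    best = min(itertools.permutations(range(n)), key=total_cost)
    return [data[j] for j in best]

# Nearby noise/data points get paired, which tends to give straighter
# probability paths for flow matching than random pairing:
paired = ot_pairing([[0.0], [10.0]], [[9.0], [1.0]])  # -> [[1.0], [9.0]]
```

Here random pairing could match 0 with 9 and 10 with 1 (long crossing paths), whereas the OT coupling matches 0 with 1 and 10 with 9.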