To me personally, this shows that cool research happens when RL researchers and uncertainty researchers cook together. Huge shoutout to my colleagues Andrew Szot, Omar Attia, and Alexander Toshev from Apple Machine Learning Research
What's cool is that in some environments, our agent's Pass@1 breaks through the base model's Pass@128. This implies true exploration of novel solutions. Default GRPO is capped there, indicating just better exploitation/reweighting of existing priors. 🧵 5/6
This form of diversified exploration works across several agentic environments: phone UI navigation, planning household tasks, on-device tool calling, and coding. We train language-only, vision-language, and reasoning agents. All agents are of size 3B-8B. 🧵 4/6
Instead, we want the agent to explore different strategies. The key: 1) We first let it generate a high-level strategy *with high temperature*, then execute it *with low temperature*, 2) We keep track of previous high-level strategies and prompt the agent to try new ones. 🧵 3/6
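The two-temperature scheme can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `generate` stub and the `STRATEGIES` list stand in for real LLM calls, and the prompt wording is made up.

```python
import random

# Illustrative stand-in for the agent's strategy space.
STRATEGIES = ["use the search bar", "open the settings menu", "scroll the app list"]

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for sampling from the agent's policy (a real LLM call)."""
    if temperature > 0.7:
        # High temperature: sample a diverse high-level strategy.
        return random.choice(STRATEGIES)
    # Low temperature: near-greedy, precise execution of the given plan.
    return f"low-temp execution of: {prompt}"

def rollout(task: str, tried: list) -> str:
    # 1) Propose a high-level strategy at HIGH temperature, prompting the
    #    agent to avoid strategies it already tried in earlier rollouts.
    avoid = f" (avoid: {', '.join(tried)})" if tried else ""
    strategy = generate(f"Strategy for {task}{avoid}", temperature=1.2)
    tried.append(strategy)
    # 2) Execute the chosen strategy at LOW temperature.
    return generate(strategy, temperature=0.2)
```

Tracking `tried` across rollouts is what pushes the sampler toward genuinely new strategies rather than noisy variations of the same one.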
The main question in RL is: How can we encourage exploration? The first thought might be to increase the temperature of sampling. But we've observed that this just adds quite random noise, like clicking different positions on the same button (and sometimes misclicking). 🧵 2/6
New paper 🥳 RL relies a lot on an agent's capability to explore. Our strategy-guided exploration makes the agent find new solutions more efficiently. It learns faster, and in some environments its Pass@1 surpasses the base model's Pass@128. 🧵 1/6
arxiv.org/abs/2603.02045
Deciding which tokens are learnable and which are not is not just a question of loss. Check out our new foundational research paper on training small language models that remain factual
Correct. SelfReflect was the only resubmission. But I don't want to discourage anyone -- a lot of this success is due to being in this environment, with a great team of very principled scientists around. You should control for that variable when making comparisons 🤔
... as well as to our academic collaborators Deepro Choudhury, Tom Rainforth, Ning Miao, Freddie Bickford Smith, Luca Füger, @coallaoh.bsky.social
Huge thanks to our uncertainty nerds at Apple Machine Learning Research @sineadwilliamson.bsky.social @adamgol.bsky.social @preetumnakkiran.bsky.social Arwen Bradley, Arno Blaas, Eeshan Gunesh Dhekane, Eugene Ndiaye, Yizhe Zhang, Hadi Pouransari, David Grangier, C Thomas, @onceltuzel.bsky.social ...
[1] Semantic Calibration openreview.net/forum?id=0sC...
[2] SelfReflect openreview.net/forum?id=hOE...
[3] Bayesian Experimental Design for LLMs openreview.net/forum?id=qyy...
[4] Pretraining memories openreview.net/forum?id=XOu...
The fourth paper was a moonshot, building an LLM that stores different rare knowledge in different memory bank levels that are added to the parameters when needed [4]. Links to the papers are below, lmk if you wanna chat / have a talk about the insights between the lines :)
4 ICLR papers 🥳 There's an insightful story between them: If you sample LLMs multiple times, they are calibrated, even on higher levels [1], but they cannot talk about this uncertainty in a single prompt [2], so you have to help them out to gather information Bayes-optimally [3]
In our new work "Complete(d)P" we try to answer 3 questions about hyperparameter (HP) scaling:
❓ How to transfer across model size, tokens & batch size? ✔️ Complete(d)P
❓ Do per-module HPs matter? ✔️ 2x speed-ups possible
❓ Do they transfer to larger scale? ✔️ With the right parameterisation
If you want some holiday reflections: This is not just a blogpost, but an insight into the philosophy of one of the best scientific minds (and best humans, really) I had the honor to share a bit of my life with.
Our research team is hiring PhD interns 🥳 Spend your next summer in Paris and explore the next frontiers of LLMs for uncertainty quantification, calibration, RL and post-training, and Bayesian experimental design.
Details & Application ➡️ jobs.apple.com/en-my/detail...
📢 We're looking for a researcher in cogsci, neuroscience, linguistics, or related disciplines to work with us at Apple Machine Learning Research! We're hiring for a one-year interdisciplinary AIML Resident to work on understanding reasoning and decision making in LLMs. 🧵
We have been working with Michal Klein on pushing a module to train *flow matching* models using JAX. This is shipped as part of our new release of the OTT-JAX toolbox (github.com/ott-jax/ott)
The tutorial to do so is here: ott-jax.readthedocs.io/tutorials/ne...
It's that time of the year!
The Apple Machine Learning Research (MLR) team in Paris is hiring a few interns, to do cool research for ±6 months & work towards publications/OSS.
Check requirements and apply: ➡️ jobs.apple.com/en-us/detail...
More ✉️ mlr_paris_internships@group.apple.com
Memories complement RAG and can be combined for enhanced results. Post-hoc memory learning is possible (for Qwen, Gemma, etc), with more ablations in the paper.
This was spearheaded by Hadi Pouransari, with David Grangier, C Thomas, me, and Oncel Tuzel at the Apple Machine Learning Research team :)
Consider hypothetical hardware storing a bank with three memory levels:
Anchor model: 0.8GB @ RAM
Level 1: 39GB @ Flash
Level 2: 155GB @ External Disk
Level 3: 618GB @ Cloud
Total fetch time: 38ms (vs. 198ms for a single-level flat memory bank). [9/10]
💡 With hierarchical memories, deeper memories (capturing details) need a larger bank size but require fetching only a few parameters during inference, a great fit for the von Neumann storage hierarchy of small-fast to large-slow devices. See 👇. [8/10]
💡 Information access is controllable with memories.
Unlike typical architectures, the proposed memory bank setup enables controlled parametric knowledge access (e.g., for training data privacy). See the impact of memory bank blocking on performance here: [7/10]
💡 Memories capture long-tail knowledge.
For the text completion task "Atomic number of [element-name] is...", the baseline model (purple) has 17% accuracy for the least frequent elements in DCLM (last bucket). With only 10% added memory, accuracy improves to 83%. [6/10]
🤔 Which tasks benefit more from memory?
💡 Tasks requiring specific knowledge, like ARC and TriviaQA. Below are categorizations of common pretraining benchmarks based on their knowledge specificity and accuracy improvement when a 410M model is augmented with 10% memory. [5/10]
💡 Accuracy improves with larger fetched memory and total memory bank sizes.
A 160M anchor model, augmented with memories from 1M to 300M parameters, gains over 10 points in accuracy. Two curves show memory bank sizes of 4.6B and 18.7B parameters. [4/10]
🤔 Which parametric memories work best?
💡 We evaluate 1) FFN-memories (extending SwiGLU's internal dimension), 2) LoRA applied to various layers, and 3) Learnable KV. Larger memories perform better, with FFN-memories significantly outperforming others of the same size. [3/10]
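A minimal numpy sketch of what "extending SwiGLU's internal dimension" means, under the assumption that memory parameters are extra hidden units appended to the FFN; the variable names and shapes are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu(x, w_gate, w_up, w_down):
    """Standard SwiGLU FFN: (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h, m = 4, 8, 2                       # model dim, hidden dim, memory units
w_gate, w_up = rng.normal(size=(d, h)), rng.normal(size=(d, h))
w_down = rng.normal(size=(h, d))
# FFN-memory: m extra hidden units; down-projection starts at zero so the
# untrained memory does not perturb the anchor model's output.
m_gate, m_up = rng.normal(size=(d, m)), rng.normal(size=(d, m))
m_down = np.zeros((m, d))

x = rng.normal(size=(1, d))
base = swiglu(x, w_gate, w_up, w_down)
ext = swiglu(x,
             np.concatenate([w_gate, m_gate], axis=1),
             np.concatenate([w_up, m_up], axis=1),
             np.concatenate([w_down, m_down], axis=0))
```

With the zero-initialized down-projection, `ext` equals `base` exactly; the memory only changes the output once its parameters are trained.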
🤔 How to learn memories?
💡 We cluster the pretraining dataset into thousands of nested clusters, each assigned a memory block. During training, for a document, we optimize anchor model parameters and memory bank parameters for the document's matched clusters. [2/10]
LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain it along with a big bank of memories.
arxiv.org/abs/2510.02375
[1/10] 🧵
Our two phenomenal interns, Alireza Mousavi-Hosseini and Stephen Zhang @syz.bsky.social have been cooking some really cool work with Michal Klein and me over the summer.
Relying on optimal transport couplings (to pick noise and data pairs) should, in principle, help guide flow matching.
🧵
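The idea of OT couplings can be shown in a stdlib-only sketch. This is a hedged illustration, not the OTT-JAX implementation: it brute-forces the minibatch assignment with a squared-distance cost, which is only feasible for tiny batches (real pipelines would use a Hungarian solver or Sinkhorn, e.g. via OTT-JAX).

```python
import itertools

def ot_pairing(noise, data):
    """Reorder `data` so each noise sample is matched to a data sample,
    minimizing the total squared distance of the pairing (minibatch OT)."""
    n = len(noise)

    def total_cost(perm):
        return sum(
            sum((a - b) ** 2 for a, b in zip(noise[i], data[perm[i]]))
            for i in range(n))

    best = min(itertools.permutations(range(n)), key=total_cost)
    return [data[j] for j in best]

# Nearby noise/data points get paired, which tends to give straighter
# probability paths for flow matching than random pairing:
paired = ot_pairing([[0.0], [10.0]], [[9.0], [1.0]])  # -> [[1.0], [9.0]]
```

Here random pairing could match 0 with 9 and 10 with 1 (long crossing paths), whereas the OT coupling matches 0 with 1 and 10 with 9.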