A bridge builder doesn't need to denounce every evil in the world to be moral, but they better say something about the guy who keeps building bridges that topple over.
Huge thanks to my coauthors @thomasbrox.bsky.social @rajatsahay.bsky.social @simonschrodi.bsky.social @yumnaali.bsky.social, Cordelia Schmid, and Volker Fischer, without whom this work wouldn't have been realized.
If we want generative models that reason over combinations of concepts, not just produce aesthetically pleasing media, then the choice of objective and conditioning matters.
Project page/paper (figures and details): lmb-freiburg.github.io/gen-comp-gen...
Paper: arxiv.org/abs/2510.03075
We validate these trends across Shapes2D/3D, CelebA, and world models (CLEVRER, CoVLA), and see the same pattern:
continuous objectives + informative conditioning → robust compositional generalization.
(We even see early signs in language.)
What happens?
Compositional performance improves markedly.
Internal representations become more disentangled: reduced polysemanticity and less neuron overlap between concepts.
In other words, a continuous JEPA objective can inject compositional structure into a discrete model.
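As a toy illustration (not the exact metric from the paper), here is one way to quantify neuron overlap between concepts; the thresholding rule and Jaccard overlap are assumptions made for this sketch:

```python
# Toy sketch (not the exact metric from the paper) of quantifying neuron overlap
# between concepts: call a neuron "selective" for a concept if its mean
# activation on that concept exceeds a threshold, then measure how much the
# selective sets of different concepts overlap (Jaccard). Random activations
# give heavily overlapping, i.e. polysemantic, neuron sets.
import numpy as np

def concept_neuron_sets(acts_by_concept, threshold=0.5):
    """acts_by_concept: dict concept -> activations of shape (num_samples, num_neurons)."""
    return {c: set(np.where(acts.mean(axis=0) > threshold)[0])
            for c, acts in acts_by_concept.items()}

def pairwise_overlap(neuron_sets):
    """Jaccard overlap of selective-neuron sets for every pair of concepts."""
    concepts = list(neuron_sets)
    overlaps = {}
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            inter = neuron_sets[a] & neuron_sets[b]
            union = neuron_sets[a] | neuron_sets[b]
            overlaps[(a, b)] = len(inter) / max(len(union), 1)
    return overlaps

# Toy usage: random activations for three concepts over 256 neurons.
rng = np.random.default_rng(0)
acts = {c: rng.random((128, 256)) for c in ["red", "square", "large"]}
print(pairwise_overlap(concept_neuron_sets(acts)))
```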
Discrete objectives (e.g., MaskGIT) are still attractive: fast, and ubiquitous in LLM-style training.
Based on these findings, can we keep discrete outputs and still get compositionality?
We add a JEPA-like continuous auxiliary loss to MaskGIT, supervising intermediate representations in continuous space.
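For intuition, a minimal PyTorch sketch of what such a combined objective can look like; the function name, shapes, and loss weighting are illustrative placeholders, not our implementation:

```python
# Minimal PyTorch sketch of the idea (function name, shapes, and loss weight are
# illustrative placeholders, not our implementation): keep the discrete MaskGIT
# token loss and add a JEPA-like auxiliary loss that pulls intermediate hidden
# states toward continuous target features from a (frozen) encoder.
import torch
import torch.nn.functional as F

def maskgit_with_jepa_aux(token_logits, target_tokens, mask,
                          hidden_states, target_features, aux_weight=0.5):
    """
    token_logits:    (B, N, vocab) predictions for the token at each position
    target_tokens:   (B, N)        ground-truth token ids
    mask:            (B, N) bool   which positions were masked out
    hidden_states:   (B, N, D)     intermediate transformer representations
    target_features: (B, N, D)     continuous targets from a frozen encoder
    """
    # Standard discrete MaskGIT objective: cross-entropy on masked positions.
    ce = F.cross_entropy(token_logits[mask], target_tokens[mask])
    # JEPA-like auxiliary objective: regress intermediate representations onto
    # the continuous targets (smooth L1 in feature space, as one simple choice).
    aux = F.smooth_l1_loss(hidden_states[mask], target_features[mask])
    return ce + aux_weight * aux

# Toy call showing the shapes involved.
B, N, V, D = 2, 16, 1024, 64
logits = torch.randn(B, N, V, requires_grad=True)
hidden = torch.randn(B, N, D, requires_grad=True)
loss = maskgit_with_jepa_aux(logits, torch.randint(0, V, (B, N)),
                             torch.rand(B, N) < 0.5,
                             hidden, torch.randn(B, N, D))
loss.backward()
```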
Conditioning is critical.
If it's quantized or incomplete (factors missing in training), compositionality becomes fragile or fails, even if all factors are given at inference.
Access to the true generative factors is essential for continuous models to generalize compositionally.
The bottleneck is fundamental and surprisingly common in recent models:
Is the objective operating in continuous or discrete space?
Across controlled comparisons, continuous-valued outputs unlock compositionality, while discrete/categorical objectives consistently lag behind.
The tokenizer isnβt the main story.
DiT can reach similar compositionality with VAE or VQ-VAE. The learning curve differs (gradual vs abrupt), but both get there.
Tokenizers mostly affect efficiency + stability, not whether compositionality is possible.
Setup: train on only a subset of factor combinations (e.g., gender×hair×smile), holding out some compositions.
Then we generate + probe (a minimal sketch of the split follows below):
🟦 Seen (blue)
🟪 Level-1: change 1 factor (pink)
🟥 Level-2: change 2 factors (hardest/most novel) (red)
Shapes2D probes (shape×color×size)
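A minimal sketch of the split and probe levels; the factor values, the canonical-combination rule, and the Hamming-distance grouping are illustrative assumptions, not the exact splits used in the paper:

```python
# Illustrative sketch: train on a subset of factor combinations, then group
# held-out combinations by their minimum Hamming distance to the training set.
# Level-1: the closest seen combination differs in one factor; Level-2: in two.
from itertools import product

factors = {
    "shape": ["circle", "square", "triangle"],
    "color": ["red", "green", "blue"],
    "size":  ["small", "large"],
}
all_combos = list(product(*factors.values()))

def hamming(a, b):
    """Number of factors on which two combinations differ."""
    return sum(x != y for x, y in zip(a, b))

# Placeholder training split: all combinations within one factor change of a
# canonical combination.
canonical = ("circle", "red", "small")
train = [c for c in all_combos if hamming(c, canonical) <= 1]
held_out = [c for c in all_combos if c not in train]

def novelty_level(combo):
    """Minimum Hamming distance from a held-out combination to the training set."""
    return min(hamming(combo, seen) for seen in train)

for level in (1, 2):
    group = [c for c in held_out if novelty_level(c) == level]
    print(f"Level-{level}: {len(group)} held-out combinations, e.g. {group[:2]}")
```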
We study 3 axes that span most modern generative models without confounders (see the sketch after the list):
1️⃣ Tokenizer (VAE vs VQ)
2️⃣ Modelling & objective (diffusion vs masked autoregressive, continuous vs discrete)
3️⃣ Conditioning
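For concreteness, a toy enumeration of the resulting design grid; the labels are shorthand stand-ins, not the exact configurations trained in the paper:

```python
# Toy enumeration of the design grid spanned by the three axes (labels are
# shorthand stand-ins, not the exact configurations trained in the paper).
from itertools import product

tokenizers   = ["VAE", "VQ-VAE"]
objectives   = ["diffusion (continuous)", "masked autoregressive (discrete)"]
conditioning = ["continuous factors", "quantized factors", "incomplete factors"]

for tok, obj, cond in product(tokenizers, objectives, conditioning):
    print(f"tokenizer={tok:6s} | objective={obj:32s} | conditioning={cond}")
```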
Interventions: given our findings, can we fix non-compositional models?
In our new work, we ask a simple question:
Which design choices actually enable (or prevent) compositional generalization?
We study this in a controlled setting across visual modalities, cutting down the search space for anyone training or using these models.
Generalization is the goal. A core piece is compositional generalization: recombining known concepts into new combinations.
It's central to human intelligence, but we still don't know what drives or hinders it in generative models, and today's design choices are not driven by it.
Seriously, what is the goal of today's visual generative models?
Are pretty videos/images and low FIDs enough, or should we also demand something closer to human-like creativity? Our paper tries to answer this question 🧵
There are similarities between JEPAs and PFNs. In JEPAs, synthetic data is generated through learning. Notably, random weights can already perform well on downstream tasks, suggesting that the learning process induces useful operations on which you can do predictive coding.
Idk, but maybe not necessarily: we observe discrete tokens, but the language states themselves can live in a continuous world.
Generative models that assume the underlying distribution is continuous, for example, flow matching and common diffusion models.
I really hope someone can revive continuous models for language. They've taken over the visual domain by far, but getting them to work in language still feels like pure alchemy.
Excited to release our models and preprint: "Using Knowledge Graphs to harvest datasets for efficient CLIP model training"
We propose a dataset collection method using knowledge graphs and web image search, and create EntityNet-33M: a dataset of 33M images paired with 46M texts.
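As a rough illustration of the harvesting idea (not our pipeline): the sketch below assumes Wikidata's public SPARQL endpoint as the knowledge graph and only prints the image-search queries one would issue per entity; the endpoint, placeholder QID, and query template are assumptions made for this sketch.

```python
# Rough sketch of the harvesting idea, not the actual pipeline: it assumes
# Wikidata's public SPARQL endpoint as the knowledge graph (an assumption for
# this sketch) and only prints the image-search queries one would issue.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_CLASS = "Q39367"  # placeholder QID; replace with the entity class you want to harvest

query = f"""
SELECT ?itemLabel WHERE {{
  ?item wdt:P31 wd:{ENTITY_CLASS} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 20
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": query, "format": "json"},
                    headers={"User-Agent": "entity-harvest-sketch/0.1"},
                    timeout=30)
labels = [row["itemLabel"]["value"] for row in resp.json()["results"]["bindings"]]

# Each entity label becomes a web image-search query; downloaded images and
# their associated texts would then be paired into (image, text) training data.
for label in labels:
    print(f"image search: {label} photo")
```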
Over the past year, my lab has been working on fleshing out theory + applications of the Platonic Representation Hypothesis.
Today I want to share two new works on this topic:
Eliciting higher alignment: arxiv.org/abs/2510.02425
Unpaired learning of unified reps: arxiv.org/abs/2510.08492
1/9
Orbis shows that the objective matters.
Continuous modeling yields more stable and generalizable world models, yet true probabilistic coverage remains a challenge.
Immensely grateful to my co-authors @arianmousakhan.bsky.social, Sudhanshu Mittal, and Silvio Galesso, and to @thomasbrox.bsky.social
Under the hood 🔧
Orbis uses a hybrid tokenizer with semantic + detail tokens that work in both continuous and discrete spaces.
The world model then predicts the next frame by gradually denoising or unmasking it, using past frames as context.
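A toy PyTorch sketch of that prediction step (the module, shapes, and schedule below are placeholders, not the Orbis architecture): start from noise and iteratively refine the next frame's continuous latent, conditioned on latents of past frames.

```python
# Toy PyTorch sketch of the idea described above (shapes and module are
# placeholders, not the Orbis implementation): predict the next frame's
# continuous latent by iterative denoising, conditioned on past-frame latents.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the clean next-frame latent from a noisy one plus past context."""
    def __init__(self, dim=64, ctx_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * (ctx_frames + 1), 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, noisy_next, past):
        # past: (B, ctx_frames, dim) -> flatten and concatenate with the noisy latent
        return self.net(torch.cat([past.flatten(1), noisy_next], dim=-1))

@torch.no_grad()
def rollout_step(model, past, steps=8, dim=64):
    """Start from pure noise and gradually move toward the predicted latent."""
    x = torch.randn(past.size(0), dim)
    for t in range(steps):
        pred = model(x, past)
        alpha = (t + 1) / steps          # simple linear schedule, for illustration only
        x = (1 - alpha) * x + alpha * pred
    return x

model = ToyDenoiser()
past_latents = torch.randn(2, 4, 64)     # 4 past frames of continuous latents
next_latent = rollout_step(model, past_latents)
print(next_latent.shape)                 # torch.Size([2, 64])
```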
Realistic and Diverse Rollouts (video attachments 1–4)