A bridge builder doesn't need to denounce every evil in the world to be moral, but they better say something about the guy who keeps building bridges that topple over.
Huge thanks to my coauthors @thomasbrox.bsky.social @rajatsahay.bsky.social @simonschrodi.bsky.social @yumnaali.bsky.social, Cordelia Schmid, and Volker Fischer, without whom this work wouldn't have been realized.
If we want generative models that reason over combinations of concepts, not just produce aesthetically pleasing media, then the choice of objective and conditioning matters.
Project page/paper (figures and details): lmb-freiburg.github.io/gen-comp-gen...
Paper: arxiv.org/abs/2510.03075
We validate these trends across Shapes2D/3D, CelebA, and world models (CLEVRER, CoVLA), and see the same pattern:
continuous objectives + informative conditioning → robust compositional generalization.
(We even see early signs in language.)
What happens?
Compositional performance improves markedly.
Internal representations become more disentangled: reduced polysemanticity and less neuron overlap between concepts.
In other words, a continuous JEPA objective can inject compositional structure into a discrete model.
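As a toy illustration (not the exact metric from the paper), here is one way to quantify neuron overlap between concepts; the thresholding rule and Jaccard overlap are assumptions made for this sketch:

```python
# Toy sketch (not the exact metric from the paper) of quantifying neuron overlap
# between concepts: call a neuron "selective" for a concept if its mean
# activation on that concept exceeds a threshold, then measure how much the
# selective sets of different concepts overlap (Jaccard). Random activations
# give heavily overlapping, i.e. polysemantic, neuron sets.
import numpy as np

def concept_neuron_sets(acts_by_concept, threshold=0.5):
    """acts_by_concept: dict concept -> activations of shape (num_samples, num_neurons)."""
    return {c: set(np.where(acts.mean(axis=0) > threshold)[0])
            for c, acts in acts_by_concept.items()}

def pairwise_overlap(neuron_sets):
    """Jaccard overlap of selective-neuron sets for every pair of concepts."""
    concepts = list(neuron_sets)
    overlaps = {}
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            inter = neuron_sets[a] & neuron_sets[b]
            union = neuron_sets[a] | neuron_sets[b]
            overlaps[(a, b)] = len(inter) / max(len(union), 1)
    return overlaps

# Toy usage: random activations for three concepts over 256 neurons.
rng = np.random.default_rng(0)
acts = {c: rng.random((128, 256)) for c in ["red", "square", "large"]}
print(pairwise_overlap(concept_neuron_sets(acts)))
```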
Discrete objectives (e.g., MaskGIT) are still attractive: fast, and ubiquitous in LLM-style training.
Based on these findings, can we keep discrete outputs and still get compositionality?
We add a JEPA-like continuous auxiliary loss to MaskGIT, supervising intermediate representations in continuous space.
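For intuition, a minimal PyTorch sketch of what such a combined objective can look like; the function name, shapes, and loss weighting are illustrative placeholders, not our implementation:

```python
# Minimal PyTorch sketch of the idea (function name, shapes, and loss weight are
# illustrative placeholders, not our implementation): keep the discrete MaskGIT
# token loss and add a JEPA-like auxiliary loss that pulls intermediate hidden
# states toward continuous target features from a (frozen) encoder.
import torch
import torch.nn.functional as F

def maskgit_with_jepa_aux(token_logits, target_tokens, mask,
                          hidden_states, target_features, aux_weight=0.5):
    """
    token_logits:    (B, N, vocab) predictions for the token at each position
    target_tokens:   (B, N)        ground-truth token ids
    mask:            (B, N) bool   which positions were masked out
    hidden_states:   (B, N, D)     intermediate transformer representations
    target_features: (B, N, D)     continuous targets from a frozen encoder
    """
    # Standard discrete MaskGIT objective: cross-entropy on masked positions.
    ce = F.cross_entropy(token_logits[mask], target_tokens[mask])
    # JEPA-like auxiliary objective: regress intermediate representations onto
    # the continuous targets (smooth L1 in feature space, as one simple choice).
    aux = F.smooth_l1_loss(hidden_states[mask], target_features[mask])
    return ce + aux_weight * aux

# Toy call showing the shapes involved.
B, N, V, D = 2, 16, 1024, 64
logits = torch.randn(B, N, V, requires_grad=True)
hidden = torch.randn(B, N, D, requires_grad=True)
loss = maskgit_with_jepa_aux(logits, torch.randint(0, V, (B, N)),
                             torch.rand(B, N) < 0.5,
                             hidden, torch.randn(B, N, D))
loss.backward()
```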
Conditioning is critical.
If it's quantized or incomplete (factors missing in training), compositionality becomes fragile or fails, even if all factors are given at inference.
Access to the true generative factors is essential for continuous models to generalize compositionally.
The bottleneck is fundamental and surprisingly common in recent models:
Is the objective operating in continuous or discrete space?
Across controlled comparisons, continuous-valued outputs unlock compositionality, while discrete/categorical objectives consistently lag behind.
The tokenizer isnβt the main story.
DiT can reach similar compositionality with VAE or VQ-VAE. The learning curve differs (gradual vs abrupt), but both get there.
Tokenizers mostly affect efficiency + stability, not whether compositionality is possible.
Setup: train on only a subset of factor combinations (e.g., gender×hair×smile), holding out some compositions.
Then we generate + probe (a minimal sketch of the split follows below):
🟦 Seen (blue)
🟪 Level-1: change 1 factor (pink)
🟥 Level-2: change 2 factors (hardest/most novel) (red)
Shapes2D probes (shape×color×size)
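A minimal sketch of the split and probe levels; the factor values, the canonical-combination rule, and the Hamming-distance grouping are illustrative assumptions, not the exact splits used in the paper:

```python
# Illustrative sketch: train on a subset of factor combinations, then group
# held-out combinations by their minimum Hamming distance to the training set.
# Level-1: the closest seen combination differs in one factor; Level-2: in two.
from itertools import product

factors = {
    "shape": ["circle", "square", "triangle"],
    "color": ["red", "green", "blue"],
    "size":  ["small", "large"],
}
all_combos = list(product(*factors.values()))

def hamming(a, b):
    """Number of factors on which two combinations differ."""
    return sum(x != y for x, y in zip(a, b))

# Placeholder training split: all combinations within one factor change of a
# canonical combination.
canonical = ("circle", "red", "small")
train = [c for c in all_combos if hamming(c, canonical) <= 1]
held_out = [c for c in all_combos if c not in train]

def novelty_level(combo):
    """Minimum Hamming distance from a held-out combination to the training set."""
    return min(hamming(combo, seen) for seen in train)

for level in (1, 2):
    group = [c for c in held_out if novelty_level(c) == level]
    print(f"Level-{level}: {len(group)} held-out combinations, e.g. {group[:2]}")
```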
We study 3 axes that span most modern generative models without confounders (see the sketch after the list):
1️⃣ Tokenizer (VAE vs VQ)
2️⃣ Modelling & objective (diffusion vs masked autoregressive, continuous vs discrete)
3️⃣ Conditioning
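For concreteness, a toy enumeration of the resulting design grid; the labels are shorthand stand-ins, not the exact configurations trained in the paper:

```python
# Toy enumeration of the design grid spanned by the three axes (labels are
# shorthand stand-ins, not the exact configurations trained in the paper).
from itertools import product

tokenizers   = ["VAE", "VQ-VAE"]
objectives   = ["diffusion (continuous)", "masked autoregressive (discrete)"]
conditioning = ["continuous factors", "quantized factors", "incomplete factors"]

for tok, obj, cond in product(tokenizers, objectives, conditioning):
    print(f"tokenizer={tok:6s} | objective={obj:32s} | conditioning={cond}")
```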
Interventions: given our findings, can we fix non-compositional models?
In our new work, we ask a simple question:
Which design choices actually enable (or prevent) compositional generalization?
We study this in a controlled setting across visual modalities, cutting down the search space for anyone training or using these models.
Generalization is the goal. A core piece is compositional generalization: recombining known concepts into new combinations.
It's central to human intelligence, but we still don't know what drives or hinders it in generative models, and today's design choices are not driven by it.
Seriously, what is the goal of today's visual generative models?
Are pretty videos/images and low FIDs enough, or should we also demand something closer to human-like creativity? Our paper tries to answer this question 🧵
There are similarities between JEPAs and PFNs. In JEPAs, synthetic data is generated through learning. Notably, random weights can already perform well on downstream tasks, suggesting that the learning process induces useful operations on which you can do predictive coding.
Idk, but maybe not necessarily: we observe discrete tokens, but the language states themselves can live in a continuous world.
Generative models that assume the underlying distribution is continuous, for example, flow matching and common diffusion models.
I really hope someone can revive continuous models for language. They've taken over the visual domain by far, but getting them to work in language still feels like pure alchemy.
Excited to release our models and preprint: "Using Knowledge Graphs to harvest datasets for efficient CLIP model training"
We propose a dataset collection method using knowledge graphs and web image search, and create EntityNet-33M: a dataset of 33M images paired with 46M texts.
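As a rough illustration of the harvesting idea (not our pipeline): the sketch below assumes Wikidata's public SPARQL endpoint as the knowledge graph and only prints the image-search queries one would issue per entity; the endpoint, placeholder QID, and query template are assumptions made for this sketch.

```python
# Rough sketch of the harvesting idea, not the actual pipeline: it assumes
# Wikidata's public SPARQL endpoint as the knowledge graph (an assumption for
# this sketch) and only prints the image-search queries one would issue.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_CLASS = "Q39367"  # placeholder QID; replace with the entity class you want to harvest

query = f"""
SELECT ?itemLabel WHERE {{
  ?item wdt:P31 wd:{ENTITY_CLASS} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 20
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": query, "format": "json"},
                    headers={"User-Agent": "entity-harvest-sketch/0.1"},
                    timeout=30)
labels = [row["itemLabel"]["value"] for row in resp.json()["results"]["bindings"]]

# Each entity label becomes a web image-search query; downloaded images and
# their associated texts would then be paired into (image, text) training data.
for label in labels:
    print(f"image search: {label} photo")
```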
Over the past year, my lab has been working on fleshing out theory + applications of the Platonic Representation Hypothesis.
Today I want to share two new works on this topic:
Eliciting higher alignment: arxiv.org/abs/2510.02425
Unpaired learning of unified reps: arxiv.org/abs/2510.08492
1/9
Orbis shows that the objective matters.
Continuous modeling yields more stable and generalizable world models, yet true probabilistic coverage remains a challenge.
Immensely grateful to my co-authors @arianmousakhan.bsky.social, Sudhanshu Mittal, and Silvio Galesso, and to @thomasbrox.bsky.social
Under the hood 🔧
Orbis uses a hybrid tokenizer with semantic + detail tokens that work in both continuous and discrete spaces.
The world model then predicts the next frame by gradually denoising or unmasking it, using past frames as context.
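A toy PyTorch sketch of that prediction step (the module, shapes, and schedule below are placeholders, not the Orbis architecture): start from noise and iteratively refine the next frame's continuous latent, conditioned on latents of past frames.

```python
# Toy PyTorch sketch of the idea described above (shapes and module are
# placeholders, not the Orbis implementation): predict the next frame's
# continuous latent by iterative denoising, conditioned on past-frame latents.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the clean next-frame latent from a noisy one plus past context."""
    def __init__(self, dim=64, ctx_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * (ctx_frames + 1), 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, noisy_next, past):
        # past: (B, ctx_frames, dim) -> flatten and concatenate with the noisy latent
        return self.net(torch.cat([past.flatten(1), noisy_next], dim=-1))

@torch.no_grad()
def rollout_step(model, past, steps=8, dim=64):
    """Start from pure noise and gradually move toward the predicted latent."""
    x = torch.randn(past.size(0), dim)
    for t in range(steps):
        pred = model(x, past)
        alpha = (t + 1) / steps          # simple linear schedule, for illustration only
        x = (1 - alpha) * x + alpha * pred
    return x

model = ToyDenoiser()
past_latents = torch.randn(2, 4, 64)     # 4 past frames of continuous latents
next_latent = rollout_step(model, past_latents)
print(next_latent.shape)                 # torch.Size([2, 64])
```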
Realistic and Diverse Rollouts (video attachments 1–4)