DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Huge thanks to all my collaborators who made this project possible @hengkaipan.bsky.social, @yann-lecun.bsky.social, @lerrelpinto.com
We have open-sourced our code and data. For more details, check out the paper and website:
Website: dino-wm.github.io
arXiv: arxiv.org/abs/2411.04983
31.01.2025 19:24
Overall, DINO-WM takes a step toward bridging the gap between task-agnostic world modeling and downstream reasoning and control, offering promising prospects for generic world models in real-world applications.
The object- and spatial-understanding priors of DINOv2 features enable robust scene understanding, essential for navigation and manipulation tasks. With these priors, DINO-WM outperforms state-of-the-art world models by 45% in downstream task performance on our hardest tasks.
DINO-WM consists of:
1️⃣An out-of-the-box DINOv2 model as the observation model.
2️⃣A causal ViT as the predictor.
3️⃣An optional decoder, used only for visualization.
DINO-WM plans entirely in latent space, without the need to reconstruct pixel images.
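The three components above can be sketched as a latent rollout loop. This is a minimal toy sketch, not the released code: the shapes are hypothetical, and random linear maps stand in for the frozen DINOv2 encoder and the causal ViT predictor. What it does show faithfully is the interface — observations are encoded once into patch tokens, and all subsequent prediction happens on those tokens, never on pixels.

```python
import numpy as np

# Hypothetical shapes: patch-token latents, as in a ViT-S/14 on 224x224 input.
N_PATCHES, D_LATENT, D_ACTION = 196, 384, 2
IMG_PIXELS = 224 * 224 * 3

rng = np.random.default_rng(0)
# Stand-ins for the real networks (frozen DINOv2 encoder, causal ViT predictor).
W_enc = rng.standard_normal((IMG_PIXELS, D_LATENT)) * 0.01
W_dyn = rng.standard_normal((D_LATENT + D_ACTION, D_LATENT)) * 0.01

def encode(obs):
    """Observation model: image -> patch tokens (N_PATCHES, D_LATENT).
    A real implementation would call a frozen DINOv2; this tiles one projection."""
    z = obs.reshape(-1) @ W_enc
    return np.tile(z, (N_PATCHES, 1))

def predict(z, action):
    """Predictor: current patch tokens + action -> next-step patch tokens."""
    a = np.broadcast_to(action, (N_PATCHES, D_ACTION))
    return np.concatenate([z, a], axis=-1) @ W_dyn

def rollout(z0, actions):
    """Roll the dynamics forward entirely in latent space: no pixel decoding."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return z

obs = rng.standard_normal((224, 224, 3))
z_final = rollout(encode(obs), rng.standard_normal((5, D_ACTION)))
print(z_final.shape)  # (196, 384)
```

Because the rollout never leaves token space, the decoder really is optional: it is only needed if you want to look at what the model predicts, not to plan with it.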
Unlike previous works that couple world model learning with behavior learning, we train a dynamics-only model and infer actions only at test time. This allows zero-shot goal-reaching by reasoning through the dynamics—no expert demonstrations, no rewards, no online interactions.
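Test-time goal reaching of this kind is typically done with sampling-based model-predictive control over the learned dynamics. The sketch below is a deliberately simplified random-shooting variant on a toy linear latent dynamics — the dynamics model, shapes, and function names are all stand-ins, not DINO-WM's actual optimizer — but it illustrates the key point: the only objective is distance to the goal latent, so no reward model or demonstrations are needed.

```python
import numpy as np

# Toy latent dynamics standing in for a learned predictor over DINOv2 tokens.
D_LATENT, D_ACTION, HORIZON, N_SAMPLES = 8, 2, 10, 512
rng = np.random.default_rng(1)

A = np.eye(D_LATENT) * 0.9                           # toy latent drift
B = rng.standard_normal((D_ACTION, D_LATENT)) * 0.3  # toy action effect

def dynamics(z, a):
    return z @ A + a @ B

def plan(z0, z_goal):
    """Random-shooting MPC: sample action sequences, roll each out through the
    dynamics, keep the one whose final predicted latent is closest to the goal."""
    best_cost, best_seq = np.inf, None
    for _ in range(N_SAMPLES):
        seq = rng.standard_normal((HORIZON, D_ACTION))
        z = z0
        for a in seq:
            z = dynamics(z, a)
        cost = np.linalg.norm(z - z_goal)      # goal distance, not a reward
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

z0 = rng.standard_normal(D_LATENT)
z_goal = np.zeros(D_LATENT)
actions, cost = plan(z0, z_goal)
print(actions.shape, round(cost, 3))
```

Swapping the random sampler for a CEM-style refinement loop (refit the sampling distribution to the elite sequences and iterate) is the usual next step, but the zero-shot property comes from the objective, not the optimizer.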
Can we extend the power of world models beyond just online model-based learning? Absolutely!
We believe the true potential of world models lies in enabling agents to reason at test time.
Introducing DINO-WM: World Models on Pre-trained Visual Features for Zero-shot Planning.