To bridge this 2D-to-3D gap, we propose "Render-Localize-Lift" (see the sketch below):
- Render: render the 3D human/object meshes into multiview 2D images.
- Localize: a Multiview Localization (MV-Loc) model, guided by VLM tokens, predicts a 2D contact mask in each view.
- Lift: project the 2D contact masks back onto the 3D mesh surface.
(5/10)
15.06.2025 12:23
👍 1 · 🔁 1 · 💬 1 · 📌 0
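A minimal sketch of the Lift step, assuming hypothetical inputs: per-view 3x4 camera projection matrices, MV-Loc's predicted 2D masks, and per-view vertex visibility. The function name and the voting-based aggregation are illustrative assumptions, not InteractVLM's actual code:

```python
import numpy as np

def lift_contact(vertices, cameras, masks, visibilities):
    """Aggregate per-view 2D contact masks into per-vertex 3D contact scores.

    vertices:     (V, 3) mesh vertex positions
    cameras:      list of (3, 4) projection matrices, one per rendered view
    masks:        list of (H, W) binary 2D contact masks, one per view
    visibilities: list of (V,) bool arrays marking vertices visible in each view
    """
    V = vertices.shape[0]
    scores = np.zeros(V)
    counts = np.zeros(V)
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)  # homogeneous coords
    for P, mask, vis in zip(cameras, masks, visibilities):
        uvw = homo @ P.T                              # project vertices to the image plane
        uv = uvw[:, :2] / uvw[:, 2:3]                 # perspective divide
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, mask.shape[1] - 1)
        v = np.clip(np.round(uv[:, 1]).astype(int), 0, mask.shape[0] - 1)
        hit = mask[v, u].astype(bool) & vis           # visible vertex lands on a contact pixel
        scores += hit
        counts += vis
    # fraction of views in which a visible vertex voted "contact"
    return scores / np.maximum(counts, 1)
```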
How can we infer 3D contact with limited 3D data? InteractVLM exploits foundation models: a VLM & a localization model fine-tuned to reason about contact. Given an image & a prompt, the VLM outputs tokens that guide localization (sketch below). But these models work in 2D, while contact is 3D. (4/10)
15.06.2025 12:23
👍 1 · 🔁 1 · 💬 1 · 📌 0
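A minimal sketch of how a VLM token can guide 2D localization, in the spirit of LISA-style segmentation models: the VLM emits a special token whose hidden state conditions a mask decoder. The module name, dimensions, and dot-product decoding are assumptions for illustration, not InteractVLM's architecture:

```python
import torch
import torch.nn as nn

class TokenGuidedLocalizer(nn.Module):
    """Decode a per-pixel contact mask from image features and a VLM token."""

    def __init__(self, token_dim=4096, feat_dim=256):
        super().__init__()
        # map the VLM's hidden state into the vision feature space
        self.proj = nn.Linear(token_dim, feat_dim)

    def forward(self, image_feats, vlm_token):
        # image_feats: (B, C, H, W) from a vision backbone, with C == feat_dim
        # vlm_token:   (B, token_dim) hidden state of the VLM's localization token
        q = self.proj(vlm_token)                          # (B, C) query vector
        # dot product between the token query and every spatial feature
        logits = torch.einsum("bc,bchw->bhw", q, image_feats)
        return logits.sigmoid()                           # per-pixel contact probability
```

Since MV-Loc predicts a mask per rendered view, a decoder like this would run once per view before the masks are lifted to 3D.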
Why does 3D human-object reconstruction fail in the wild, or stay limited to a few object classes? A key missing piece is accurate 3D contact. InteractVLM (#CVPR2025) uses foundation models to infer contact on both humans & objects, improving reconstruction from a single image. (1/10)
15.06.2025 12:23
👍 5 · 🔁 2 · 💬 1 · 📌 0
📢 Short deadline extension (to 24/2): one more week left to submit your application!
16.02.2025 22:42
👍 6 · 🔁 2 · 💬 0 · 📌 0
Passionate about Human-centric Computer Vision? 📸🤖
We’re looking for motivated PhD candidates to join our dynamic team! 🚀
26.01.2025 17:54
👍 2 · 🔁 0 · 💬 0 · 📌 0