Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
- URL: http://arxiv.org/abs/2508.17437v2
- Date: Tue, 26 Aug 2025 16:57:07 GMT
- Title: Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
- Authors: Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, Lingjie Liu
- Abstract summary: PIXIE trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data.
- Score: 46.76145349237445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physics material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/
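The abstract describes a feed-forward network trained with purely supervised losses to map per-point 3D visual features (e.g., distilled CLIP features) to a material field. Below is a minimal, hypothetical sketch of such a predictor; the feature dimension, layer sizes, output heads (a discrete material class plus continuous parameters such as Young's modulus and Poisson's ratio), and loss weighting are illustrative assumptions based on the abstract, not the authors' implementation.

```python
# Minimal sketch (not the PIXIE codebase): a feed-forward material-field
# predictor trained with purely supervised losses. Inputs are assumed to be
# per-point 3D features distilled from a 2D backbone such as CLIP; labels are
# assumed to come from a paired dataset like PIXIEVERSE. All sizes are made up.
import torch
import torch.nn as nn


class MaterialFieldNet(nn.Module):
    def __init__(self, feature_dim: int = 768, num_materials: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.material_head = nn.Linear(256, num_materials)  # discrete material type
        self.param_head = nn.Linear(256, 2)                 # e.g. log Young's modulus, Poisson ratio

    def forward(self, point_features: torch.Tensor):
        h = self.backbone(point_features)                   # (N, 256)
        return self.material_head(h), self.param_head(h)


def supervised_loss(model, feats, mat_labels, param_labels, w_param: float = 1.0):
    """Purely supervised loss over a batch of points sampled from one or more scenes."""
    logits, params = model(feats)
    cls = nn.functional.cross_entropy(logits, mat_labels)   # material classification
    reg = nn.functional.mse_loss(params, param_labels)      # continuous physics parameters
    return cls + w_param * reg


# Toy usage with random stand-in data: per-point features and paired labels.
model = MaterialFieldNet()
feats = torch.randn(1024, 768)
mat_labels = torch.randint(0, 8, (1024,))
param_labels = torch.randn(1024, 2)
loss = supervised_loss(model, feats, mat_labels, param_labels)
loss.backward()
```

At inference time, such a network would be queried once per scene to produce a material field, which could then drive a physics simulator (e.g., MPM over Gaussian-splat particles) under external forces, as described in the abstract.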
Related papers
- TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos [7.616167860385134]
We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. By formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle.
arXiv Detail & Related papers (2025-08-13T13:43:01Z)
- FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity [15.375932203870594]
We aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors.
arXiv Detail & Related papers (2025-06-09T15:31:25Z)
- Learning 3D-Gaussian Simulators from RGB Videos [20.250137125726265]
3DGSim is a 3D simulator that learns physical interactions from multi-view RGB videos. It unifies 3D scene reconstruction, particle dynamics prediction, and video synthesis into an end-to-end trained framework.
arXiv Detail & Related papers (2025-03-31T12:33:59Z)
- Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video [58.043569985784806]
We introduce latent intuitive physics, a transfer learning framework for physics simulation.
It can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes.
We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation.
arXiv Detail & Related papers (2024-06-18T16:37:44Z)
- FLARE: Fast Learning of Animatable and Relightable Mesh Avatars [64.48254296523977]
Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems.
We introduce FLARE, a technique that enables the creation of animatable and relightable avatars from a single monocular video.
arXiv Detail & Related papers (2023-10-26T16:13:00Z)
- Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives [70.32817882783608]
We present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives.
Unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images.
We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points.
arXiv Detail & Related papers (2023-07-11T17:58:31Z)
- 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.