RealWonder: Real-Time Physical Action-Conditioned Video Generation
- URL: http://arxiv.org/abs/2603.05449v1
- Date: Thu, 05 Mar 2026 18:22:54 GMT
- Title: RealWonder: Real-Time Physical Action-Conditioned Video Generation
- Authors: Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu
- Abstract summary: We present RealWonder, the first real-time system for action-conditioned video generation from a single image. RealWonder integrates 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects.
- Score: 31.747349682347167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, as they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
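The abstract describes a three-stage pipeline (single-image 3D reconstruction, physics simulation, and a few-step distilled video generator) in which the generator is conditioned on simulation-rendered RGB and optical flow rather than on raw action vectors. The following Python sketch illustrates how such a loop could be wired together; it is not RealWonder's released code, and every name here (reconstruct_scene, PhysicsSimulator, DistilledVideoGenerator, interactive_rollout) is a hypothetical placeholder.

```python
# Minimal sketch of the pipeline described in the abstract: single-image 3D
# reconstruction -> physics simulation -> few-step distilled video generation.
# All names and signatures are hypothetical placeholders, not the authors' API.
import numpy as np

H, W = 480, 832          # output resolution reported in the abstract
FPS = 13.2               # interactive frame rate reported in the abstract


def reconstruct_scene(image):
    """Stage 1 (placeholder): build a simulatable 3D scene from one image."""
    return {"geometry": None, "appearance": image}


class PhysicsSimulator:
    """Stage 2 (placeholder): advance the scene under a user action and render
    the visual conditioning signals (coarse RGB + optical flow)."""

    def __init__(self, scene):
        self.scene = scene

    def step(self, action, dt=1.0 / FPS):
        self.scene = self._advance(self.scene, action, dt)
        rgb = np.zeros((H, W, 3), dtype=np.float32)   # coarse simulation render
        flow = np.zeros((H, W, 2), dtype=np.float32)  # per-pixel motion field
        return rgb, flow

    def _advance(self, scene, action, dt):
        # Rigid / deformable / fluid / granular dynamics would live here.
        return scene


class DistilledVideoGenerator:
    """Stage 3 (placeholder): refine the simulation render into a realistic
    frame using only a few diffusion steps (4 in the paper)."""

    def __init__(self, num_steps=4):
        self.num_steps = num_steps

    def generate(self, prev_frame, rgb_cond, flow_cond):
        frame = rgb_cond
        for _ in range(self.num_steps):
            frame = self._denoise(frame, prev_frame, flow_cond)
        return frame

    def _denoise(self, frame, prev_frame, flow_cond):
        return frame  # one distilled denoising step (placeholder)


def interactive_rollout(image, actions):
    """Single image + action stream in, action-conditioned frames out."""
    sim = PhysicsSimulator(reconstruct_scene(image))
    gen = DistilledVideoGenerator(num_steps=4)
    frame, frames = image, []
    for action in actions:
        rgb_cond, flow_cond = sim.step(action)
        frame = gen.generate(frame, rgb_cond, flow_cond)
        frames.append(frame)
    return frames
```

The design point the abstract emphasizes is that the video model never consumes continuous action encodings directly; it only sees the simulation-rendered RGB and flow, which is what lets a generic (distilled) video generator handle forces, robot actions, and camera controls across material types.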
Related papers
- MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation [25.78198969054392]
MotionPhysics is an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects.
arXiv Detail & Related papers (2026-01-01T22:56:37Z)
- Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow [21.658558775915267]
We introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories.
arXiv Detail & Related papers (2025-12-31T10:25:24Z)
- PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image [67.76547268461411]
PhysX-Anything is the first simulation-ready physical 3D generative framework. It produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets.
arXiv Detail & Related papers (2025-11-17T17:59:53Z)
- Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets [63.67760219308476]
We present Seed3D 1.0, a foundation model that generates simulation-ready 3D assets from single images. Unlike existing 3D generation models, our system produces assets with accurate geometry, well-aligned textures, and realistic physically-based materials.
arXiv Detail & Related papers (2025-10-22T18:16:32Z)
- PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation [53.06495362038348]
Existing generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. We introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos.
arXiv Detail & Related papers (2025-09-24T17:58:04Z)
- Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation [87.91642226587294]
Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data. We propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation. Our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
arXiv Detail & Related papers (2025-09-23T17:58:01Z)
- WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions [49.43000450846916]
WonderPlay is a framework integrating physics simulation and video generation. It generates action-conditioned dynamic 3D scenes from a single image. WonderPlay enables users to interact with various scenes of diverse content.
arXiv Detail & Related papers (2025-05-23T17:59:24Z)
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation [29.831214435147583]
We present PhysGen, a novel image-to-video generation method.
It produces a realistic, physically plausible, and temporally consistent video.
Our key insight is to integrate model-based physical simulation with a data-driven video generation process.
arXiv Detail & Related papers (2024-09-27T17:59:57Z)
- DreamPhysics: Learning Physics-Based 3D Dynamics with Video Diffusion Priors [75.83647027123119]
We propose to learn the physical properties of a material field with video diffusion priors. We then utilize a physics-based Material-Point-Method simulator to generate 4D content with realistic motions.
arXiv Detail & Related papers (2024-06-03T16:05:25Z)
- Learning 3D Particle-based Simulators from RGB-D Videos [15.683877597215494]
We propose a method for learning simulators directly from observations.
Visual Particle Dynamics (VPD) jointly learns a latent particle-based representation of 3D scenes.
Unlike existing 2D video prediction models, VPD's 3D structure enables scene editing and long-term predictions.
arXiv Detail & Related papers (2023-12-08T20:45:34Z)