Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow
- URL: http://arxiv.org/abs/2512.24766v1
- Date: Wed, 31 Dec 2025 10:25:24 GMT
- Title: Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow
- Authors: Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang
- Abstract summary: We introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation.
Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking.
Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories.
- Score: 21.658558775915267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative video modeling has emerged as a compelling tool for zero-shot reasoning about plausible physical interactions in open-world manipulation. Yet it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that, given an initial image and a task instruction, these models excel at synthesizing sensible object motions. We therefore introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular ones. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.
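As a rough illustration of the "manipulation as object trajectory tracking" formulation in the abstract, the minimal NumPy sketch below scores candidate action sequences against a target 3D object flow and picks the best by random shooting. The translation-only dynamics, point counts, and cost form are illustrative assumptions, not the authors' implementation (which uses trajectory optimization or reinforcement learning against a real simulator or robot).

```python
# Minimal sketch (not the authors' code) of manipulation as object trajectory
# tracking: score candidate action sequences by how closely a rollout of the
# object's points follows the 3D object flow reconstructed from a generated
# video. The translation-only dynamics is a stand-in for a physics simulator.
import numpy as np

def tracking_cost(target_flow, rollout):
    # target_flow, rollout: (T, N, 3) trajectories of N object points over T steps.
    return np.linalg.norm(rollout - target_flow, axis=-1).mean()

def rollout_points(points, actions):
    # Toy dynamics: each 3D action translates every object point for one step.
    states = []
    for a in actions:
        points = points + a
        states.append(points)
    return np.stack(states)  # (T, N, 3)

def random_shooting(points, target_flow, n_samples=512, seed=0):
    # Simplest trajectory optimizer: sample action sequences, keep the best.
    rng = np.random.default_rng(seed)
    horizon = target_flow.shape[0]
    best_actions, best_cost = None, np.inf
    for _ in range(n_samples):
        actions = rng.normal(scale=0.05, size=(horizon, 3))
        cost = tracking_cost(target_flow, rollout_points(points, actions))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

# Synthetic target flow: 8 object points sliding 0.02 m along +x each step.
rng = np.random.default_rng(1)
p0 = rng.normal(size=(8, 3))
target = p0[None] + np.cumsum(np.tile([0.02, 0.0, 0.0], (10, 1)), axis=0)[:, None, :]
actions, cost = random_shooting(p0, target)
print(f"best tracking cost: {cost:.4f}")
```

In practice the rollout would come from a physics simulator or the real system, and the optimizer would be stronger than random shooting (e.g., CEM or a learned policy), but the interface is the same: actions in, object-point trajectories out, tracking cost against the reconstructed flow.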
Related papers
- RealWonder: Real-Time Physical Action-Conditioned Video Generation [31.747349682347167]
We present RealWonder, the first real-time system for action-conditioned video generation from a single image.
RealWonder integrates 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps.
Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects.
arXiv Detail & Related papers (2026-03-05T18:22:54Z)
- VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification [65.15340059997273]
VHOI is a framework for creating realistic human-object interactions in video.
We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics.
Experiments demonstrate state-of-the-art results in controllable HOI video generation.
arXiv Detail & Related papers (2025-12-10T13:40:24Z)
- Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions [23.080971732537886]
Mimic2DM is a novel motion imitation framework that learns the control policy directly from 2D keypoint trajectories extracted from videos.
We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains.
arXiv Detail & Related papers (2025-12-09T11:30:56Z)
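As a toy companion to the Mimic2DM entry above, this sketch scores how well a simulated character's 2D keypoints match keypoints extracted from video; the Gaussian kernel and its scale are assumptions for illustration, not the paper's actual imitation objective.

```python
# Hedged sketch: a per-frame imitation reward that compares a simulated
# character's 2D keypoints against reference keypoints tracked in video.
# The exponential kernel and its scale are illustrative assumptions only.
import numpy as np

def keypoint_imitation_reward(sim_kps, ref_kps, sigma=0.05):
    # sim_kps, ref_kps: (K, 2) normalized image-space keypoints for one frame.
    err = np.linalg.norm(sim_kps - ref_kps, axis=-1).mean()
    return float(np.exp(-(err / sigma) ** 2))

# Example: the reward approaches 1 as the simulated pose matches the video pose.
ref = np.random.default_rng(0).random((17, 2))     # e.g. 17 COCO-style joints
print(keypoint_imitation_reward(ref + 0.01, ref))  # small offset -> high reward
```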
- NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos [13.84832813181084]
NovaFlow is an autonomous manipulation framework that converts a task description into an actionable plan for a target robot.
We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot.
arXiv Detail & Related papers (2025-10-09T17:59:55Z)
- Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation [87.91642226587294]
Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data.
We propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation.
Our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
arXiv Detail & Related papers (2025-09-23T17:58:01Z)
- ManipDreamer3D: Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory [56.06314177428745]
We present ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction.
Our method generates robotic videos with autonomously planned 3D trajectories, significantly reducing human intervention requirements.
arXiv Detail & Related papers (2025-08-29T10:39:06Z)
- 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model [40.730112146035076]
A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills.
Current robot datasets often record robot actions in different action spaces within simple scenes.
We learn a 3D flow world model from both human and robot manipulation data.
arXiv Detail & Related papers (2025-06-06T16:00:31Z)
- Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use an object-centric 3D motion field to represent actions for robot learning from human videos.
We present a novel framework for extracting this representation from videos for zero-shot control.
Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method.
arXiv Detail & Related papers (2025-06-04T17:59:06Z)
- Flow as the Cross-Domain Manipulation Interface [73.15952395641136]
Im2Flow2Act enables robots to acquire real-world manipulation skills without the need for real-world robot training data.
Im2Flow2Act comprises two components: a flow generation network and a flow-conditioned policy.
We demonstrate Im2Flow2Act's capabilities in a variety of real-world tasks, including the manipulation of rigid, articulated, and deformable objects.
arXiv Detail & Related papers (2024-07-21T16:15:02Z)
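To make the two-component split described in the Im2Flow2Act entry concrete, here is a minimal PyTorch sketch of the second component, a flow-conditioned policy; the MLP head, feature dimensions, and 7-DoF action output are assumptions for illustration, not the paper's actual architecture.

```python
# Hedged sketch of a flow-conditioned policy: the action head conditions on
# both current observation features and a planned flow (2D point tracks).
# Architecture, shapes, and the 7-DoF action space are illustrative only.
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    """Maps (observation features, flattened flow plan) -> one action."""
    def __init__(self, obs_dim=64, n_points=32, horizon=16, act_dim=7):
        super().__init__()
        flow_dim = n_points * horizon * 2  # an (x, y) track per point per step
        self.net = nn.Sequential(
            nn.Linear(obs_dim + flow_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs_feat, flow):
        # flow: (B, horizon, n_points, 2) image-space tracks from a generator
        return self.net(torch.cat([obs_feat, flow.flatten(1)], dim=-1))

policy = FlowConditionedPolicy()
obs = torch.randn(1, 64)
flow_plan = torch.randn(1, 16, 32, 2)  # stand-in for generated flow
action = policy(obs, flow_plan)        # (1, 7), e.g. an end-effector command
```

In a full pipeline, the flow generation network would produce flow_plan from the scene image and task, and the policy would run closed-loop with re-observed tracks.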
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
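The entry above follows a common recipe: learn a scene representation, fit a dynamics model over it, then plan. Below is a hedged one-step version of that recipe in PyTorch; all dimensions, the MLP, and the goal-distance objective are assumed for illustration, not the paper's model.

```python
# Hedged sketch: a one-step dynamics model over a learned scene representation,
# used for greedy action selection toward a goal embedding. Dimensions and the
# goal-distance objective are illustrative assumptions only.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts the next scene embedding from the current one and an action."""
    def __init__(self, z_dim=32, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, z_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def greedy_action(model, z, z_goal, n_candidates=128):
    # Sample candidate actions; keep the one whose predicted next embedding
    # lands closest to the goal (a one-step shooting planner).
    a = torch.randn(n_candidates, 4)
    with torch.no_grad():
        z_next = model(z.expand(n_candidates, -1), a)
        dist = (z_next - z_goal).norm(dim=-1)
    return a[dist.argmin()]

model = LatentDynamics()
z, z_goal = torch.randn(1, 32), torch.randn(1, 32)
best = greedy_action(model, z, z_goal)  # (4,) action for the next step
```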
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is reasoning about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)