NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
- URL: http://arxiv.org/abs/2510.08568v1
- Date: Thu, 09 Oct 2025 17:59:55 GMT
- Title: NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
- Authors: Hongyu Li, Lingfeng Sun, Yafei Hu, Duy Ta, Jennifer Barry, George Konidaris, Jiahui Fu
- Abstract summary: NovaFlow is an autonomous manipulation framework that converts a task description into an actionable plan for a target robot. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot.
- Score: 13.84832813181084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: https://novaflow.lhy.xyz/.
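One step of the pipeline is concrete enough to sketch from the abstract alone: converting the distilled 3D object flow into relative poses for rigid objects. Below is a minimal sketch of that step; the paper does not name its solver, so this assumes the standard Kabsch least-squares fit between corresponded flow points at two time steps.

```python
import numpy as np

def fit_rigid_transform(p_src: np.ndarray, p_dst: np.ndarray):
    """Least-squares rigid transform (R, t) with R @ p_src + t ~= p_dst.

    p_src, p_dst: (N, 3) corresponded 3D points, e.g. one object's
    tracked flow points at two time steps. Standard Kabsch solution;
    whether NovaFlow uses exactly this solver is an assumption.
    """
    c_src, c_dst = p_src.mean(axis=0), p_dst.mean(axis=0)
    H = (p_src - c_src).T @ (p_dst - c_dst)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Self-check on synthetic data: recover a known rotation + translation.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, 0.0, 0.2])
R_est, t_est = fit_rigid_transform(pts, pts @ R_true.T + t_true)
assert np.allclose(R_est, R_true) and np.allclose(t_est, t_true)
```

A sequence of such fits over the flow's time steps yields the relative-pose waypoints that the grasp proposal and trajectory optimizer would then realize.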
Related papers
- NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning [36.20611975009607]
We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning.
At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop.
To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses.
arXiv Detail & Related papers (2026-02-23T18:35:18Z)
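The summary above describes a two-level, closed-loop structure. A minimal sketch of that control flow, assuming a hypothetical `vlm` object with `decompose` and `verify` methods and injected `observe`/`execute` callables; none of this reflects NovaPlan's actual API.

```python
def closed_loop_plan(task, vlm, observe, execute, max_steps=20):
    """High-level loop: decompose into sub-goals, execute, verify, replan.

    The VLM both plans (decompose) and monitors (verify); low-level
    execution of each sub-goal would come from the keypoint- and
    hand-pose-based controller the summary mentions.
    """
    subgoals = vlm.decompose(task, observe())
    for _ in range(max_steps):
        if not subgoals:
            return True                       # all sub-goals achieved
        execute(subgoals[0])
        if vlm.verify(subgoals[0], observe()):
            subgoals = subgoals[1:]           # advance to the next sub-goal
        else:
            subgoals = vlm.decompose(task, observe())  # replan on failure
    return not subgoals
```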
- Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow [21.658558775915267]
We introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation.
Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking.
Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories.
arXiv Detail & Related papers (2025-12-31T10:25:24Z)
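Since the summary frames manipulation as object trajectory tracking, the objective can be sketched directly. A minimal version, assuming the target flow and achieved object points are both (T, N, 3) arrays; Dream2Flow's exact cost is not given in the summary, so plain mean squared tracking error is an assumption.

```python
import numpy as np

def flow_tracking_cost(achieved: np.ndarray, target_flow: np.ndarray) -> float:
    """Mean squared distance between achieved and target point trajectories."""
    return float(np.mean(np.sum((achieved - target_flow) ** 2, axis=-1)))

def pick_best_rollout(rollouts, target_flow):
    """Score candidate rollouts (each (T, N, 3)) and return the best index,
    as a sampling-based planner tracking the generated flow might do."""
    return int(np.argmin([flow_tracking_cost(r, target_flow) for r in rollouts]))
```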
- Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR).
PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining.
Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z)
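The summary gives only the high-level idea: an autoregressive video model rolled out so that each step also yields an action. A sketch of that rollout loop, with a hypothetical `model.step` interface that returns the next frame and an action; the joint frame/action head is an assumption about PAR's design.

```python
def autoregressive_rollout(model, frames, horizon):
    """Roll a video-pretrained autoregressive model forward, collecting
    one action per generated frame. `model.step` is hypothetical: it
    conditions on the frame history and returns (next_frame, action)."""
    actions = []
    for _ in range(horizon):
        next_frame, action = model.step(frames)
        frames = frames + [next_frame]   # grow the conditioning context
        actions.append(action)
    return actions
```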
- 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model [40.730112146035076]
A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills.
Current robot datasets often record robot actions in different action spaces within a simple scene.
We learn a 3D flow world model from both human and robot manipulation data.
arXiv Detail & Related papers (2025-06-06T16:00:31Z)
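The point of a 3D flow world model is that human and robot videos can supervise the same target, side-stepping mismatched action spaces. A minimal sketch of that uniform supervision signal, assuming tracks are given as (T, N, 3) point trajectories; the paper's actual target parameterization is not specified in the summary.

```python
import numpy as np

def flow_targets_from_tracks(tracks: np.ndarray) -> np.ndarray:
    """Per-step 3D displacements (T-1, N, 3) from tracked points (T, N, 3).

    The same function applies to human and robot videos alike, which is
    what makes the training signal uniform across embodiments.
    """
    return np.diff(tracks, axis=0)
```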
- Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use an object-centric 3D motion field to represent actions for robot learning from human videos.
We present a novel framework for extracting this representation from videos for zero-shot control.
Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method.
arXiv Detail & Related papers (2025-06-04T17:59:06Z)
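A minimal sketch of the representation itself, assuming corresponded 3D points at two times and an object mask; the paper's estimator is learned and more involved, so this only illustrates what an object-centric 3D motion field denotes.

```python
import numpy as np

def object_motion_field(points_t0, points_t1, object_mask):
    """Per-point 3D displacement, zeroed outside the manipulated object.

    points_t0, points_t1: (N, 3) corresponded points; object_mask: (N,) bool.
    Restricting the field to the object is what makes it object-centric,
    decoupling the action from any particular embodiment.
    """
    return np.where(object_mask[:, None], points_t1 - points_t0, 0.0)
```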
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos.
VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
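The summary does not spell out the affordance format; a common encoding for 3D affordances distilled from human videos is a contact point plus a post-contact trajectory, sketched below as a plain data structure. Treat the field layout as an assumption, not VidBot's definition.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Affordance3D:
    """A guessed 3D affordance layout: where to touch, then how to move."""
    contact_point: np.ndarray       # (3,) contact position in the camera frame
    post_contact_traj: np.ndarray   # (T, 3) end-effector waypoints after contact
```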
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
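Latent action pretraining needs action labels that come from the video itself. A minimal sketch of the core trick, assuming frame-pair transition embeddings are quantized against a codebook so each discrete code acts as a pseudo action label; LAPA's actual objective and codebook training are more involved.

```python
import numpy as np

def quantize_latent_action(transition: np.ndarray, codebook: np.ndarray) -> int:
    """Nearest-codebook index for a frame-to-frame transition embedding.

    transition: (D,) embedding of a (frame_t, frame_t+1) pair; codebook: (K, D).
    The returned index is a discrete 'latent action' usable as a
    pretraining target in place of ground-truth robot actions.
    """
    return int(np.argmin(np.linalg.norm(codebook - transition, axis=1)))
```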
- Flow as the Cross-Domain Manipulation Interface [73.15952395641136]
Im2Flow2Act enables robots to acquire real-world manipulation skills without the need for real-world robot training data.
Im2Flow2Act comprises two components: a flow generation network and a flow-conditioned policy.
We demonstrate Im2Flow2Act's capabilities in a variety of real-world tasks, including the manipulation of rigid, articulated, and deformable objects.
arXiv Detail & Related papers (2024-07-21T16:15:02Z)
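The two components named above compose into a simple interface. A sketch with illustrative signatures (the released code's API may differ): the flow generator proposes object flow from the scene and task, and the flow-conditioned policy maps that flow to actions.

```python
def im2flow2act_step(flow_generator, policy, image, task):
    """Flow as the interface: generate flow, then condition the policy on it.

    Because the policy consumes flow rather than raw demonstrations, the
    flow generator can be trained on non-robot data, which is how the
    need for real-world robot training data is avoided.
    """
    flow = flow_generator(image, task)   # predicted object flow, e.g. (T, N, 2)
    return policy(image, flow)           # robot actions conditioned on the flow
```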
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
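The summary describes combining learned track prediction with a residual policy. A minimal sketch of that combination; how the base action is derived from the predicted tracks (e.g. a rigid-transform fit, as sketched for NovaFlow above) and what the residual policy consumes are assumptions.

```python
import numpy as np

def act_with_residual(base_action: np.ndarray, residual_policy, obs) -> np.ndarray:
    """Track-derived base action plus a learned residual correction.

    The base action carries the generalizable, track-predicted motion;
    the residual policy corrects embodiment-specific errors on top of it.
    """
    return base_action + residual_policy(obs)
```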
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.