FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
- URL: http://arxiv.org/abs/2412.08261v2
- Date: Sun, 16 Feb 2025 03:13:51 GMT
- Title: FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao
- Abstract summary: We present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution.
- Score: 2.9509867426905925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP improves both the success rates and quality of long-horizon video plan synthesis and exhibits the interactive world model property, opening up wider applications for future work. Video demos are available on our website: https://nus-lins-lab.github.io/flipweb/.
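The abstract describes a planning loop: propose candidate image flows, roll them forward with the flow-conditioned video model, and score the imagined futures with the value model. The sketch below illustrates that loop as a simple beam search; the three module interfaces (propose_flows, rollout_video, value) and all hyperparameters are hypothetical stubs, not FLIP's actual API.

```python
# Minimal sketch of a FLIP-style beam search in visual space.
# The three modules (flow proposer, flow-conditioned video model, value model)
# are hypothetical stubs; FLIP's real modules are learned networks.
import random
from dataclasses import dataclass

@dataclass
class PlanNode:
    frames: list             # video plan so far (placeholder frame ids)
    flows: list              # image-flow actions taken so far
    ret: float = 0.0         # discounted return accumulated so far

def propose_flows(frame, goal, k):
    """Action-proposal module: sample k candidate image flows (stub)."""
    return [f"flow_{frame}_{i}" for i in range(k)]

def rollout_video(frame, flow):
    """Dynamics module: flow-conditioned video generation (stub)."""
    return f"{frame}->{flow}"

def value(frame, goal):
    """Value module: vision-language alignment score (stub)."""
    return random.random()

def flip_plan(init_frame, goal, horizon=5, beam=3, k=8, gamma=0.99):
    """Search for a long-horizon flow/video plan maximizing discounted return."""
    nodes = [PlanNode(frames=[init_frame], flows=[])]
    for t in range(horizon):
        candidates = []
        for node in nodes:
            for flow in propose_flows(node.frames[-1], goal, k):
                nxt = rollout_video(node.frames[-1], flow)
                ret = node.ret + (gamma ** t) * value(nxt, goal)
                candidates.append(PlanNode(node.frames + [nxt],
                                           node.flows + [flow], ret))
        # keep the best `beam` partial plans
        nodes = sorted(candidates, key=lambda n: n.ret, reverse=True)[:beam]
    return max(nodes, key=lambda n: n.ret)

plan = flip_plan("s0", "put the cup on the shelf")
print(plan.flows)   # flow plan; plan.frames forms the video plan
```

In the abstract's framing, the flows in the returned plan act as the general action representation, while the accumulated frames form the synthesized video plan.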
Related papers
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
- Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z)
- Motus: A Unified Latent Action World Model [31.62340897751899]
We propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation and real-world scenarios.
arXiv Detail & Related papers (2025-12-15T06:58:40Z)
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation [18.468025471225527]
MoWM is a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel-space model.
arXiv Detail & Related papers (2025-09-26T02:54:36Z)
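The MoWM summary above mentions latent-to-pixel feature modulation but leaves the mechanism unspecified; a FiLM-style scale-and-shift, sketched below, is one common way a high-level latent prior can gate pixel-space features. Everything here (shapes, weights, the modulation form) is an illustrative assumption, not MoWM's actual design.

```python
# Illustrative FiLM-style modulation: a motion-aware latent vector produces
# per-channel scale/shift that gates pixel-space features. This conditioning
# pattern is assumed for illustration; MoWM's exact mechanism may differ.
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(pixel_feats, latent, w_scale, w_shift):
    """pixel_feats: (C, H, W); latent: (D,); weights map latent -> C values."""
    scale = w_scale @ latent          # (C,)
    shift = w_shift @ latent          # (C,)
    return (1.0 + scale)[:, None, None] * pixel_feats + shift[:, None, None]

C, H, W, D = 16, 8, 8, 32
pixel_feats = rng.standard_normal((C, H, W))   # fine-grained pixel features
latent = rng.standard_normal(D)                # high-level motion-aware prior
w_scale = rng.standard_normal((C, D)) * 0.01
w_shift = rng.standard_normal((C, D)) * 0.01
out = film_modulate(pixel_feats, latent, w_scale, w_shift)
print(out.shape)  # (16, 8, 8)
```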
- Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z)
- Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis [12.160537328404622]
DRA-Ctrl provides new insights into reusing resource-intensive video models. DRA-Ctrl lays the foundation for future unified generative models across visual modalities.
arXiv Detail & Related papers (2025-05-29T10:34:45Z)
- Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. During generation, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
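The persistent-world-model entry above aggregates predicted RGB-D frames into a 3D map. One standard way to do that aggregation is pinhole unprojection of each depth map into world coordinates; the intrinsics, pose, and point-cloud map below are assumptions for illustration, not the paper's representation.

```python
# Sketch: aggregating predicted RGB-D frames into a persistent 3D point map
# by pinhole unprojection. Intrinsics, poses, and the point-cloud map are
# illustrative assumptions; the paper's map representation may differ.
import numpy as np

def unproject(depth, rgb, K, cam_to_world):
    """Lift an RGB-D frame (H, W) into world-frame points with colors."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])          # (4, H*W)
    pts_world = (cam_to_world @ pts_cam)[:3].T              # (H*W, 3)
    return pts_world, rgb.reshape(-1, 3)

# Persistent map = growing set of (points, colors) from each generated frame.
K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
pose = np.eye(4)
depth = np.full((64, 64), 2.0)                  # stand-in for predicted depth
rgb = np.zeros((64, 64, 3), dtype=np.uint8)     # stand-in for predicted RGB
map_pts, map_rgb = unproject(depth, rgb, K, pose)
print(map_pts.shape)  # (4096, 3)
```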
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning.
Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z)
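The DrivingGPT entry above describes a "multimodal driving language" of interleaved image and action tokens. The toy sketch below shows one plausible interleaving layout; the token ids, tokenizer, and per-frame counts are invented for illustration, not DrivingGPT's actual vocabulary.

```python
# Sketch of a "driving language": interleave image tokens and action tokens
# into one autoregressive sequence.
IMG_TOKENS_PER_FRAME = 4     # assume a VQ tokenizer emits 4 tokens per frame
ACT_TOKENS_PER_STEP = 2      # assume actions are discretized into 2 tokens

def interleave(frames, actions):
    """frames: list of per-frame token lists; actions: list of per-step lists."""
    seq = []
    for img_toks, act_toks in zip(frames, actions):
        seq.extend(img_toks)   # observe a frame ...
        seq.extend(act_toks)   # ... then emit the action taken from it
    return seq

frames = [[101, 102, 103, 104], [111, 112, 113, 114]]
actions = [[901, 902], [903, 904]]
print(interleave(frames, actions))
# [101, 102, 103, 104, 901, 902, 111, 112, 113, 114, 903, 904]
```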
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
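iVideoGPT's summary credits a compressive tokenization technique for discretizing high-dimensional observations. The snippet below sketches the generic vector-quantization step such tokenizers typically build on (nearest-codebook lookup); the codebook size, patching, and shapes are assumptions, not the paper's actual scheme.

```python
# Sketch of discretizing a high-dimensional observation with a codebook
# (vector quantization), as one illustration of "compressive tokenization".
import numpy as np

rng = np.random.default_rng(0)

def quantize(patches, codebook):
    """Map each patch vector to the id of its nearest codebook entry."""
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (P, K)
    return d.argmin(axis=1)                                          # (P,)

codebook = rng.standard_normal((512, 64))     # K=512 codes of dim 64 (assumed)
obs = rng.standard_normal((16, 64))           # observation as 16 patch vectors
tokens = quantize(obs, codebook)
print(tokens.shape, tokens[:4])               # 16 discrete tokens per frame
```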
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can serve as a good alternative to ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- Video Language Planning [137.06052217713054]
Video Language Planning (VLP) is an algorithm built around a tree search procedure, in which we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
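The Video Language Planning summary above outlines a tree search over vision-language policies, values, and text-to-video dynamics. Below is a minimal recursive version of such a search with all three models replaced by stubs; the interfaces and scoring are placeholders, not the authors' implementation.

```python
# Sketch of a VLP-style tree search: a (stub) VLM policy proposes language
# subgoals, a (stub) text-to-video model rolls each out, and a (stub) VLM
# value function ranks branches.
import random

def vlm_policy(obs, task, n=3):
    return [f"subgoal({task},{obs},{i})" for i in range(n)]   # text subgoals

def text_to_video(obs, subgoal):
    return f"video[{obs}|{subgoal}]"                          # imagined rollout

def vlm_value(obs, task):
    return random.random()                                    # task progress

def vlp_search(obs, task, depth):
    """Return (score, plan) where plan interleaves language and video steps."""
    if depth == 0:
        return vlm_value(obs, task), []
    best_score, best_plan = float("-inf"), []
    for sg in vlm_policy(obs, task):
        nxt = text_to_video(obs, sg)
        score, rest = vlp_search(nxt, task, depth - 1)
        if score > best_score:
            best_score, best_plan = score, [(sg, nxt)] + rest
    return best_score, best_plan

score, plan = vlp_search("obs0", "make coffee", depth=3)
for subgoal, clip in plan:
    print(subgoal, "->", clip)   # multimodal (video and language) plan
```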
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a foundation model that combines expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos.
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
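The hierarchy in the Compositional Foundation Models entry (LLM plan, video-diffusion grounding, inverse-dynamics action recovery) maps naturally onto a three-stage pipeline. The sketch below wires the stages together with stub functions standing in for the large pretrained models; step counts and frame counts are illustrative assumptions.

```python
# Sketch of the three-stage hierarchy: an LLM drafts a symbolic plan, a video
# diffusion model grounds each step visually, and an inverse dynamics model
# recovers actions from consecutive generated frames. All stubs.
def llm_plan(task):
    return [f"step {i}: part of '{task}'" for i in range(3)]  # symbolic plan

def video_diffusion(obs, step):
    return [f"{obs}+{step}#frame{t}" for t in range(4)]       # grounded video

def inverse_dynamics(frame, next_frame):
    return f"action({frame} -> {next_frame})"                 # recovered action

def hierarchical_control(obs, task):
    actions = []
    for step in llm_plan(task):                 # language level
        frames = video_diffusion(obs, step)     # visual grounding
        for f0, f1 in zip(frames, frames[1:]):  # motor level
            actions.append(inverse_dynamics(f0, f1))
        obs = frames[-1]
    return actions

for a in hierarchical_control("obs0", "set the table")[:3]:
    print(a)
```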
This list is automatically generated from the titles and abstracts of the papers on this site.