FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
- URL: http://arxiv.org/abs/2412.08261v2
- Date: Sun, 16 Feb 2025 03:13:51 GMT
- Title: FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao
- Abstract summary: We present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution.
- Score: 2.9509867426905925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP improves both the success rates and quality of long-horizon video plan synthesis and exhibits the interactive world model property, opening up wider applications for future work. Video demos are available on our website: https://nus-lins-lab.github.io/flipweb/.
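The abstract describes a planning loop: propose candidate image flows, roll them forward with the flow-conditioned video model, and score the imagined futures with the value model. The sketch below illustrates that loop as a simple beam search; the three module interfaces (propose_flows, rollout_video, value) and all hyperparameters are hypothetical stubs, not FLIP's actual API.

```python
# Minimal sketch of a FLIP-style beam search in visual space.
# The three modules (flow proposer, flow-conditioned video model, value model)
# are hypothetical stubs; FLIP's real modules are learned networks.
import random
from dataclasses import dataclass

@dataclass
class PlanNode:
    frames: list             # video plan so far (placeholder frame ids)
    flows: list              # image-flow actions taken so far
    ret: float = 0.0         # discounted return accumulated so far

def propose_flows(frame, goal, k):
    """Action-proposal module: sample k candidate image flows (stub)."""
    return [f"flow_{frame}_{i}" for i in range(k)]

def rollout_video(frame, flow):
    """Dynamics module: flow-conditioned video generation (stub)."""
    return f"{frame}->{flow}"

def value(frame, goal):
    """Value module: vision-language alignment score (stub)."""
    return random.random()

def flip_plan(init_frame, goal, horizon=5, beam=3, k=8, gamma=0.99):
    """Search for a long-horizon flow/video plan maximizing discounted return."""
    nodes = [PlanNode(frames=[init_frame], flows=[])]
    for t in range(horizon):
        candidates = []
        for node in nodes:
            for flow in propose_flows(node.frames[-1], goal, k):
                nxt = rollout_video(node.frames[-1], flow)
                ret = node.ret + (gamma ** t) * value(nxt, goal)
                candidates.append(PlanNode(node.frames + [nxt],
                                           node.flows + [flow], ret))
        # keep the best `beam` partial plans
        nodes = sorted(candidates, key=lambda n: n.ret, reverse=True)[:beam]
    return max(nodes, key=lambda n: n.ret)

plan = flip_plan("s0", "put the cup on the shelf")
print(plan.flows)   # flow plan; plan.frames forms the video plan
```

In the abstract's framing, the flows in the returned plan act as the general action representation, while the accumulated frames form the synthesized video plan.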
Related papers
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
- Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z)
- Motus: A Unified Latent Action World Model [31.62340897751899]
We propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation and real-world scenarios.
arXiv Detail & Related papers (2025-12-15T06:58:40Z)
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation [18.468025471225527]
MoWM is a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel-space model.
arXiv Detail & Related papers (2025-09-26T02:54:36Z)
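The MoWM summary above mentions latent-to-pixel feature modulation but leaves the mechanism unspecified; a FiLM-style scale-and-shift, sketched below, is one common way a high-level latent prior can gate pixel-space features. Everything here (shapes, weights, the modulation form) is an illustrative assumption, not MoWM's actual design.

```python
# Illustrative FiLM-style modulation: a motion-aware latent vector produces
# per-channel scale/shift that gates pixel-space features. This conditioning
# pattern is assumed for illustration; MoWM's exact mechanism may differ.
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(pixel_feats, latent, w_scale, w_shift):
    """pixel_feats: (C, H, W); latent: (D,); weights map latent -> C values."""
    scale = w_scale @ latent          # (C,)
    shift = w_shift @ latent          # (C,)
    return (1.0 + scale)[:, None, None] * pixel_feats + shift[:, None, None]

C, H, W, D = 16, 8, 8, 32
pixel_feats = rng.standard_normal((C, H, W))   # fine-grained pixel features
latent = rng.standard_normal(D)                # high-level motion-aware prior
w_scale = rng.standard_normal((C, D)) * 0.01
w_shift = rng.standard_normal((C, D)) * 0.01
out = film_modulate(pixel_feats, latent, w_scale, w_shift)
print(out.shape)  # (16, 8, 8)
```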
- Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z)
- Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis [12.160537328404622]
DRA-Ctrl provides new insights into reusing resource-intensive video models. DRA-Ctrl lays the foundation for future unified generative models across visual modalities.
arXiv Detail & Related papers (2025-05-29T10:34:45Z)
- Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. During generation, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
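The persistent-world-model entry above aggregates predicted RGB-D frames into a 3D map. One standard way to do that aggregation is pinhole unprojection of each depth map into world coordinates; the intrinsics, pose, and point-cloud map below are assumptions for illustration, not the paper's representation.

```python
# Sketch: aggregating predicted RGB-D frames into a persistent 3D point map
# by pinhole unprojection. Intrinsics, poses, and the point-cloud map are
# illustrative assumptions; the paper's map representation may differ.
import numpy as np

def unproject(depth, rgb, K, cam_to_world):
    """Lift an RGB-D frame (H, W) into world-frame points with colors."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])          # (4, H*W)
    pts_world = (cam_to_world @ pts_cam)[:3].T              # (H*W, 3)
    return pts_world, rgb.reshape(-1, 3)

# Persistent map = growing set of (points, colors) from each generated frame.
K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
pose = np.eye(4)
depth = np.full((64, 64), 2.0)                  # stand-in for predicted depth
rgb = np.zeros((64, 64, 3), dtype=np.uint8)     # stand-in for predicted RGB
map_pts, map_rgb = unproject(depth, rgb, K, pose)
print(map_pts.shape)  # (4096, 3)
```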
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning.
Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z)
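The DrivingGPT entry above describes a "multimodal driving language" of interleaved image and action tokens. The toy sketch below shows one plausible interleaving layout; the token ids, tokenizer, and per-frame counts are invented for illustration, not DrivingGPT's actual vocabulary.

```python
# Sketch of a "driving language": interleave image tokens and action tokens
# into one autoregressive sequence.
IMG_TOKENS_PER_FRAME = 4     # assume a VQ tokenizer emits 4 tokens per frame
ACT_TOKENS_PER_STEP = 2      # assume actions are discretized into 2 tokens

def interleave(frames, actions):
    """frames: list of per-frame token lists; actions: list of per-step lists."""
    seq = []
    for img_toks, act_toks in zip(frames, actions):
        seq.extend(img_toks)   # observe a frame ...
        seq.extend(act_toks)   # ... then emit the action taken from it
    return seq

frames = [[101, 102, 103, 104], [111, 112, 113, 114]]
actions = [[901, 902], [903, 904]]
print(interleave(frames, actions))
# [101, 102, 103, 104, 901, 902, 111, 112, 113, 114, 903, 904]
```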
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
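iVideoGPT's summary credits a compressive tokenization technique for discretizing high-dimensional observations. The snippet below sketches the generic vector-quantization step such tokenizers typically build on (nearest-codebook lookup); the codebook size, patching, and shapes are assumptions, not the paper's actual scheme.

```python
# Sketch of discretizing a high-dimensional observation with a codebook
# (vector quantization), as one illustration of "compressive tokenization".
import numpy as np

rng = np.random.default_rng(0)

def quantize(patches, codebook):
    """Map each patch vector to the id of its nearest codebook entry."""
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (P, K)
    return d.argmin(axis=1)                                          # (P,)

codebook = rng.standard_normal((512, 64))     # K=512 codes of dim 64 (assumed)
obs = rng.standard_normal((16, 64))           # observation as 16 patch vectors
tokens = quantize(obs, codebook)
print(tokens.shape, tokens[:4])               # 16 discrete tokens per frame
```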
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can serve as a good alternative to ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- Video Language Planning [137.06052217713054]
Video Language Planning (VLP) is an algorithm built around a tree search procedure, in which we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
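The Video Language Planning summary above outlines a tree search over vision-language policies, values, and text-to-video dynamics. Below is a minimal recursive version of such a search with all three models replaced by stubs; the interfaces and scoring are placeholders, not the authors' implementation.

```python
# Sketch of a VLP-style tree search: a (stub) VLM policy proposes language
# subgoals, a (stub) text-to-video model rolls each out, and a (stub) VLM
# value function ranks branches.
import random

def vlm_policy(obs, task, n=3):
    return [f"subgoal({task},{obs},{i})" for i in range(n)]   # text subgoals

def text_to_video(obs, subgoal):
    return f"video[{obs}|{subgoal}]"                          # imagined rollout

def vlm_value(obs, task):
    return random.random()                                    # task progress

def vlp_search(obs, task, depth):
    """Return (score, plan) where plan interleaves language and video steps."""
    if depth == 0:
        return vlm_value(obs, task), []
    best_score, best_plan = float("-inf"), []
    for sg in vlm_policy(obs, task):
        nxt = text_to_video(obs, sg)
        score, rest = vlp_search(nxt, task, depth - 1)
        if score > best_score:
            best_score, best_plan = score, [(sg, nxt)] + rest
    return best_score, best_plan

score, plan = vlp_search("obs0", "make coffee", depth=3)
for subgoal, clip in plan:
    print(subgoal, "->", clip)   # multimodal (video and language) plan
```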
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a foundation model that combines expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos.
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
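The hierarchy in the Compositional Foundation Models entry (LLM plan, video-diffusion grounding, inverse-dynamics action recovery) maps naturally onto a three-stage pipeline. The sketch below wires the stages together with stub functions standing in for the large pretrained models; step counts and frame counts are illustrative assumptions.

```python
# Sketch of the three-stage hierarchy: an LLM drafts a symbolic plan, a video
# diffusion model grounds each step visually, and an inverse dynamics model
# recovers actions from consecutive generated frames. All stubs.
def llm_plan(task):
    return [f"step {i}: part of '{task}'" for i in range(3)]  # symbolic plan

def video_diffusion(obs, step):
    return [f"{obs}+{step}#frame{t}" for t in range(4)]       # grounded video

def inverse_dynamics(frame, next_frame):
    return f"action({frame} -> {next_frame})"                 # recovered action

def hierarchical_control(obs, task):
    actions = []
    for step in llm_plan(task):                 # language level
        frames = video_diffusion(obs, step)     # visual grounding
        for f0, f1 in zip(frames, frames[1:]):  # motor level
            actions.append(inverse_dynamics(f0, f1))
        obs = frames[-1]
    return actions

for a in hierarchical_control("obs0", "set the table")[:3]:
    print(a)
```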
This list is automatically generated from the titles and abstracts of the papers on this site.