Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
- URL: http://arxiv.org/abs/2502.16707v1
- Date: Sun, 23 Feb 2025 20:42:15 GMT
- Title: Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
- Authors: Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
- Abstract summary: Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
- Score: 90.00687889213991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism - it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
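The reflection mechanism is described above only in prose, so the following is a minimal Python sketch of one plausible reading of that loop: the VLM proposes an action, a generative model imagines the resulting observation, and the VLM critiques the imagined outcome before a low-level skill executes it. All names and interfaces here (propose, imagine, reflect, the environment's reset/step) are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of the imagine-then-reflect loop described in the abstract.
# Every interface below is a hypothetical placeholder, not the paper's API.
from typing import Protocol


class VLMPolicy(Protocol):
    def propose(self, obs, goal: str) -> str: ...
    def reflect(self, obs, imagined_obs, goal: str, action: str) -> str: ...


class GenerativeWorldModel(Protocol):
    def imagine(self, obs, action: str): ...


def reflective_step(vlm: VLMPolicy, world_model: GenerativeWorldModel, obs, goal: str) -> str:
    """Select one action for the current stage via imagine-then-reflect."""
    action = vlm.propose(obs, goal)                  # initial plan from the VLM
    imagined = world_model.imagine(obs, action)      # predicted future observation
    return vlm.reflect(obs, imagined, goal, action)  # critique and possibly revise


def run_episode(vlm: VLMPolicy, world_model: GenerativeWorldModel, env, goal: str,
                max_steps: int = 50):
    """Close the loop with the environment; `env` is assumed to expose reset()/step()."""
    obs = env.reset()
    for _ in range(max_steps):
        action = reflective_step(vlm, world_model, obs, goal)
        obs, done = env.step(action)                 # low-level skill executes the action
        if done:
            break
    return obs
```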
Related papers
- When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making [15.397582422113627]
Embodied decision-making is fundamental for AI agents operating in real-world environments.
In this study, we evaluate open-sourced Visual Language Models (VLMs) on multimodal human-centered decision-making tasks.
arXiv Detail & Related papers (2025-03-21T09:25:23Z)
- From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment [11.799979691988902]
FOREWARN is a novel framework to unlock the potential of Vision Language Models for runtime policy steering.
For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans.
For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation.
arXiv Detail & Related papers (2025-02-03T21:11:02Z)
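As a rough illustration of the foresight/forethought split summarized above, the sketch below rolls each candidate plan forward with a latent world model and lets a VLM-based scorer judge the predicted outcomes; the `encode`, `rollout`, and `vlm_score` callables are assumptions made for illustration, not FOREWARN's actual interfaces.

```python
# Illustrative foresight/forethought sketch: imagine outcomes in latent space,
# then let a VLM-aligned scorer choose among them. Interfaces are hypothetical.
from typing import Callable, Sequence, Tuple


def steer_policy(
    candidate_plans: Sequence[object],            # diverse low-level action plans
    encode: Callable[[object], object],           # observation -> latent state
    rollout: Callable[[object, object], object],  # (latent, plan) -> predicted future latent
    vlm_score: Callable[[object, str], float],    # (future latent, task) -> preference score
    obs: object,
    task: str,
) -> Tuple[object, float]:
    """Foresight: roll plans forward in latent space. Forethought: let the VLM judge them."""
    z0 = encode(obs)
    scored = [(vlm_score(rollout(z0, plan), task), plan) for plan in candidate_plans]
    best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return best_plan, best_score
```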
- MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation.
We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents.
We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z)
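A minimal sketch, assuming a generic `llm_call(system_prompt, user_prompt)` interface, of how high-level planning and low-level control code generation could be distributed across specialized agents as the summary above describes; the prompts and the planner/coder split are illustrative, not MALMM's implementation.

```python
# Hypothetical two-agent split: a planner agent decomposes the task, a coder agent
# writes control code per subtask. `llm_call` is an assumed generic LLM interface.
from typing import Callable, List


def multi_agent_manipulation(llm_call: Callable[[str, str], str],
                             task: str, scene: str) -> List[str]:
    plan = llm_call(
        "You are a high-level planner. Decompose the task into ordered subtasks, one per line.",
        f"Task: {task}\nScene: {scene}",
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    control_snippets = []
    for subtask in subtasks:
        code = llm_call(
            "You are a low-level controller. Write robot control code for this subtask only.",
            f"Subtask: {subtask}\nScene: {scene}",
        )
        control_snippets.append(code)
    return control_snippets
```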
- Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs [0.0]
Wonderful Team is a framework for executing high-level robotic planning in a zero-shot regime.
We show that Wonderful Team's performance on real-world semantic and physical planning tasks often exceeds methods that rely on separate vision systems.
arXiv Detail & Related papers (2024-07-26T21:18:57Z)
- Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models [81.55156507635286]
Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions.
Current learning methods often struggle with generalization to the long tail of unexpected situations without heavy human supervision.
We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection.
arXiv Detail & Related papers (2024-07-02T21:00:30Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning [11.526471286502993]
Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding.
Previous V-HOI detection models have made significant strides in accurate detection on specific datasets.
We propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR) to facilitate the performance of current V-HOI detection models.
arXiv Detail & Related papers (2024-03-15T08:51:15Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
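The sketch below shows one plausible form of the compact point-based affordance mentioned above: a few 2D keypoints predicted by the VLM on the observed image, lifted to end-effector waypoints. The field names and the `pixel_to_world` lifting step are assumptions, not MOKA's exact representation.

```python
# A hypothetical point-based affordance: 2D keypoints from a mark-annotated visual
# prompt, lifted to 3D end-effector waypoints. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point2D = Tuple[int, int]             # pixel coordinates in the observed image
Point3D = Tuple[float, float, float]  # end-effector position in the robot frame


@dataclass
class PointAffordance:
    grasp_point: Point2D                                    # where to grasp
    target_point: Point2D                                   # where the motion should end
    waypoints: List[Point2D] = field(default_factory=list)  # optional intermediate points


def affordance_to_path(aff: PointAffordance,
                       pixel_to_world: Callable[[Point2D], Point3D]) -> List[Point3D]:
    """Lift the 2D keypoints to a 3D motion path (e.g., using a depth map)."""
    pixels = [aff.grasp_point, *aff.waypoints, aff.target_point]
    return [pixel_to_world(px) for px in pixels]
```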
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
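As a hedged sketch of iterative visual prompting in the spirit described above, the code below repeatedly samples candidate actions, renders them as marks for the VLM to choose from, and refits the sampling distribution to the chosen marks; the one-dimensional action, the `annotate`/`query_vlm` interfaces, and the Gaussian refit are all assumptions.

```python
# Iterative visual prompting sketch: sample -> annotate image with marks -> let the
# VLM pick the best marks -> refit the distribution. Interfaces are hypothetical.
import random
from statistics import mean, pstdev


def pivot_optimize(query_vlm, annotate, image, task: str,
                   iters: int = 3, num_candidates: int = 8, top_k: int = 3) -> float:
    mu, sigma = 0.0, 1.0  # one scalar action parameter, purely for illustration
    for _ in range(iters):
        candidates = [random.gauss(mu, sigma) for _ in range(num_candidates)]
        prompt_img = annotate(image, candidates)             # draw numbered marks on the image
        chosen = query_vlm(prompt_img, task, top_k)          # indices of marks the VLM prefers
        elites = [candidates[i] for i in chosen]
        mu, sigma = mean(elites), max(pstdev(elites), 1e-3)  # refit and shrink the search
    return mu
```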
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation [22.071450379253235]
We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs).
By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation.
Experiments in simulator and real-world show the promising performance of ManipLLM.
arXiv Detail & Related papers (2023-12-24T06:38:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.