3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning
- URL: http://arxiv.org/abs/2502.08903v1
- Date: Thu, 13 Feb 2025 02:40:19 GMT
- Title: 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning
- Authors: Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao,
- Abstract summary: Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
- Score: 2.6670748466660523
- License:
- Abstract: Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results that the proposed framework achieved a 96.0\% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67\% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.
Related papers
- A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM [0.26334346517416873]
Vision-Language-Action (VLA) models enable robots to perform complex tasks by integrating visual context with linguistic commands.
To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory.
Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates.
arXiv Detail & Related papers (2024-10-21T00:36:02Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
Segment Anything (SAM) is a vision-foundation model for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [66.6886931183372]
We introduce DETR-style 3D perceptrons as 3D tokenizers, which connect LLM with a one-layer linear projector.
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
arXiv Detail & Related papers (2024-05-28T16:57:44Z) - OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z) - An Embodied Generalist Agent in 3D World [67.16935110789528]
We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world.
We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world.
Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation.
arXiv Detail & Related papers (2023-11-18T01:21:38Z) - 3D-GPT: Procedural 3D Modeling with Large Language Models [47.72968643115063]
We introduce 3D-GPT, a framework utilizing large language models(LLMs) for instruction-driven 3D modeling.
3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task.
Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results but also collaborates effectively with human designers.
arXiv Detail & Related papers (2023-10-19T17:41:48Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with
Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.