VLM-driven Behavior Tree for Context-aware Task Planning
- URL: http://arxiv.org/abs/2501.03968v2
- Date: Fri, 10 Jan 2025 10:38:49 GMT
- Title: VLM-driven Behavior Tree for Context-aware Task Planning
- Authors: Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Kazuhiro Sasabuchi, Katsushi Ikeuchi,
- Abstract summary: We propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit Behavior Trees (BTs)
A key feature of our approach lies in the conditional control through self-prompted visual conditions.
We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
- Score: 8.07285448283823
- License:
- Abstract: The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
Related papers
- Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment [18.256529559741075]
Large language models (LLMs) have risen to prominence as 'chatbots' for users to interact via natural language.
We have implemented a simple text-based environment that simulates, very abstractly, a household setting.
Our findings show that environmental complexity and game restrictions hamper performance.
arXiv Detail & Related papers (2025-02-17T12:20:39Z) - Visual Language Models as Operator Agents in the Space Domain [36.943670587532026]
Vision-Language Models (VLMs) can enhance autonomous control and decision-making in space missions.
In the software context, we employ VLMs to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers.
In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites.
arXiv Detail & Related papers (2025-01-14T03:03:37Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance [13.155859243167619]
We present SituationalLLM, a novel approach that integrates structured scene information into an large language model.
By encoding objects, attributes, and relationships in a custom Scene Graph Language, SituationalLLM actively identifies gaps in environmental context and seeks clarifications during user interactions.
Experimental results indicate that SituationalLLM outperforms generic LLM baselines in task specificity, reliability, and adaptability.
arXiv Detail & Related papers (2024-06-19T07:42:48Z) - MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT)
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - ProgPrompt: Generating Situated Robot Task Plans using Large Language
Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation functional across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.