Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
- URL: http://arxiv.org/abs/2406.09988v2
- Date: Wed, 16 Oct 2024 14:48:38 GMT
- Title: Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
- Authors: Xiaowen Sun, Xufeng Zhao, Jae Hee Lee, Wenhao Lu, Matthias Kerzel, Stefan Wermter
- Abstract summary: The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation.
Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans.
We introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks.
- Score: 15.03025428687218
- License:
- Abstract: The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation. However, detecting an object's state and generating a state-sensitive plan for robots is challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. However, to the best of our knowledge, there has been little investigation into whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performance of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach. The code for OSSA is available at https://github.com/Xiao-wen-Sun/OSSA
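For intuition, below is a minimal Python sketch of the two pipelines described in the abstract (modular DCM + LLM versus monolithic VLM). It is not the OSSA implementation from the linked repository: `dense_caption`, `query_llm`, and `query_vlm` are hypothetical stubs standing in for the pre-trained models, and the prompts and example captions are illustrative only.

```python
# Illustrative sketch of the two OSSA-style pipelines from the abstract.
# The model calls are stubbed; a real system would wrap a dense captioning
# model (DCM), an LLM, and a VLM respectively.

from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    steps: List[str]


def dense_caption(image_path: str) -> List[str]:
    """Hypothetical DCM stub: return state-aware captions for tabletop objects."""
    return ["a half-full mug of coffee", "an empty, crumpled chip bag", "a clean plate"]


def query_llm(prompt: str) -> str:
    """Hypothetical LLM stub: map a textual scene description to a plan."""
    return ("1. pour out the coffee and put the mug in the dishwasher\n"
            "2. throw the chip bag into the trash\n"
            "3. put the clean plate back in the cupboard")


def query_vlm(image_path: str, instruction: str) -> str:
    """Hypothetical VLM stub: map the image and instruction directly to a plan."""
    return query_llm(instruction)  # placeholder behaviour for this sketch


def modular_plan(image_path: str, task: str) -> Plan:
    """Modular pipeline: DCM captions -> LLM planning over text only."""
    captions = dense_caption(image_path)
    prompt = (f"Task: {task}\n"
              f"Objects on the table (with states): {'; '.join(captions)}\n"
              f"Produce a numbered, state-sensitive plan.")
    return Plan(steps=query_llm(prompt).splitlines())


def monolithic_plan(image_path: str, task: str) -> Plan:
    """Monolithic pipeline: a single VLM sees both the image and the task."""
    instruction = f"Task: {task}\nProduce a numbered, state-sensitive plan."
    return Plan(steps=query_vlm(image_path, instruction).splitlines())


if __name__ == "__main__":
    for step in modular_plan("tabletop.png", "clear the table").steps:
        print(step)
```

In the modular variant the LLM only ever sees text, so object-state information must survive the captioning step; the monolithic variant avoids that bottleneck, which is consistent with the paper's finding that the monolithic approach performs better.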
Related papers
- ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- LLMs for Robotic Object Disambiguation [21.101902684740796]
Our study reveals the LLM's aptitude for solving complex decision-making challenges.
A pivotal focus of our research is the object disambiguation capability of LLMs.
We have developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries.
arXiv Detail & Related papers (2024-01-07T04:46:23Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open-vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model [25.29170146456063]
This work addresses the problem of long-horizon task planning with the Large Language Model (LLM) in an open-world household environment.
Existing works fail to explicitly track key objects and attributes, leading to erroneous decisions in long-horizon tasks.
We propose an open state representation that provides continuous expansion and updating of object attributes.
arXiv Detail & Related papers (2023-11-29T07:23:22Z)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
- Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.