Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
- URL: http://arxiv.org/abs/2406.09988v2
- Date: Wed, 16 Oct 2024 14:48:38 GMT
- Title: Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
- Authors: Xiaowen Sun, Xufeng Zhao, Jae Hee Lee, Wenhao Lu, Matthias Kerzel, Stefan Wermter
- Abstract summary: The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation.
Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans.
We introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks.
- Score: 15.03025428687218
- License:
- Abstract: The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation. However, detecting an object's state and generating a state-sensitive plan for robots is challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. However, to the best of our knowledge, there has been little investigation into whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performance of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach. The code for OSSA is available at https://github.com/Xiao-wen-Sun/OSSA
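For intuition, below is a minimal Python sketch of the two pipelines described in the abstract (modular DCM + LLM versus monolithic VLM). It is not the OSSA implementation from the linked repository: `dense_caption`, `query_llm`, and `query_vlm` are hypothetical stubs standing in for the pre-trained models, and the prompts and example captions are illustrative only.

```python
# Illustrative sketch of the two OSSA-style pipelines from the abstract.
# The model calls are stubbed; a real system would wrap a dense captioning
# model (DCM), an LLM, and a VLM respectively.

from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    steps: List[str]


def dense_caption(image_path: str) -> List[str]:
    """Hypothetical DCM stub: return state-aware captions for tabletop objects."""
    return ["a half-full mug of coffee", "an empty, crumpled chip bag", "a clean plate"]


def query_llm(prompt: str) -> str:
    """Hypothetical LLM stub: map a textual scene description to a plan."""
    return ("1. pour out the coffee and put the mug in the dishwasher\n"
            "2. throw the chip bag into the trash\n"
            "3. put the clean plate back in the cupboard")


def query_vlm(image_path: str, instruction: str) -> str:
    """Hypothetical VLM stub: map the image and instruction directly to a plan."""
    return query_llm(instruction)  # placeholder behaviour for this sketch


def modular_plan(image_path: str, task: str) -> Plan:
    """Modular pipeline: DCM captions -> LLM planning over text only."""
    captions = dense_caption(image_path)
    prompt = (f"Task: {task}\n"
              f"Objects on the table (with states): {'; '.join(captions)}\n"
              f"Produce a numbered, state-sensitive plan.")
    return Plan(steps=query_llm(prompt).splitlines())


def monolithic_plan(image_path: str, task: str) -> Plan:
    """Monolithic pipeline: a single VLM sees both the image and the task."""
    instruction = f"Task: {task}\nProduce a numbered, state-sensitive plan."
    return Plan(steps=query_vlm(image_path, instruction).splitlines())


if __name__ == "__main__":
    for step in modular_plan("tabletop.png", "clear the table").steps:
        print(step)
```

In the modular variant the LLM only ever sees text, so object-state information must survive the captioning step; the monolithic variant avoids that bottleneck, which is consistent with the paper's finding that the monolithic approach performs better.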
Related papers
- ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- LLMs for Robotic Object Disambiguation [21.101902684740796]
Our study reveals the LLM's aptitude for solving complex decision-making challenges.
A pivotal focus of our research is the object disambiguation capability of LLMs.
We have developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries.
arXiv Detail & Related papers (2024-01-07T04:46:23Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open-vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model [25.29170146456063]
This work addresses the problem of long-horizon task planning with the Large Language Model (LLM) in an open-world household environment.
Existing works fail to explicitly track key objects and attributes, leading to erroneous decisions in long-horizon tasks.
We propose an open state representation that provides continuous expansion and updating of object attributes.
arXiv Detail & Related papers (2023-11-29T07:23:22Z)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
- Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.