Related papers: Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching

URL: http://arxiv.org/abs/2501.17665v1
Date: Wed, 29 Jan 2025 14:04:54 GMT
Title: Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching
Authors: Xuzhe Dang, Lada Kudláčková, Stefan Edelkamp
Abstract summary: This paper introduces Image2PDDL, a novel framework that leverages Vision-Language Models (VLMs) to automatically convert images of initial states and descriptions of goal states into PDDL problems. We evaluate the framework on various domains, including standard planning domains like blocksworld and sliding tile puzzles, using datasets with multiple difficulty levels. We will discuss a potential use case in robot-assisted teaching of students with Autism Spectrum Disorder.
Score: 0.9217021281095907
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automating the generation of Planning Domain Definition Language (PDDL) with Large Language Models (LLMs) opens a new research topic in AI planning, particularly for complex real-world tasks. This paper introduces Image2PDDL, a novel framework that leverages Vision-Language Models (VLMs) to automatically convert images of initial states and descriptions of goal states into PDDL problems. By providing a PDDL domain alongside visual inputs, Image2PDDL addresses key challenges in bridging perceptual understanding with symbolic planning, reducing the expertise required to create structured problem instances, and improving scalability across tasks of varying complexity. We evaluate the framework on various domains, including standard planning domains like blocksworld and sliding tile puzzles, using datasets with multiple difficulty levels. Performance is assessed on syntax correctness, ensuring grammar and executability, and content correctness, verifying accurate state representation in generated PDDL problems. The proposed approach demonstrates promising results across diverse task complexities, suggesting its potential for broader applications in AI planning. We will discuss a potential use case in robot-assisted teaching of students with Autism Spectrum Disorder.
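
To make the target artifact concrete, the following is a minimal sketch of what a generated blocksworld PDDL problem and a crude syntax check might look like. It is not the Image2PDDL implementation: the VLM step is replaced by a hand-written symbolic state, the function names `blocksworld_problem` and `parens_balanced` are hypothetical, and the predicate names (`on`, `ontable`, `clear`, `handempty`) assume the common four-operator blocksworld domain. The parenthesis check is only a rough stand-in for the paper's syntax-correctness criterion.

```python
def blocksworld_problem(name, on, on_table, goal_on):
    """Render a blocksworld PDDL problem from a symbolic state.

    on / goal_on: lists of (top, bottom) block pairs.
    on_table: list of blocks resting directly on the table.
    Predicate names are an assumption, not taken from the paper.
    """
    blocks = sorted({b for pair in on + goal_on for b in pair} | set(on_table))
    # A block is clear when no other block sits on top of it.
    below = {bottom for _, bottom in on}
    clear = [b for b in blocks if b not in below]
    init = (
        ["(handempty)"]
        + [f"(ontable {b})" for b in on_table]
        + [f"(on {t} {b})" for t, b in on]
        + [f"(clear {b})" for b in clear]
    )
    goal = [f"(on {t} {b})" for t, b in goal_on]
    return (
        f"(define (problem {name})\n"
        f"  (:domain blocksworld)\n"
        f"  (:objects {' '.join(blocks)})\n"
        f"  (:init {' '.join(init)})\n"
        f"  (:goal (and {' '.join(goal)})))"
    )


def parens_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    # Hypothetical state: block a on b, blocks b and c on the table.
    problem = blocksworld_problem(
        "demo", on=[("a", "b")], on_table=["b", "c"], goal_on=[("b", "c")]
    )
    print(problem)
    print("balanced:", parens_balanced(problem))
```

Content correctness, in the abstract's sense, would additionally require comparing the `:init` facts against the true state shown in the image; that comparison is outside this sketch.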

Related papers

OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
Generating Symbolic World Models via Test-time Scaling of Large Language Models [28.258707611580643]
Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. We introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks.
arXiv Detail & Related papers (2025-02-07T07:52:25Z)
LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore? [87.71321254733384]
Large language models (LLMs) can generate planning approaches tailored to specific planning problems. LLMs can achieve state-of-the-art performance on some standard IPC domains. We discuss whether these results signify a paradigm shift and how they can complement existing planning approaches.
arXiv Detail & Related papers (2025-01-30T22:21:12Z)
Multi-Agent Planning Using Visual Language Models [2.2369578015657954]
Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. We propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input.
arXiv Detail & Related papers (2024-08-10T08:10:17Z)
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages [20.62336315814875]
We introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty.
arXiv Detail & Related papers (2024-07-03T17:59:53Z)
MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
Visual AI and Linguistic Intelligence Through Steerability and Composability [0.0]
This study explores the capabilities of multimodal large language models (LLMs) in handling challenging multistep tasks that integrate language and vision. The research presents a series of 14 creatively and constructively diverse tasks, ranging from AI Lego Designing to AI Satellite Image Analysis.
arXiv Detail & Related papers (2023-11-18T22:01:33Z)
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions [126.3136109870403]
We introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C). VPG-C infers and completes the missing details essential for comprehending demonstrative instructions. We build DEMON, a comprehensive benchmark for demonstrative instruction understanding.
arXiv Detail & Related papers (2023-08-08T09:32:43Z)
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control. Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
HDDL 2.1: Towards Defining a Formalism and a Semantics for Temporal HTN Planning [64.07762708909846]
Real-world applications require modelling rich and diverse automated planning problems. The hierarchical task network (HTN) formalism does not allow representing planning problems with numerical and temporal constraints. We propose to fill the gap between HDDL and these operational needs and to extend HDDL by taking inspiration from PDDL 2.1.
arXiv Detail & Related papers (2023-06-12T18:21:23Z)
PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks. Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
HDDL 2.1: Towards Defining an HTN Formalism with Time [0.0]
Real-world applications of planning, such as those in industry and robotics, require modelling rich and diverse scenarios. Their resolution usually requires coordinated and concurrent action executions. In several cases, such planning problems are naturally decomposed in a hierarchical way and expressed by a Hierarchical Task Network formalism. This paper opens discussions on the semantics and the syntax needed to extend HDDL, and illustrates these needs with the modelling of an Earth Observing Satellite planning problem.
arXiv Detail & Related papers (2022-06-03T21:22:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.