PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
- URL: http://arxiv.org/abs/2506.20097v1
- Date: Wed, 25 Jun 2025 02:44:20 GMT
- Title: PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
- Authors: Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, Jesse Thomason
- Abstract summary: We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate plans and candidate symbolic semantics.
- Score: 22.688086293676328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.
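The loop described in the abstract (propose candidate semantics, plan and execute, refine a tree-structured belief until the goal is reached) can be illustrated with a toy sketch. This is not the authors' implementation: the domain, the predicate names, the `BeliefNode` class, and the breadth-first refinement strategy are all assumptions made for illustration, and a fixed predicate list stands in for LLM-generated candidates.

```python
from dataclasses import dataclass, field

# Hypothetical toy domain: the hidden rule is that the action needs both
# "hand_empty" and "at_object". The learner does not know this.
TRUE_PRECONDITIONS = frozenset({"hand_empty", "at_object"})
CANDIDATE_PREDICATES = ["hand_empty", "at_object", "light_on"]


@dataclass
class BeliefNode:
    """One hypothesis about an action's preconditions, plus its refinements."""
    preconditions: frozenset
    children: list = field(default_factory=list)
    refuted: bool = False


def execute(state: frozenset) -> bool:
    """Simulated environment: the action succeeds only if the true preconditions hold."""
    return TRUE_PRECONDITIONS <= state


def expand(node: BeliefNode) -> None:
    """Add one child per unused predicate (a stand-in for LLM-proposed refinements)."""
    for pred in CANDIDATE_PREDICATES:
        if pred not in node.preconditions:
            node.children.append(BeliefNode(node.preconditions | {pred}))


def induce_preconditions(max_iters: int = 20):
    """Breadth-first refinement of the belief tree until an execution succeeds."""
    root = BeliefNode(frozenset())
    frontier = [root]
    for step in range(1, max_iters + 1):
        if not frontier:
            break
        node = frontier.pop(0)
        # Plan under the hypothesis: establish exactly the hypothesised
        # preconditions, then try the action in the environment.
        if execute(node.preconditions):
            return node.preconditions, step
        # Execution failed, so this hypothesis is refuted and gets refined.
        node.refuted = True
        expand(node)
        frontier.extend(node.children)
    return None, max_iters


if __name__ == "__main__":
    learned, steps = induce_preconditions()
    print(sorted(learned), steps)
```

In this toy setting the only feedback is whether execution succeeded, which loosely mirrors the abstract's setup without explicit error messages. The real system additionally infers problem files, handles partial observability, and learns post-conditions, none of which is modelled here.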
Related papers
- Grounding Language Models with Semantic Digital Twins for Robotic Planning [6.474368392218828]
We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs). The proposed framework effectively combines high-level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.
arXiv Detail & Related papers (2025-06-19T17:38:00Z)
- Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning [2.111102681327218]
We present an approach integrating classical planning with Large Language Models. We propose a hierarchical formulation that enables robots to make unfeasible tasks tractable. Our method demonstrates its ability to adapt and execute tasks effectively within environments modeled using 3D Scene Graphs.
arXiv Detail & Related papers (2025-06-18T19:14:56Z)
- Language-Vision Planner and Executor for Text-to-Visual Reasoning [9.140712714337273]
Inspired by recent developments in large language models (LLMs) for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time.
arXiv Detail & Related papers (2025-06-09T13:55:55Z)
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and inverse dynamics model. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
arXiv Detail & Related papers (2025-04-23T17:53:34Z)
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z)
- Language Models can Infer Action Semantics for Symbolic Planners from Environment Feedback [26.03718733867297]
We propose Predicting Semantics of Actions with Language Models (PSALM).
PSALM learns action semantics by leveraging the strengths of both symbolic planners and Large Language Models (LLMs).
Experiments show PSALM boosts plan success rate from 36.4% (on Claude-3.5) to 100%, and explores the environment more efficiently than prior work to infer ground truth domain action semantics.
arXiv Detail & Related papers (2024-06-04T21:29:56Z)
- Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- PROC2PDDL: Open-Domain Planning Representations from Texts [56.627183903841164]
Proc2PDDL is the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations.
We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%.
arXiv Detail & Related papers (2024-02-29T19:40:25Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- Long-Horizon Planning and Execution with Functional Object-Oriented Networks [79.94575713911189]
We introduce the idea of exploiting object-level knowledge as a FOON for task planning and execution.
Our approach automatically transforms FOON into PDDL and leverages off-the-shelf planners, action contexts, and robot skills.
We demonstrate our approach on long-horizon tasks in CoppeliaSim and show how learned action contexts can be extended to never-before-seen scenarios.
arXiv Detail & Related papers (2022-07-12T19:29:35Z)