LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
- URL: http://arxiv.org/abs/2512.21243v1
- Date: Wed, 24 Dec 2025 15:36:21 GMT
- Title: LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
- Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov
- Abstract summary: Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. We propose LookPlanGraph - a method that leverages a scene graph composed of static assets and object priors.
- Score: 47.99822253865053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph - a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io.
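The graph-update idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and method names (`Node`, `SceneGraph`, `add_prior`, `update_from_view`) are hypothetical, the Vision Language Model is replaced by a hard-coded dictionary of observations, and dropping an unobserved prior is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str               # "asset" (static) or "object" (movable)
    verified: bool = True   # False marks an unconfirmed object prior


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)      # name -> Node
    location: dict = field(default_factory=dict)   # object name -> asset name

    def add_asset(self, name):
        self.nodes[name] = Node(name, "asset")

    def add_prior(self, obj, asset):
        # A prior is a guess about where an object is; it is not yet verified.
        self.nodes[obj] = Node(obj, "object", verified=False)
        self.location[obj] = asset

    def update_from_view(self, asset, seen):
        """Reconcile the graph with the objects reported at `asset`."""
        seen = set(seen)
        # Assumption: objects expected here but not observed are dropped.
        for obj, loc in list(self.location.items()):
            if loc == asset and obj not in seen:
                del self.location[obj]
                del self.nodes[obj]
        # Observed objects are either verified priors or new discoveries.
        for obj in seen:
            self.nodes[obj] = Node(obj, "object", verified=True)
            self.location[obj] = asset


graph = SceneGraph()
graph.add_asset("fridge")
graph.add_asset("table")
graph.add_prior("apple", "fridge")   # the apple has since been moved
graph.add_prior("milk", "fridge")

# Stand-in for VLM output on the egocentric view at each asset (hypothetical).
observations = {"fridge": ["milk"], "table": ["apple", "cup"]}
for asset in ("fridge", "table"):
    graph.update_from_view(asset, observations[asset])

print(graph.location)
```

In this toy run the stale prior for the apple is removed when the fridge is inspected, and the apple is rediscovered, together with a previously unknown cup, when the table is inspected.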
Related papers
- MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning [44.61781303455069]
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. We introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements.
arXiv Detail & Related papers (2025-12-18T18:59:03Z)
- Plan-over-Graph: Towards Parallelable LLM Agent Schedule [53.834646147919436]
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph. The model then understands this task graph as input and generates a plan for parallel execution.
arXiv Detail & Related papers (2025-02-20T13:47:51Z)
- VeriGraph: Scene Graphs for Execution Verifiable Robot Planning [33.8868315479384]
We propose VeriGraph, a framework that integrates vision-language models (VLMs) for robotic planning while verifying action feasibility.
VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement.
Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
arXiv Detail & Related papers (2024-11-15T18:59:51Z)
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels [53.85753597586226]
Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
arXiv Detail & Related papers (2023-03-16T02:02:18Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing the incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- Sequential Manipulation Planning on Scene Graph [90.28117916077073]
We devise a 3D scene graph representation, contact graph+ (cg+), for efficient sequential task planning.
Goal configurations, naturally specified on contact graphs, can be produced by a genetic algorithm with an optimization method.
A task plan is then succinct by computing the Graph Editing Distance (GED) between the initial contact graphs and the goal configurations, which generates graph edit operations corresponding to possible robot actions.
arXiv Detail & Related papers (2022-07-10T02:01:33Z)
- Hallucinative Topological Memory for Zero-Shot Visual Planning [86.20780756832502]
In visual planning (VP), an agent learns to plan goal-directed behavior from observations of a dynamical system obtained offline.
Most previous works on VP approached the problem by planning in a learned latent space, resulting in low-quality visual plans.
Here, we propose a simple VP method that plans directly in image space and displays competitive performance.
arXiv Detail & Related papers (2020-02-27T18:54:42Z)