SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks
- URL: http://arxiv.org/abs/2509.21928v1
- Date: Fri, 26 Sep 2025 06:14:55 GMT
- Title: SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks
- Authors: Jialiang Li, Wenzheng Wu, Gaojing Zhang, Yifan Han, Wenzhao Lian
- Abstract summary: Long-horizon manipulation tasks involve extended action sequences and complex object interactions. We propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image.
- Score: 3.688836621357062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. Bridging this gap requires two essential capabilities: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, both traditional and LLM-based, often exhibit limited generalization or sparse semantic reasoning, while image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE uses semantic scene graphs as a structural representation of scene states; this representation bridges task-level semantic reasoning and pixel-level visuo-motor control, and it also enables the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments demonstrate that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.
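The abstract's data model invites a concrete reading: a scene graph whose nodes are objects and whose edges are spatial or contact relations, with the planner emitting a sequence of edited graphs as sub-goals. Below is a minimal sketch of that representation; the class names, relation vocabulary, and `apply` helper are illustrative assumptions, not SAGE's actual interface.

```python
# Illustrative sketch of a scene-graph plan representation as described
# in the abstract. Names and relation vocabulary are assumptions.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    subject: str    # e.g. "mug"
    predicate: str  # e.g. "on", "inside", "next_to"
    obj: str        # e.g. "table"


@dataclass
class SceneGraph:
    objects: set[str] = field(default_factory=set)
    relations: set[Relation] = field(default_factory=set)

    def apply(self, remove: set[Relation], add: set[Relation]) -> "SceneGraph":
        """Return the next scene state after one manipulation step."""
        return SceneGraph(set(self.objects), (self.relations - remove) | add)


# A planner (VLM/LLM in SAGE) would output a sequence of such edits;
# each intermediate SceneGraph is one sub-goal, later rendered to an
# image by the structural image-editing pipeline.
initial = SceneGraph({"mug", "table", "shelf"}, {Relation("mug", "on", "table")})
subgoal = initial.apply(remove={Relation("mug", "on", "table")},
                        add={Relation("mug", "on", "shelf")})
```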
Related papers
- V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks [6.820118518027692]
V-CAGE is a closed-loop framework for generating semantically aligned manipulation datasets at scale. We propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. We also employ a hierarchical instruction decomposition module to bridge the gap between abstract intent and low-level control.
arXiv Detail & Related papers (2026-01-21T16:41:51Z)
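V-CAGE's closed loop (synthesize a scene, check it, resample on failure) can be pictured as a short rejection-sampling routine. The sketch below is a hypothetical outline; `synthesize`, `check_geometry`, and the `max_tries` policy are assumed names, not the paper's API.

```python
# Hypothetical sketch of a generate-and-verify loop in the spirit of
# V-CAGE's closed-loop dataset generation. All names are assumptions.
def generate_verified_scene(task_spec, synthesize, check_geometry, max_tries=5):
    """Resample scene instantiations until geometric checks pass."""
    for attempt in range(max_tries):
        scene = synthesize(task_spec)   # context-aware instantiation
        if check_geometry(scene):       # e.g. no interpenetration, reachable,
            return scene                # statically stable placements
    raise RuntimeError(f"no geometrically consistent scene after {max_tries} tries")
```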
- I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing [59.434028565445885]
I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
arXiv Detail & Related papers (2026-01-07T09:29:57Z)
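The "Decompose-then-Action" idea maps naturally onto a layered scene structure: the image becomes a stack of object layers, and edits become transforms on individual layers. The sketch below is an assumed illustration of that data model, not I2E's code.

```python
# Assumed illustration of a "Decompose-then-Action" layer model.
from dataclasses import dataclass
import numpy as np


@dataclass
class ObjectLayer:
    name: str
    rgba: np.ndarray      # HxWx4 crop of the object with an alpha channel
    xy: tuple[int, int]   # top-left placement on the canvas


def move_layer(layer: ObjectLayer, dx: int, dy: int) -> ObjectLayer:
    """One 'action' in the interactive environment: translate a layer."""
    return ObjectLayer(layer.name, layer.rgba,
                       (layer.xy[0] + dx, layer.xy[1] + dy))
```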
- CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates [35.64639873984766]
We propose a training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. In addition to enhancing reliability in sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
arXiv Detail & Related papers (2025-12-11T06:46:51Z)
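SGI's core move, inserting intermediate scene-graph states between the initial and goal graphs and checking each step, can be sketched as a one-edit-at-a-time walk over relation sets. The diffing logic and names below are assumed for illustration; the paper's method is training-free and LLM-driven.

```python
# Assumed sketch of scene-graph incremental updates: walk from the
# initial relation set to the goal relation set one edit at a time,
# validating each intermediate state.
def incremental_updates(initial: set, goal: set, is_valid) -> list[set]:
    """Return a chain of intermediate scene-graph states (relation sets)."""
    states, current = [set(initial)], set(initial)
    for rel in sorted(initial - goal, key=repr):   # relations to retract
        current = current - {rel}
        assert is_valid(current), f"invalid state after removing {rel}"
        states.append(set(current))
    for rel in sorted(goal - initial, key=repr):   # relations to assert
        current = current | {rel}
        assert is_valid(current), f"invalid state after adding {rel}"
        states.append(set(current))
    return states
```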
- Plan-over-Graph: Towards Parallelable LLM Agent Schedule [53.834646147919436]
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph. The model then takes this task graph as input and generates a plan for parallel execution.
arXiv Detail & Related papers (2025-02-20T13:47:51Z)
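Once a task graph exists, a parallel schedule falls out of topological layering: subtasks whose dependencies are all satisfied can run concurrently. A minimal sketch of that scheduling step follows; it illustrates the general technique, not the paper's LLM-based planner.

```python
# Minimal sketch: group subtasks of a dependency DAG into parallel
# "waves" via Kahn-style topological layering.
def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """deps maps each subtask to the set of subtasks it depends on."""
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return waves


# parallel_waves({"brew": {"boil", "grind"}, "boil": set(), "grind": set()})
# -> [["boil", "grind"], ["brew"]]
```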
- VeriGraph: Scene Graphs for Execution Verifiable Robot Planning [33.8868315479384]
We propose VeriGraph, a framework that integrates vision-language models (VLMs) for robotic planning while verifying action feasibility.
VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement.
Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
arXiv Detail & Related papers (2024-11-15T18:59:51Z)
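Verification against a scene graph typically reduces to precondition checking: an action is feasible only if the relations it requires hold (and none it forbids hold) in the current graph. The sketch below illustrates that idea with assumed names and an assumed step format; it is not VeriGraph's actual interface.

```python
# Assumed illustration of scene-graph plan verification: every step
# lists required/forbidden relations and the relation edits it makes.
def verify_plan(graph: set, plan: list[dict]) -> bool:
    """Relations are triples such as ("mug", "on", "table")."""
    state = set(graph)
    for step in plan:
        if not step["requires"] <= state or step["forbids"] & state:
            return False                            # infeasible step found
        state = (state - step["removes"]) | step["adds"]
    return True                                     # all preconditions held
```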
- Image Semantic Relation Generation [0.76146285961466]
Scene graphs can distil complex image information and correct the bias of visual models using semantic-level relations.
In this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model.
arXiv Detail & Related papers (2022-10-19T16:15:19Z)
- Sequential Manipulation Planning on Scene Graph [90.28117916077073]
We devise a 3D scene graph representation, contact graph+ (cg+), for efficient sequential task planning.
Goal configurations, naturally specified on contact graphs, can be produced by a genetic algorithm with an optimization method.
A task plan is then generated by computing the Graph Editing Distance (GED) between the initial contact graph and the goal configurations, yielding graph edit operations that correspond to possible robot actions.
arXiv Detail & Related papers (2022-07-10T02:01:33Z)
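The GED step has an off-the-shelf analogue: networkx can compute an optimal edit path between two relation graphs, and each edge edit (insert or delete a relation) can be read as a candidate robot action. A rough sketch, assuming relations are encoded as edge attributes; the action mapping is an illustrative assumption, not the paper's cg+ machinery.

```python
# Rough sketch of GED-based plan extraction with networkx.
import networkx as nx

init, goal = nx.DiGraph(), nx.DiGraph()
init.add_edge("mug", "table", rel="on")
goal.add_edge("mug", "shelf", rel="on")

# optimal_edit_paths returns ([(node_edit_path, edge_edit_path), ...], cost)
paths, cost = nx.optimal_edit_paths(
    init, goal,
    edge_match=lambda a, b: a["rel"] == b["rel"],  # edges match by relation
)
node_path, edge_path = paths[0]
for e1, e2 in edge_path:
    if e1 is not None and e2 is None:
        print("retract relation:", e1)   # e.g. pick the mug up off the table
    elif e1 is None and e2 is not None:
        print("assert relation:", e2)    # e.g. place the mug on the shelf
```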
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate task generalization as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
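Subtask-graph structure can be approximated from demonstrations by keeping only the precedence pairs that hold in every observed trajectory. The sketch below is a simplified stand-in for that inference idea, not MTSGI's actual procedure.

```python
# Simplified stand-in for subtask-graph inference: keep a precedence
# edge a -> b only if a finished before b in every observed trajectory.
from itertools import permutations


def infer_subtask_graph(trajectories: list[list[str]]) -> set[tuple[str, str]]:
    subtasks = {s for traj in trajectories for s in traj}
    candidates = set(permutations(subtasks, 2))
    for traj in trajectories:
        order = {s: i for i, s in enumerate(traj)}
        candidates = {
            (a, b) for a, b in candidates
            if a not in order or b not in order or order[a] < order[b]
        }
    return candidates  # surviving pairs are candidate precedence edges


# infer_subtask_graph([["grind", "boil", "brew"], ["boil", "grind", "brew"]])
# -> {("grind", "brew"), ("boil", "brew")}
```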
- Latent Space Roadmap for Visual Action Planning of Deformable and Rigid Object Manipulation [74.88956115580388]
Planning is performed in a low-dimensional latent state space that embeds images.
Our framework consists of two main components: a Visual Foresight Module (VFM) that generates a visual plan as a sequence of images, and an Action Proposal Network (APN) that predicts the actions between them.
arXiv Detail & Related papers (2020-03-19T18:43:26Z)
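The two-module split (a Visual Foresight Module emitting a sequence of sub-goal images and an Action Proposal Network connecting consecutive pairs) is a common pattern. The PyTorch-flavoured skeleton below shows only the data flow; the module internals and names are placeholders, not the paper's networks.

```python
# Skeleton of a visual-plan-then-act pipeline in the spirit of the
# VFM/APN split. Module internals are placeholders.
import torch
import torch.nn as nn


class ActionProposalNetwork(nn.Module):
    """Predicts an action connecting two consecutive plan images."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(action_dim))

    def forward(self, img_t: torch.Tensor, img_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the image pair along the channel axis (NCHW).
        return self.head(torch.cat([img_t, img_next], dim=1))


def execute_visual_plan(plan_images: list[torch.Tensor], apn: nn.Module):
    """Turn a VFM-produced image sequence into an action sequence."""
    return [apn(a, b) for a, b in zip(plan_images, plan_images[1:])]
```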
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme that avoids common pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
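In the action-recognition literature, a "dynamic image" is an approximate rank-pooling weighted sum of frames (Bilen et al., CVPR 2016) that emphasises motion over time. Assuming that reading of the summary's "dynamic image representation", a compact NumPy version looks like this; it is a sketch of the general technique, not this paper's code.

```python
# Approximate rank pooling into a single "dynamic image" (Bilen et al.,
# CVPR 2016): a weighted sum of frames whose weights favour later frames.
import numpy as np


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, C) float array -> (H, W, C) dynamic image."""
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    harmonics = np.cumsum(1.0 / t)  # H_t = sum_{i<=t} 1/i
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})
    alpha = 2 * (T - t + 1) - (T + 1) * (harmonics[-1] - harmonics + 1.0 / t)
    return np.tensordot(alpha, frames, axes=1)
```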