Compose by Focus: Scene Graph-based Atomic Skills
- URL: http://arxiv.org/abs/2509.16053v1
- Date: Fri, 19 Sep 2025 15:03:18 GMT
- Title: Compose by Focus: Scene Graph-based Atomic Skills
- Authors: Han Qi, Changhe Chen, Heng Yang,
- Abstract summary: We introduce a scene graph-based representation that focuses on task-relevant objects and relations.<n>We further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner.<n> Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines.
- Score: 7.653513529718339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z) - Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations [16.68801520494275]
We introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations.<n>We show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability.
arXiv Detail & Related papers (2026-01-16T17:35:00Z) - VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning [68.98988753763666]
We propose VisualCloze, a universal image generation framework.<n>VisualCloze supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation.<n>We introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.
arXiv Detail & Related papers (2025-04-10T17:59:42Z) - Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following [50.377287115281476]
We show that learning to associate the representations of current and future states with a temporal loss can improve compositional generalization.<n>We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
arXiv Detail & Related papers (2025-02-08T05:26:29Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training.<n>It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework.<n>It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation [6.324290412766366]
We propose a knowledge graph-based skill library construction method to organize manipulation knowledge.<n>We also propose a novel hierarchical skill transfer framework based on the skill library and tactile representation.<n> Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods.
arXiv Detail & Related papers (2024-11-18T16:42:07Z) - SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution [75.2573501625811]
Diffusion models have demonstrated strong potential for robotic trajectory planning.
generating coherent trajectories from high-level instructions remains challenging.
We propose SkillDiffuser, an end-to-end hierarchical planning framework.
arXiv Detail & Related papers (2023-12-18T18:16:52Z) - Fast Inference and Transfer of Compositional Task Structures for
Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Self-supervised Visual Reinforcement Learning with Object-centric
Representations [11.786249372283562]
We propose to use object-centric representations as a modular and structured observation space.
We show that the structure in the representations in combination with goal-conditioned attention policies helps the autonomous agent to discover and learn useful skills.
arXiv Detail & Related papers (2020-11-29T14:55:09Z) - Transforming task representations to perform novel tasks [12.008469282323492]
An important aspect of intelligence is the ability to adapt to a novel task without any direct experience (zero-shot)
We propose a general computational framework for adapting to novel tasks based on their relationship to prior tasks.
arXiv Detail & Related papers (2020-05-08T23:41:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.