Fugu-MT Paper Translation (Abstract): Compose by Focus: Scene Graph-based Atomic Skills

Paper overview: Compose by Focus: Scene Graph-based Atomic Skills

arxiv url: http://arxiv.org/abs/2509.16053v1
Date: Fri, 19 Sep 2025 15:03:18 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:11.216979
Title: Compose by Focus: Scene Graph-based Atomic Skills
Title (reference translation): Composing by Focus: Scene Graph-based Atomic Skills
Authors: Han Qi, Changhe Chen, Heng Yang
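Abstract summary: This paper proposes a scene graph-based representation that focuses on task-relevant objects and relations. It further combines "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments on both simulation and real-world manipulation tasks show substantially higher success rates than state-of-the-art baselines.
Reference score (independently computed attention measure): 7.653513529718339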
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.
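Abstract (reference translation): A key requirement for generalist robots is compositional generalization, the ability to combine atomic skills to solve complex, long-horizon tasks. Prior work has focused mainly on synthesizing planners that sequence pre-learned skills, but robust execution of the individual skills themselves remains difficult, because visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, which reduces sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and we further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments on both simulation and real-world manipulation tasks show substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.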
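
As a rough illustration of what "focusing" on task-relevant objects and relations could mean, here is a minimal Python sketch. It is not the authors' implementation: the names `Relation` and `focus_scene_graph`, the example objects, and the rule of keeping only relations whose two endpoints are both task-relevant are assumptions made for illustration. The framework described in the abstract also involves graph neural networks and diffusion-based imitation learning, which this sketch does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A directed edge in a toy scene graph (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


def focus_scene_graph(objects, relations, task_objects):
    """Keep only task-relevant objects and the relations between them.

    Assumption: the task-relevant object names are given as input; how
    they are chosen is outside the scope of this sketch.
    """
    keep = set(task_objects) & set(objects)
    focused = [r for r in relations if r.subject in keep and r.obj in keep]
    return sorted(keep), focused


# A cluttered scene: only the mug and the table matter for the skill.
objects = ["mug", "plate", "banana", "table"]
relations = [
    Relation("mug", "on", "table"),
    Relation("plate", "on", "table"),
    Relation("banana", "on", "plate"),
]
nodes, edges = focus_scene_graph(objects, relations, ["mug", "table"])

# Adding a distractor object leaves the focused graph unchanged.
cluttered_nodes, cluttered_edges = focus_scene_graph(
    objects + ["toy"],
    relations + [Relation("toy", "on", "table")],
    ["mug", "table"],
)
```

In this toy setting the focused graph is identical with or without the extra distractor, which is the kind of insensitivity to irrelevant variation the abstract refers to.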


Related papers