Fugu-MT paper translation (abstract): Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Paper overview: Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

arxiv url: http://arxiv.org/abs/2606.07436v2
Date: Thu, 11 Jun 2026 06:13:05 GMT
Status: translation complete
Last updated in system: 2026-06-12 13:39:59.489938
Title: Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning
Title (reference translation): Skill-3D: Evolution of Scene-Aware Skills for Agentic 3D Spatial Reasoning
Authors: Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang
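Abstract summary: Existing methods often misuse tools and show biased tool preferences in 3D scenarios. This work proposes Skill-3D, a framework that learns self-evolving scene-aware skills. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning.
Reference score (independently computed attention score): 41.24574881549564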
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.
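Abstract (reference translation): This paper studies agentic 3D spatial understanding, that is, MLLM agents that perform 3D reasoning through tool use. Existing methods often misuse tools and show biased tool preferences in 3D scenarios, so the agentic paradigm achieves only marginal gains over non-agentic strategies. We show that 3D spatial reasoning tasks are heterogeneous across scenes, yet these agents apply a uniform tool-use strategy to every scene instead of selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory in a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill and failed ones are attached to the skill as lessons. During training, when a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, so that the memory and the skill library co-evolve in a loop. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For example, it improves Gemini-3-Flash by 67% on MMSI-Bench. In addition, agentic post-training on skill-guided trajectories boosts Qwen3-VL-8B by 60% on VSI-Bench.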
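
The abstract describes the memory-and-skill loop only at a high level. The following is a minimal Python sketch of that loop under strong simplifying assumptions, not the authors' implementation. Every name here (`Trajectory`, `Skill`, `SceneMemory`, the scene label `indoor_room`, and the tool names) is hypothetical. Scenes are matched by an exact label instead of a learned similarity, and "distillation" is replaced by picking the most frequent successful tool sequence.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Trajectory:
    scene: str      # scene label from a hypothetical scene identifier
    tools: tuple    # ordered tool names the agent called
    success: bool   # whether the final answer was judged correct


@dataclass
class Skill:
    scene: str
    recommended_tools: tuple
    lessons: list = field(default_factory=list)  # failed tool sequences


class SceneMemory:
    """Toy stand-in for a scene memory plus skill library (hypothetical)."""

    def __init__(self):
        self.trajectories = defaultdict(list)
        self.skills = {}

    def record(self, traj):
        # Store the trajectory, then re-distill the skill for its scene.
        self.trajectories[traj.scene].append(traj)
        self._refine(traj.scene)

    def _refine(self, scene):
        trajs = self.trajectories[scene]
        wins = [t.tools for t in trajs if t.success]
        fails = [t.tools for t in trajs if not t.success]
        if not wins:
            return  # no successful trajectory yet, so nothing to distill
        # Assumption: the most frequent successful sequence stands in for distillation.
        best = Counter(wins).most_common(1)[0][0]
        # Failed sequences are attached as lessons, deduplicated in arrival order.
        self.skills[scene] = Skill(scene, best, list(dict.fromkeys(fails)))

    def lookup(self, scene):
        # Returns the skill to inject when the scene recurs, or None.
        return self.skills.get(scene)


memory = SceneMemory()
memory.record(Trajectory("indoor_room", ("depth", "measure"), True))
memory.record(Trajectory("indoor_room", ("caption",), False))
memory.record(Trajectory("indoor_room", ("depth", "measure"), True))
skill = memory.lookup("indoor_room")
```

In this sketch, each new trajectory for a known scene triggers a re-distillation, which mirrors the co-evolution of memory and skills described in the abstract only in a very reduced form. A real system would need a scene-similarity measure and a richer skill representation than a single tool sequence.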
