Fugu-MT Paper Translation (Abstract): TRACE: Textual Reasoning for Affordance Coordinate Extraction

Paper overview: TRACE: Textual Reasoning for Affordance Coordinate Extraction

arxiv url: http://arxiv.org/abs/2511.01999v1
Date: Mon, 03 Nov 2025 19:13:26 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.660084
Title: TRACE: Textual Reasoning for Affordance Coordinate Extraction
Authors: Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown
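Abstract summary: Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. This paper introduces TRACE, a novel methodology that integrates a textual Chain of Reasoning into the affordance prediction process. Experiments show that the TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the Where2Place benchmark.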
Reference score (independently computed attention metric): 4.374024319540872
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR's effectiveness. Furthermore, analysis of the model's attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE
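The abstract reports 48.1% accuracy on W2P as "a 9.6% relative improvement" but does not state the baseline score. As a rough sketch, assuming the relative improvement is computed as (new - baseline) / baseline and that both reported figures are rounded, the implied baseline can be backed out as follows. The function name `implied_baseline` is hypothetical and the result is only an approximation, not a number taken from the paper.

```python
def implied_baseline(new_accuracy: float, relative_gain: float) -> float:
    """Back out a baseline score from a reported score and its relative gain.

    Assumes relative_gain = (new_accuracy - baseline) / baseline.
    """
    return new_accuracy / (1.0 + relative_gain)


new_acc = 48.1    # W2P accuracy reported in the abstract, in percent
rel_gain = 0.096  # 9.6% relative improvement reported in the abstract

baseline = implied_baseline(new_acc, rel_gain)  # approximate, inputs are rounded
absolute_gain = new_acc - baseline              # gain in percentage points

print(f"implied baseline: {baseline:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} points")
```

Under these assumptions the implied baseline is roughly 43.9%, so the 9.6% relative gain corresponds to about 4.2 percentage points of absolute accuracy.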


Related papers