Fugu-MT Paper Translation (Abstract): GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Paper overview: GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

arxiv url: http://arxiv.org/abs/2605.12369v1
Date: Tue, 12 May 2026 16:38:40 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.020093
Title: GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
Title (reference translation): GuidedVLA: Identifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
Authors: Xiaosong Jia, Bowen Yang, Zuhao Ge, Xian Nie, Yuchen Zhou, Cunxin Fan, Yufeng Li, Yilin Chai, Chao Jing, Zijian Liang, Qingwen Bu, Haidong Cao, Chao Wu, Qifeng Li, Zhenjie Yang, Chenhe Zhang, Hongyang Li, Zuxuan Wu, Junchi Yan, Yu-Gang Jiang
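Abstract summary: This paper introduces GuidedVLA, a framework that guides action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. The results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
Reference score (attention score computed independently by this site): 101.37117235471709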
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
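Abstract (reference translation): Vision-Language-Action (VLA) models aim at general robot learning by aligning action as a modality within a powerful Vision-Language Model (VLM). Existing VLAs depend on end-to-end supervision to learn task-relevant features implicitly. Without explicit guidance, however, these models overfit to spurious correlations such as visual shortcuts and environmental noise, which limits their generalization. This paper introduces GuidedVLA, a framework that manually guides action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals so that each captures a distinct factor. As an initial study, the paradigm is instantiated with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared with strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. The results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.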
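
The abstract says that individual attention heads are supervised by manually defined auxiliary signals, but it does not give the loss. The sketch below is one plausible reading, not the authors' implementation: it assumes each guided head receives a non-negative relevance mask over input tokens and is penalized with a cross-entropy term added to the action loss. The function names, the mask format, the cross-entropy choice, and the weight of 0.1 are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_guidance_loss(attn_logits, target_mask, eps=1e-9):
    """Cross-entropy between one head's attention and a normalized target mask.

    attn_logits: raw attention scores of a single head over N tokens.
    target_mask: non-negative relevance weights over the same N tokens
                 (hypothetical auxiliary signal, e.g. 1.0 on object patches).
    """
    if len(attn_logits) != len(target_mask):
        raise ValueError("attention scores and mask must have equal length")
    mass = sum(target_mask)
    if mass <= 0:
        return 0.0  # no supervision available for this sample
    target = [t / mass for t in target_mask]
    attn = softmax(attn_logits)
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def total_loss(action_loss, head_logits, head_targets, weight=0.1):
    """Action loss plus a weighted sum of per-head guidance terms.

    head_logits / head_targets: dicts keyed by a head name such as
    "object", "spatial", or "temporal" (names assumed for illustration).
    """
    aux = sum(
        head_guidance_loss(head_logits[name], head_targets[name])
        for name in head_targets
    )
    return action_loss + weight * aux


if __name__ == "__main__":
    mask = [1.0, 0.0, 0.0, 0.0]  # only token 0 is task-relevant
    print(head_guidance_loss([5.0, 0.0, 0.0, 0.0], mask))  # head looks at token 0
    print(head_guidance_loss([0.0, 0.0, 0.0, 5.0], mask))  # head looks elsewhere
```

A head whose attention already concentrates on the masked tokens incurs a smaller penalty than one attending elsewhere, which is the behavior a per-head auxiliary signal is meant to encourage.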

