Fugu-MT Paper Translation (Summary): FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

Paper Summary: FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.28740v1
Date: Mon, 30 Mar 2026 17:50:54 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.550567
Title: FocusVLA: Focused Visual Utilization for Vision-Language-Action Models
Title (reference translation): FocusVLA: Focused Visual Utilization for Vision-Language-Action Models
Authors: Yichi Zhang, Weihao Yuan, Yizhuo Zhang, Xidong Zhang, Jia Wan,
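Abstract summary: Vision-Language-Action (VLA) models improve action generation by conditioning on rich vision-language information. FocusVLA is a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision to action.
Reference score (independently computed attention metric): 12.859683124954339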
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes attention difficult to focus on the correct regions, and (3) task-irrelevant visual information introduces substantial noise - together severely impairing the quality of action. In this paper, we investigate how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is primarily limited by how visual information is utilized, rather than by the quality of visual representations. Based on these insights, we introduce FocusVLA, a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision to action. Specifically, we first propose Modality Cascaded Attention to eliminate shortcut pathways, thereby compelling VLA models to rely on task-relevant visual details for action generation. Furthermore, we propose Focus Attention, which dynamically selects task-relevant visual patches to control information quantity while explicitly modulating their influence to suppress task-irrelevant noise. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that FocusVLA not only effectively leverages visual details to perform dexterous manipulations, but also substantially improves performance and accelerates convergence across a variety of tasks.
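Abstract (reference translation): Vision-Language-Action (VLA) models improve action generation by conditioning on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias causes models to overlook visual details, (2) an excessive number of visual tokens makes it difficult to focus attention on the correct regions, and (3) task-irrelevant visual information introduces substantial noise; together, these severely impair action quality. This paper investigates how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is limited primarily by how visual information is utilized rather than by the quality of the visual representations. Based on these findings, we introduce FocusVLA, a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision to action. Specifically, we first propose Modality Cascaded Attention, which eliminates shortcut pathways and thereby compels VLA models to rely on task-relevant visual details for action generation. Furthermore, we propose Focus Attention, which dynamically selects task-relevant visual patches to control the amount of information while explicitly modulating their influence to suppress task-irrelevant noise. Extensive experiments on both simulated and real-world robotic benchmarks show that FocusVLA not only effectively leverages visual details to perform dexterous manipulation, but also substantially improves performance and accelerates convergence across a variety of tasks.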

Related Papers