Fugu-MT Paper Translation (Abstract): Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Paper overview: Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.15618v1
Date: Mon, 16 Mar 2026 17:59:54 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.729417
Title: Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Title (reference translation): Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
Abstract summary: Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation. We propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively.
Reference score (independently computed attention score): 66.96421290733126
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
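Abstract (reference translation): Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, where reliable action prediction depends on accurately interpreting and integrating visual observations conditioned on language instructions. Recent studies have tried to improve the visual capabilities of VLA models, but most approaches treat the LLM backbone as a black box and offer limited insight into how visual information is grounded into action generation. We therefore carry out a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens gradually decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. The framework enables shared attention between the vision foundation model and the VLA backbone, and injects multi-level visual features from the vision expert into the deeper layers of the VLA backbone to strengthen visual representations for precise and complex manipulation. We also introduce Action-Guided Visual Pruning (AGVP), which uses shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing the visual cues that matter for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, and provides new insights for the design of visually enhanced VLA models.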
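The abstract describes Action-Guided Visual Pruning only at a high level: shallow-layer attention is used to decide which visual tokens to keep. The following is a minimal sketch of that general idea (top-k selection by attention score), not the authors' implementation. The function name `select_visual_tokens`, the `attention[q][v]` layout (weight from action query `q` to visual token `v`), the mean-over-queries scoring, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def select_visual_tokens(
    attention: Sequence[Sequence[float]],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of visual tokens to keep, in their original order.

    attention[q][v] is assumed to be a shallow-layer attention weight from
    action query q to visual token v. This layout is hypothetical and is
    not taken from the paper.
    """
    if not attention or not attention[0]:
        raise ValueError("attention must be a non-empty 2D sequence")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    num_queries = len(attention)
    num_tokens = len(attention[0])

    # Score each visual token by the mean attention it receives
    # across all action queries (an assumed scoring rule).
    scores = [
        sum(row[v] for row in attention) / num_queries
        for v in range(num_tokens)
    ]

    # Keep the top-scoring fraction of tokens, at least one.
    num_keep = max(1, math.ceil(keep_ratio * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda v: scores[v], reverse=True)

    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:num_keep])


# Two action queries attending over four visual tokens (toy values).
toy_attention = [
    [0.10, 0.60, 0.05, 0.25],
    [0.20, 0.50, 0.10, 0.20],
]
kept = select_visual_tokens(toy_attention, keep_ratio=0.5)
print(kept)  # [1, 3]
```

Returning indices rather than the pruned features keeps the sketch independent of any tensor library; in a real model the indices would be used to gather rows from the visual feature tensor before the deeper layers.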

Related papers