Fugu-MT paper translation (abstract): VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Paper summary: VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

arxiv url: http://arxiv.org/abs/2510.09607v1
Date: Fri, 10 Oct 2025 17:59:56 GMT
Status: Translation complete
System update date: 2025-10-14 00:38:49.514279
Title: VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Title (reference translation): VITA-VLA: Effectively Teaching Vision-Language Models to Act via Action Expert Distillation
Authors: Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan
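Abstract summary: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). This paper proposes a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. In real-world experiments across five manipulation tasks, the method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement).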
Reference score (independently computed attention score): 76.13140980997508
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement). These results demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
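Abstract (reference translation): Vision-Language Action (VLA) models markedly advance robotic manipulation by drawing on the strong perception capabilities of pretrained vision-language models (VLMs). Integrating action modules into these pretrained models improves the generalization of VLA methods, but training them from scratch is costly. This work proposes a simple yet effective distillation-based framework that gives VLMs action-execution capability by transferring knowledge from pretrained small action models. The architecture keeps the original VLM structure and adds only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, a two-stage training strategy is adopted. First, a lightweight alignment maps VLM hidden states into the action space of the small action model, which allows its pretrained action decoder to be reused and avoids expensive pretraining. Second, the language model, state encoder, and action modules are selectively fine-tuned so that multimodal inputs are integrated with precise action generation. Specifically, the action token gives the VLM a direct handle for predicting future actions, while the state encoder lets the model incorporate robot dynamics that vision alone does not capture. This design is considerably more efficient than training a large VLA model from scratch. Compared with previous state-of-the-art methods, the method achieves a 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, the method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement).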
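
The first training stage described in the abstract (mapping VLM hidden states into the action space of the small action model so that its pretrained decoder can be reused) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear maps standing in for the teacher and the projector, the dimensions, the mean-squared-error objective, and all names such as `teacher_encoder`, `teacher_decoder`, and `projector` are hypothetical.

```python
import random

random.seed(0)

# Toy sizes chosen for illustration only; they are not from the paper.
HIDDEN_DIM, LATENT_DIM, ACTION_DIM = 4, 3, 2


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols):
    """Random matrix with entries drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical frozen teacher: a map that produces action-space latents and a
# pretrained action decoder. Neither is updated during alignment.
teacher_encoder = rand_matrix(LATENT_DIM, HIDDEN_DIM)
teacher_decoder = rand_matrix(ACTION_DIM, LATENT_DIM)

# Stand-ins for VLM hidden states read at the action token position.
hidden_states = [[random.uniform(-1, 1) for _ in range(HIDDEN_DIM)]
                 for _ in range(64)]
teacher_latents = [matvec(teacher_encoder, h) for h in hidden_states]

# Trainable projector from the VLM hidden space into the teacher action space.
projector = [[0.0] * HIDDEN_DIM for _ in range(LATENT_DIM)]


def alignment_loss(proj):
    """Mean squared error between projected hidden states and teacher latents."""
    total = 0.0
    for h, z in zip(hidden_states, teacher_latents):
        pred = matvec(proj, h)
        total += sum((p - t) ** 2 for p, t in zip(pred, z))
    return total / len(hidden_states)


def train_alignment(proj, steps=500, lr=0.1):
    """Full-batch gradient descent on the projector only (assumed objective)."""
    n = len(hidden_states)
    for _ in range(steps):
        grad = [[0.0] * HIDDEN_DIM for _ in range(LATENT_DIM)]
        for h, z in zip(hidden_states, teacher_latents):
            pred = matvec(proj, h)
            for i in range(LATENT_DIM):
                err = 2.0 * (pred[i] - z[i]) / n
                for j in range(HIDDEN_DIM):
                    grad[i][j] += err * h[j]
        for i in range(LATENT_DIM):
            for j in range(HIDDEN_DIM):
                proj[i][j] -= lr * grad[i][j]
    return proj


initial_loss = alignment_loss(projector)
train_alignment(projector)
final_loss = alignment_loss(projector)

# After alignment, the frozen teacher decoder is reused on projected states.
student_action = matvec(teacher_decoder, matvec(projector, hidden_states[0]))
teacher_action = matvec(teacher_decoder, teacher_latents[0])
```

In this sketch only `projector` is trained, which mirrors the idea of a lightweight alignment step. The second stage in the abstract, selective fine-tuning of the language model, state encoder, and action modules, is not modeled here.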

Related papers