Fugu-MT Paper Translation (Abstract): APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Paper overview: APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

arxiv url: http://arxiv.org/abs/2606.12366v1
Date: Wed, 10 Jun 2026 17:34:25 GMT
Status: Translation complete
Last updated in system: 2026-06-18 14:04:58.450585
Title: APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
Title (reference translation): APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
Authors: Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang
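Abstract summary: This paper proposes a two-stage training method that emphasizes Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior.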
Reference score (independently computed attention score): 22.87409999086972
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $π$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
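Abstract (reference translation): Vision-Language-Action (VLA) models that combine pretrained Vision-Language Models (VLMs) with continuous action experts achieve strong manipulation performance, but their generalization to out-of-distribution (OOD) language instructions remains inadequate. A known challenge is the structural imbalance of VLA data: language is far less diverse than visual and action content, so policies tend toward visual shortcuts. Discrete-action methods mitigate this through vision-language co-training, but continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. From a Bayesian perspective, this paper factorizes the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and proposes APT, a two-stage training method that emphasizes Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $π$ and GR00T-style architectures. Comprehensive experiments show that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/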
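
The abstract states that Stage 2 injects language through a gated fusion mechanism that preserves the visuomotor prior learned in Stage 1, but it does not give the form of the gate. The sketch below is only one plausible reading, not the authors' implementation: it assumes a zero-initialized tanh gate (a common choice in gated cross-modal fusion), so that the fused features equal the Stage 1 prior features when Stage 2 begins. The function name `gated_fusion`, the scalar `alpha`, and the toy feature vectors are all hypothetical.

```python
import math


def gated_fusion(prior_feat, lang_feat, alpha):
    """Blend language-conditioned features into visuomotor-prior features.

    Assumed (hypothetical) form: out = prior + tanh(alpha) * lang.
    With alpha = 0 the gate is closed and the output equals the prior.
    """
    if len(prior_feat) != len(lang_feat):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(alpha)
    return [p + gate * l for p, l in zip(prior_feat, lang_feat)]


# Toy feature vectors (made-up numbers, for illustration only).
prior = [0.5, -1.0, 2.0]
lang = [1.0, 1.0, -1.0]

closed = gated_fusion(prior, lang, alpha=0.0)  # gate closed: prior preserved
opened = gated_fusion(prior, lang, alpha=1.0)  # gate partly open: language mixed in
```

Under this assumption, `closed` reproduces `prior` exactly, which is the sense in which a gate can "preserve" the pretrained prior; as `alpha` is learned away from zero, language features are blended in gradually.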
