Fugu-MT Paper Translation (Abstract): Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Paper overview: Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

arxiv url: http://arxiv.org/abs/2509.02055v1
Date: Tue, 02 Sep 2025 07:51:59 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.951187
Title: Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Title (reference translation): Align-Then-stEer: Adapting Vision-Language-Action Models via Unified Latent Guidance
Authors: Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li
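Abstract summary: Align-Then-stEer (ATE) is a novel, data-efficient, plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
Reference score (independently computed attention score): 63.33213516925946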
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
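Abstract (reference translation): Vision-Language-Action (VLA) models pre-trained on large and diverse datasets show remarkable potential for general-purpose robotic manipulation. However, adapting these models to downstream tasks remains a major bottleneck, particularly when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy causes a large mismatch in action distributions, so effective fine-tuning requires extensive data and compute. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, plug-and-play adaptation framework. ATE first aligns the differing action spaces by building a unified latent space, in which a variational autoencoder constrained by reverse KL divergence embeds the adaptation actions into modes of the pre-training action latent distribution. It then steers the generation process of the diffusion- or flow-based VLA during fine-tuning through a guidance mechanism that pushes the model's output distribution toward the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared with direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a notable 32% success rate gain in a real-world cross-embodiment setting. Our work provides a general and lightweight solution that greatly improves the practicality of deploying VLA models to new robotic platforms and tasks.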
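
The abstract says the variational autoencoder is constrained by reverse KL divergence so that adaptation actions land in modes of the pre-training latent distribution. The toy sketch below is not the authors' implementation. It only illustrates, under assumed values, why reverse KL tends to select a single mode while forward KL tends to cover all of them. The two-mode 1D mixture, the two candidate Gaussians, and all function names are hypothetical and chosen for illustration.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x):
    """Assumed stand-in for a multi-modal pre-training latent density:
    an equal mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)


def kl_divergence(a_pdf, b_pdf, lo=-15.0, hi=15.0, steps=6000):
    """Approximate KL(a || b) with the midpoint rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        a = a_pdf(x)
        b = b_pdf(x)
        if a > 0.0:
            total += a * math.log(a / b) * dx
    return total


def on_mode(x):
    """Candidate q that sits on one mode of the mixture."""
    return normal_pdf(x, 3.0, 1.0)


def broad(x):
    """Candidate q that matches the mixture's mean and variance."""
    return normal_pdf(x, 0.0, math.sqrt(10.0))


# Reverse KL, KL(q || p): penalizes q for placing mass where p is small.
rev_mode = kl_divergence(on_mode, mixture_pdf)
rev_broad = kl_divergence(broad, mixture_pdf)

# Forward KL, KL(p || q): penalizes q for missing mass where p is large.
fwd_mode = kl_divergence(mixture_pdf, on_mode)
fwd_broad = kl_divergence(mixture_pdf, broad)

print(f"reverse KL: on-mode={rev_mode:.3f}  broad={rev_broad:.3f}")
print(f"forward KL: on-mode={fwd_mode:.3f}  broad={fwd_broad:.3f}")
```

In this toy setting the on-mode candidate has the lower reverse KL, while the broad candidate has the lower forward KL. That is the usual mode-seeking versus mass-covering contrast, and it is consistent with the abstract's description of embedding adaptation actions into modes of the pre-training latent distribution.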

Related papers