Fugu-MT Paper Translation (Abstract): How Do VLAs Effectively Inherit from VLMs?

Paper summary: How Do VLAs Effectively Inherit from VLMs?

arxiv url: http://arxiv.org/abs/2511.06619v1
Date: Mon, 10 Nov 2025 01:58:02 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:45.027328
Title: How Do VLAs Effectively Inherit from VLMs?
Title (reference translation): How VLAs Effectively Inherit from VLMs
Authors: Chuheng Zhang, Rushuai Yang, Xiaoyu Chen, Kaixin Wang, Li Zhao, Yi Chen, Jiang Bian
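Abstract summary: Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. We introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task. We investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions.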
Reference score (independently computed attention metric): 28.72002932514493
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task where the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing: knowledge associated with emojis is ubiquitous in Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task in both a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
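Abstract (reference translation): Vision-language-action (VLA) models hold the promise of achieving generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). How do VLAs inherit prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark called GrinningFace, an emoji tabletop manipulation task in which a robot arm is asked to place objects on printed emojis corresponding to language instructions. Knowledge associated with emojis is widespread in the Internet-scale datasets used for VLM pre-training, but emojis themselves are largely absent from standard robotics datasets. As a result, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task in both a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, this work not only demonstrates the importance of preserving VLM priors for VLA generalization but also contributes to establishing guidelines for future research on developing truly generalizable AI systems.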

Related papers