Fugu-MT Paper Translation (Summary): Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

Paper summary: Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

arxiv url: http://arxiv.org/abs/2602.09722v1
Date: Tue, 10 Feb 2026 12:25:43 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.323325
Title: Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Title (reference translation): Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Authors: Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin
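Abstract summary: Vision-Language-Action (VLA) models show strong promise for generalist robot control. It remains unclear whether the standard "scale data" recipe translates to robotics. This paper presents a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots.
Reference score (independently computed attention level): 65.37179698521766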
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow-matching, we ablate key design decisions under matched conditions and evaluate in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: we show that a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: we find that naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, underscoring the fragility of indiscriminate data scaling. (3) Training regularization: we observe that intuitive strategies, such as sensory dropout and multi-stage fine-tuning, do not consistently improve performance at scale. Together, this study challenges some common assumptions about embodied scaling and provides practical guidance for training large-scale VLA policies from diverse robotic data. Project website: https://research.beingbeyond.com/rethink_vla
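Abstract (reference translation): Vision-Language-Action (VLA) models show strong promise for general-purpose robot control, but it is unclear whether the standard "scale data" recipe translates to robotics. This paper presents a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow matching, we ablate key design decisions under matched conditions and evaluate them in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, highlighting the fragility of indiscriminate data scaling. (3) Training regularization: intuitive strategies such as sensory dropout and multi-stage fine-tuning do not consistently improve performance. This study provides practical guidance for training large-scale VLA policies from diverse robotic data and challenges some common assumptions about embodied scaling. Project website: https://research.beingbeyond.com/rethink_vla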

Related papers