Fugu-MT Paper Translation (Abstract): Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Paper overview: Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

arxiv url: http://arxiv.org/abs/2602.12281v1
Date: Thu, 12 Feb 2026 18:59:59 GMT
Status: Translation complete
Last updated in system: 2026-02-13 21:07:26.003979
Title: Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Title (reference translation): Scaling Verification Is More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
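Abstract summary: We propose a contrastive verifier for vision-language-action alignment. Our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model. It then repeatedly generates action candidates for each instruction and uses the verifier to select the optimal high-level prompt and low-level action chunks.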
Reference score (independently computed attention score): 58.93227458806748
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
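Abstract (reference translation): The long-standing vision of general-purpose robots depends on their ability to understand natural language instructions and act on them. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet the actions they generate can still be misaligned with the given instructions. This paper investigates test-time verification as a means of shrinking the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and show that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To exploit this scaling law, we introduce CoVer, a contrastive verifier for vision-language-action alignment, and show that the architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, the framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and uses the verifier to select the optimal high-level prompt and low-level action chunks. Compared with scaling policy pre-training on the same data, the verification approach yields gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves gains of 14% in task progress and 9% in success rate.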
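
Illustrative sketch (not from the paper): the deployment procedure described in the abstract, which precomputes rephrased instructions, samples several action candidates per rephrasing, and lets a verifier pick the prompt and the action chunk, can be approximated in a few lines of Python. Every name here (`select_prompt_and_action`, `toy_policy`, `toy_verifier`, `TOY_CENTERS`) is hypothetical, and the toy policy and verifier merely stand in for a VLA and for CoVer. The two-stage rule (choose the rephrasing with the highest mean verifier score, then the best-scoring candidate under it) and the choice to score candidates against the original instruction are assumptions made for illustration; the abstract does not specify how the selection is carried out.

```python
import random

# Toy stand-in for rephrasings precomputed by a VLM: each maps to the centre
# of the 1-D "action" that the toy policy tends to produce for it.
TOY_CENTERS = {"rephrasing A": 0.0, "rephrasing B": 1.0}


def toy_policy(observation, prompt, rng):
    """Hypothetical policy: samples a one-element action chunk for a prompt."""
    return [TOY_CENTERS[prompt] + rng.uniform(-0.1, 0.1)]


def toy_verifier(observation, instruction, action):
    """Hypothetical verifier: higher is better, best when the action equals 1.0."""
    return -abs(action[0] - 1.0)


def select_prompt_and_action(observation, instruction, rephrasings,
                             sample_action, score, num_samples, rng):
    """Two-stage selection: best prompt by mean score, then best action under it."""
    if not rephrasings or num_samples < 1:
        raise ValueError("need at least one rephrasing and one sample")
    best_prompt, best_mean, best_scored = None, float("-inf"), []
    for prompt in rephrasings:
        # Repeatedly sample action candidates for this rephrased instruction.
        candidates = [sample_action(observation, prompt, rng)
                      for _ in range(num_samples)]
        # Assumption: candidates are scored against the original instruction.
        scored = [(score(observation, instruction, a), a) for a in candidates]
        mean = sum(s for s, _ in scored) / len(scored)
        if best_prompt is None or mean > best_mean:
            best_prompt, best_mean, best_scored = prompt, mean, scored
    # Low-level choice: highest-scoring action chunk under the chosen prompt.
    top_score, top_action = max(best_scored, key=lambda pair: pair[0])
    return best_prompt, top_action, top_score


if __name__ == "__main__":
    prompt, action, value = select_prompt_and_action(
        None, "original instruction", list(TOY_CENTERS),
        toy_policy, toy_verifier, num_samples=8, rng=random.Random(0))
    print(prompt, action, value)
```

The cost of this sketch grows with the number of rephrasings times the number of samples per rephrasing, which corresponds to the two dimensions the abstract describes scaling jointly.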

Related papers