Fugu-MT Paper Translation (Abstract): Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

Paper overview: Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

arxiv url: http://arxiv.org/abs/2512.02834v1
Date: Tue, 02 Dec 2025 14:42:54 GMT
Status: Translation complete
System update date: 2025-12-03 21:04:45.931464
Title: Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Title (reference translation): Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Authors: Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
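Abstract summary: We propose TACO, a test-time scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, offers significant computational benefits.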
Reference score (independently computed attention score): 78.4812458793128
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, there exist redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. VLA models integrated with TACO execute the action chunk with the maximum pseudo-count among all sampled action chunks, thereby preventing distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits compared to RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult to perform because of the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.
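Abstract (reference translation): Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale multi-modal datasets (e.g., human teleoperation, scripted policies). However, because VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes exist that are irrelevant to the success action modes of the downstream task. Specifically, after supervised finetuning of pre-trained VLAs, we observe a critical inference-time fragility across different sampled noises. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream task dataset. We therefore propose TACO, a test-time scaling (TTS) framework. VLA models integrated with TACO execute the action with the maximum pseudo-count among all sampled action chunks, which prevents distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only at inference time. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, has a notable computational advantage over RL updates. Extensive experiments on simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform show that it significantly improves inference stability and success rates in downstream-task adaptation.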
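The selection rule described in the abstract (sample several action chunks, score each with a pseudo-count verifier, execute the highest-scoring one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify how the pseudo-count estimator is built, so `pseudo_count` uses a Gaussian-kernel soft count over demonstration chunks as a hypothetical stand-in, and `sample_candidates`, `select_chunk`, `toy_policy`, and the `bandwidth` parameter are invented names, not the authors' API.

```python
import math
import random


def pseudo_count(chunk, demo_chunks, bandwidth=0.5):
    """Hypothetical stand-in for a pseudo-count estimator.

    Returns a soft count of how many demonstration chunks lie near `chunk`,
    using a Gaussian kernel. The paper's actual estimator may differ.
    """
    total = 0.0
    for demo in demo_chunks:
        sq_dist = sum((a - b) ** 2 for a, b in zip(chunk, demo))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total


def sample_candidates(policy, observation, num_samples, rng):
    """Draw several action chunks from a stochastic policy (one per noise sample)."""
    return [policy(observation, rng) for _ in range(num_samples)]


def select_chunk(candidates, demo_chunks, bandwidth=0.5):
    """Pick the candidate with the maximum pseudo-count; no gradients are involved."""
    scores = [pseudo_count(c, demo_chunks, bandwidth) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy "success mode" demonstrations: flattened chunks clustered near zero.
    demos = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]

    def toy_policy(observation, rng):
        # Assumed bimodal policy: one mode near the demos, one redundant mode.
        center = 0.0 if rng.random() < 0.5 else 3.0
        return [center + rng.gauss(0.0, 0.1) for _ in range(3)]

    candidates = sample_candidates(toy_policy, None, 8, rng)
    chunk, scores = select_chunk(candidates, demos)
    print("selected chunk:", chunk)
```

Because the verifier only ranks samples that the policy already produced, the policy weights stay untouched, which is the sense in which the constraint applies only during inference.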


Related papers