Fugu-MT Paper Translation (Abstract): AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Paper overview: AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.10126v1
Date: Tue, 10 Mar 2026 18:03:29 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.643458
Title: AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Authors: Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, Danda Paudel
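Abstract summary: This paper proposes a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence. The work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.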
Reference score (independently computed attention score): 36.00004339916959
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
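The frequency mismatch described in the abstract (fast control, slow perception) can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' model: the class `ToyActionExpert`, the scalar "prefix", the smoothing rule, and the `refresh_every` parameter are all hypothetical. The paper's re-anchoring mechanism is not specified in the abstract and is not implemented here; the sketch merely tracks how stale the perception prefix is at each control step and shows that the action history survives a prefix refresh.

```python
from collections import deque


class ToyActionExpert:
    """Hypothetical stand-in for an autoregressive action head."""

    def __init__(self, history_len=8):
        # Long-lived memory of past actions; never reset on new perception.
        self.history = deque(maxlen=history_len)
        self.prefix = 0.0      # scalar stand-in for a vision-language prefix
        self.prefix_step = 0   # control step at which the prefix was captured

    def refresh_prefix(self, target, step):
        # A new (slow) perception result arrives; history is kept as-is.
        self.prefix = target
        self.prefix_step = step

    def step(self, now):
        # Staleness: how many control steps old the current prefix is.
        staleness = now - self.prefix_step
        prev = self.history[-1] if self.history else 0.0
        # Toy rule: move halfway from the previous action toward the prefix.
        action = prev + 0.5 * (self.prefix - prev)
        self.history.append(action)
        return action, staleness


def run(num_steps=6, refresh_every=3):
    """Run the fast control loop, refreshing perception every few steps."""
    expert = ToyActionExpert()
    log = []
    for t in range(num_steps):
        if t % refresh_every == 0:
            expert.refresh_prefix(target=1.0, step=t)
        log.append(expert.step(t))
    return log


if __name__ == "__main__":
    for action, staleness in run():
        print(f"action={action:.6f} staleness={staleness}")
```

Because the history is never cleared, the action at the second refresh continues from the previous trajectory instead of restarting from zero, which is the behavior the abstract contrasts with reactive, chunk-based heads.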
