Fugu-MT Paper Translation (Abstract): LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Paper overview: LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

arxiv url: http://arxiv.org/abs/2604.28192v1
Date: Thu, 30 Apr 2026 17:59:52 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.252346
Title: LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models
Title (reference translation): LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models
Authors: Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
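Abstract summary: LaST-R1 is a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics before action execution. LAPO (Latent-to-Action Policy Optimization) improves the representation of physical world modeling and enhances robustness in interactive environments. LaST-R1 achieves a 99.8% average success rate on the LIBERO benchmark.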
Reference score (independently computed attention score): 112.82269746694004
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
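Abstract (reference translation): Vision-Language-Action (VLA) models increasingly incorporate reasoning mechanisms for complex robotic manipulation. Existing approaches, however, share a critical limitation: whether they use explicit linguistic reasoning, which suffers from latency and discretization, or more expressive continuous latent reasoning, they are largely confined to static imitation learning, which limits adaptability and generalization. Online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, but current methods optimize only the vanilla action space and bypass the underlying physical reasoning process. This paper presents LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics before action execution, together with a tailored RL post-training paradigm. Specifically, it proposes Latent-to-Action Policy Optimization (LAPO), a new RL algorithm that jointly optimizes the latent reasoning process and action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. An adaptive latent CoT mechanism further lets the policy dynamically adjust its reasoning horizon according to environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only a one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.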

Related papers