Fugu-MT Paper Translation (Abstract): LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

Paper summary: LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

arxiv url: http://arxiv.org/abs/2604.28192v3
Date: Thu, 07 May 2026 14:00:44 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.284948
Title: LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
Title (reference translation): LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
Authors: Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Shanghang Zhang, Pheng-Ann Heng
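Abstract summary: LaST-R1 is a novel reinforcement learning framework designed to harness "latent reasoning-before-acting" policies. LaST-R1 achieves a 99.9% average success rate on the LIBERO benchmark. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over a SOTA supervised fine-tuning approach.
Reference score (independently computed attention score): 90.86828952599147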
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robotic foundation models require reasoning over complex visual scenes to execute adaptive actions in dynamic environments. While recent studies on latent-reasoning Vision-Language-Action (VLA) models have demonstrated the capability to capture fine-grained physical dynamics, they remain predominantly confined to static imitation learning, severely limiting their adaptability and generalization. In this paper, we present LaST-R1, a novel reinforcement learning (RL) post-training framework designed to effectively harness "latent reasoning-before-acting" policies. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a core RL algorithm that jointly optimizes the latent reasoning process and the action generation. By explicitly embedding latent Chain-of-Thought (CoT) reasoning directly within the RL optimization loop, LAPO stimulates profound physical world modeling, which in turn drives robust execution in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced, allowing the policy to dynamically modulate its reasoning horizon based on diverse environment states. Experiments show that LaST-R1 achieves a near-perfect 99.9% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art (SOTA) methods. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over SOTA supervised fine-tuning approach across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
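Abstract (reference translation): Robotic foundation models must reason over complex visual scenes to execute adaptive actions in dynamic environments. Recent studies on latent-reasoning VLA models have shown the ability to capture fine-grained physical dynamics, but they remain largely confined to static imitation learning, which severely limits their adaptability and generalization. This paper presents LaST-R1, a novel reinforcement learning (RL) post-training framework designed to harness "latent reasoning-before-acting" policies. Specifically, it proposes Latent-to-Action Policy Optimization (LAPO), a core RL algorithm that jointly optimizes the latent reasoning process and action generation. By embedding latent Chain-of-Thought (CoT) reasoning directly in the RL optimization loop, LAPO stimulates deep physical world modeling and promotes robust execution in interactive environments. In addition, an adaptive latent CoT mechanism is introduced that lets the policy dynamically adjust its reasoning horizon based on diverse environment states. Experiments show that LaST-R1 achieves a 99.9% average success rate on the LIBERO benchmark with only a one-shot supervised warm-up, substantially improving convergence speed and performance over prior SOTA methods. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over a SOTA supervised fine-tuning approach across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.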

Related papers