Fugu-MT Paper Translation (Abstract): Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

Paper abstract: Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

arxiv url: http://arxiv.org/abs/2606.03784v2
Date: Wed, 03 Jun 2026 08:29:49 GMT
Status: Translation complete
Last updated in system: 2026-06-04 17:40:41.639961
Title: Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation
Authors: Nan Sun, Yuan Zhang, Yongkun Yang, Wentao Zhao, Peiyan Li, Jun Guo, Wenxuan Song, Pengxiang Ding, Runze Suo, Yifei Su, Xin Xiao, Xinghang Li, Huaping Liu
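Abstract summary: Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control. The paper constructs the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data.
Reference score (attention score computed independently by this site): 24.465551417061494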
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution.
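The reasoning-dropout idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `drop_prob` parameter, and the flat token-list representation are all assumptions made for illustration. The abstract only states that reasoning traces are used during training and that actions are predicted without CoT decoding at inference.

```python
import random


def build_training_sequence(observation, reasoning, action, drop_prob, rng):
    """Assemble one training sequence (hypothetical helper, not from the paper).

    With probability `drop_prob` the reasoning trace is left out, so the model
    also sees examples that go from the observation straight to the action.
    """
    if rng.random() < drop_prob:
        return observation + action
    return observation + reasoning + action


def build_inference_prefix(observation):
    """Assemble the inference prefix (hypothetical helper).

    Following the abstract, no reasoning tokens are decoded at test time,
    so the prefix contains only the observation tokens.
    """
    return list(observation)


if __name__ == "__main__":
    rng = random.Random(0)
    obs = ["<img>", "pick", "up", "the", "cup"]
    cot = ["move", "gripper", "left"]   # assumed example of an action-level trace
    act = ["<a1>", "<a2>"]              # assumed discretized action tokens
    print(build_training_sequence(obs, cot, act, drop_prob=0.5, rng=rng))
    print(build_inference_prefix(obs))
```

In this sketch, training mixes sequences with and without the reasoning trace, while inference always uses the reasoning-free path. That matches the abstract's description of CoT as training-time supervision rather than a mandatory test-time prefix.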

Related papers