Fugu-MT paper translation (summary): EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

Paper summary: EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

arxiv url: http://arxiv.org/abs/2605.21862v1
Date: Thu, 21 May 2026 01:19:17 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.048741
Title: EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
Authors: Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li
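Abstract summary: Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. The authors argue for a persistent, action-updated scene state across control calls and introduce EvoScene-VLA. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation.
Reference score (independently computed attention score): 44.33368130694432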
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.
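The recurrence described in the abstract (prior from the previous chunk, correction against the current observation, then an action chunk plus a scene update that becomes the next prior) can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the paper's model: `fuse`, `decode`, `run_control`, `ChunkOutput`, the blending `gain`, and the proportional toy action rule are all hypothetical stand-ins, and plain float lists replace the VLM, the scene tokens, and the action decoder.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class ChunkOutput:
    actions: List[Vector]  # the action chunk for this control call
    scene_update: Vector   # compact scene update; becomes the next prior


def fuse(observation: Sequence[float], prior: Sequence[float], gain: float = 0.5) -> Vector:
    """Hypothetical stand-in for the VLM step: correct the prior toward the observation."""
    return [p + gain * (o - p) for o, p in zip(observation, prior)]


def decode(scene: Sequence[float], chunk_len: int = 3) -> ChunkOutput:
    """Hypothetical stand-in for the action decoder.

    Emits a chunk of identical toy actions and a scene update that rolls
    the scene state forward under those actions.
    """
    actions = [[-0.1 * s for s in scene] for _ in range(chunk_len)]
    update = list(scene)
    for action in actions:
        update = [u + a for u, a in zip(update, action)]
    return ChunkOutput(actions=actions, scene_update=update)


def run_control(observations: Sequence[Sequence[float]],
                initial_prior: Sequence[float],
                chunk_len: int = 3) -> List[ChunkOutput]:
    """Run one control call per observation, carrying the scene prior across chunks."""
    prior = list(initial_prior)
    outputs = []
    for obs in observations:
        scene = fuse(obs, prior)        # current observation + action-updated prior
        out = decode(scene, chunk_len)  # next action chunk + compact scene update
        outputs.append(out)
        prior = out.scene_update        # the update becomes the next prior
    return outputs


if __name__ == "__main__":
    for i, out in enumerate(run_control([[1.0, 2.0], [1.0, 2.0]], [0.0, 0.0])):
        print(i, out.actions[0], out.scene_update)
```

The point of the sketch is the data flow: each call starts from a state shaped by both the previous chunk's actions and the new observation, instead of from the observation alone. The training-only modules named in the abstract (Scene Predictor and Geometric Anchor) are not modeled here.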
