Fugu-MT Paper Translation (Abstract): Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Paper overview: Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2603.14811v1
Date: Mon, 16 Mar 2026 04:27:53 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:36.051603
Title: Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning
Title (reference translation): Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning
Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin
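Abstract summary: This paper introduces the Ego-to-World benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks. It proposes CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning. CoRL is shown to consistently surpass strong proprietary and open-source baselines on both reasoning and perception-grounding metrics.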
Reference score (independently computed attention score): 61.753025885751036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
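Abstract (reference translation): Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view, which is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently outperforms strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, and ablations further confirm the necessity of each CVSR component. In addition, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations.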
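Note (illustrative sketch, not from the paper): the abstract names Group-Relative Policy Optimization and a reward with three ingredients (evidence linking, cross-view entity resolution, final-answer correctness) but gives no formulas. The snippet below shows only the generic group-relative normalization step commonly associated with GRPO, where each sampled response is scored against the other responses for the same prompt. The function names, the equal-weight sum in `combined_reward`, and the use of the population standard deviation are assumptions made for illustration; they should not be read as the authors' CVSR definition.

```python
from statistics import mean, pstdev


def combined_reward(grounding, consistency, answer, weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: a weighted sum of three component scores.

    The three arguments loosely mirror the ingredients named in the abstract;
    the weighted-sum form and the weights are assumptions, not the paper's CVSR.
    """
    w_g, w_c, w_a = weights
    return w_g * grounding + w_c * consistency + w_a * answer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a response
    is judged relative to its siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to one prompt, each with made-up component scores.
    group = [
        combined_reward(1.0, 1.0, 1.0),
        combined_reward(0.5, 1.0, 0.0),
        combined_reward(0.0, 0.5, 0.0),
        combined_reward(1.0, 0.0, 1.0),
    ]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one; when every response in a group scores the same, all advantages are zero and that group contributes no learning signal.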


Related papers