Fugu-MT Paper Translation (Abstract): PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

Paper abstract: PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

arxiv url: http://arxiv.org/abs/2605.13169v1
Date: Wed, 13 May 2026 08:31:22 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.914559
Title: PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
Authors: Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen
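Abstract summary: This work studies pano-native understanding, which requires an MLLM to reason over an equirectangular projection (ERP) panorama as a continuous, observer-centered space. It introduces PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on the PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks.
Reference score (independently computed attention score): 20.19893789145407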
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

Related papers