Fugu-MT Paper Translation (Abstract): POMA-3D: The Point Map Way to 3D Scene Understanding

Paper abstract: POMA-3D: The Point Map Way to 3D Scene Understanding

arxiv url: http://arxiv.org/abs/2511.16567v1
Date: Thu, 20 Nov 2025 17:22:51 GMT
Status: Translation complete
Last updated in system: 2025-11-21 17:08:52.753762
Title: POMA-3D: The Point Map Way to 3D Scene Understanding
Title (reference translation): POMA-3D: Point Maps for 3D Scene Understanding
Authors: Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk
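Abstract summary: Point maps encode explicit 3D coordinates on a structured 2D grid. A view-to-scene alignment strategy is designed to transfer rich 2D priors into POMA-3D. POMA-JEPA, a joint embedding-predictive architecture, enforces geometrically consistent point map features across multiple views.
Reference score (independently computed attention score): 20.492325896478555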
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
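Abstract (reference translation): This paper introduces POMA-3D, the first self-supervised 3D representation model learned from point maps. A point map encodes explicit 3D coordinates on a structured 2D grid, which preserves global 3D geometry while staying compatible with the input format of 2D foundation models. A view-to-scene alignment strategy is designed to transfer rich 2D priors into POMA-3D. Because point maps are view-dependent with respect to a canonical space, the paper also introduces POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. In addition, it introduces ScenePoint, a point map dataset built from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all using only geometric inputs (i.e., 3D coordinates). Overall, POMA-3D explores a point map approach to 3D scene understanding, addressing the scarcity of pretrained priors and the limited data available for 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/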
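Illustrative note: the abstract describes a point map as a 2D grid whose cells hold explicit 3D coordinates. The following is a minimal sketch of that data structure only, assuming a pinhole camera and a depth image as input. It is not the authors' pipeline; the function names `depth_to_point_map` and `transform_point_map` and the intrinsics values are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
PointMap = List[List[Point]]


def depth_to_point_map(depth: Sequence[Sequence[float]],
                       fx: float, fy: float,
                       cx: float, cy: float) -> PointMap:
    """Back-project an H x W depth image into an H x W grid of (x, y, z).

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy).
    The output keeps the 2D grid layout of the image, but every cell
    stores a 3D coordinate in the camera frame.
    """
    point_map: PointMap = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


def transform_point_map(point_map: PointMap,
                        rotation: Sequence[Sequence[float]],
                        translation: Sequence[float]) -> PointMap:
    """Apply a rigid transform p' = R p + t to every cell of the grid.

    This moves a per-view point map into a shared frame; the choice of
    frame here is an assumption, not something the abstract specifies.
    """
    result: PointMap = []
    for row in point_map:
        out_row: List[Point] = []
        for p in row:
            moved = tuple(
                sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            out_row.append(moved)  # type: ignore[arg-type]
        result.append(out_row)
    return result


# Usage: a 2 x 2 depth image with every pixel 2 units away.
depth_image = [[2.0, 2.0], [2.0, 2.0]]
pm = depth_to_point_map(depth_image, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pm_shifted = transform_point_map(pm, identity, [1.0, 0.0, 0.0])
```

Because the result has the same height and width as the source image, it can be laid out like an image with three channels, which is the property the abstract highlights as compatibility with 2D foundation model inputs.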


Related papers