Fugu-MT Paper Translation (Abstract): Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

Paper overview: Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

arxiv url: http://arxiv.org/abs/2604.10573v1
Date: Sun, 12 Apr 2026 10:36:18 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.108402
Title: Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Title (reference translation): Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Authors: Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang
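Abstract summary: UniSplat is a feed-forward framework for learning 3D representations from unposed multi-view images. First, it introduces a dual-masking strategy that strengthens geometry induction in the encoder. Second, it develops a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies. Third, it introduces a pose-conditioned recalibration mechanism that re-projects the predicted 3D point and semantic maps onto the image plane and aligns them with the corresponding predictions.
Reference score (independently computed attention score): 81.94999489820974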
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
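Abstract (reference translation): Robust 3D representation learning forms the perceptual foundation of spatial intelligence and enables downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains difficult. Recent self-supervised methods try to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework that addresses these limitations with three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens and targeting the decoder masks at geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, which yields geometry-aware representations even for unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads: it re-projects the predicted 3D point and semantic maps onto the image plane using the estimated camera parameters and aligns them with the corresponding RGB and semantic predictions to ensure cross-task consistency, resolving geometry-semantics mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.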
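
Note: the re-projection step mentioned in the abstract can be illustrated with a standard pinhole camera model. The following is only a minimal sketch under that assumption, not the UniSplat implementation. The function names (`project_points`, `reprojection_error`), the intrinsics matrix `K`, and the pose `R`, `t` are hypothetical names chosen for illustration, and the abstract does not specify the actual camera model or alignment loss.

```python
# Minimal pinhole re-projection sketch (pure Python, no dependencies).
# Assumption: a standard pinhole camera. All names below are illustrative
# and are not taken from the UniSplat paper.

def project_points(points_world, K, R, t):
    """Project 3D world points to pixel coordinates.

    points_world: list of (X, Y, Z) tuples in world coordinates
    K: 3x3 intrinsics, e.g. [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    R: 3x3 world-to-camera rotation; t: length-3 translation
    Returns a list of (u, v) pixels, with None for points that do not
    lie in front of the camera.
    """
    pixels = []
    for p in points_world:
        # World frame -> camera frame: x_cam = R @ p + t
        x_cam = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if x_cam[2] <= 0:
            pixels.append(None)
            continue
        # Camera frame -> homogeneous pixel coordinates: K @ x_cam
        uvw = [sum(K[i][j] * x_cam[j] for j in range(3)) for i in range(3)]
        # Perspective divide
        pixels.append((uvw[0] / uvw[2], uvw[1] / uvw[2]))
    return pixels


def reprojection_error(pixels, targets):
    """Mean Euclidean pixel distance over points that projected successfully."""
    dists = []
    for px, (tu, tv) in zip(pixels, targets):
        if px is None:
            continue
        u, v = px
        dists.append(((u - tu) ** 2 + (v - tv) ** 2) ** 0.5)
    return sum(dists) / len(dists) if dists else 0.0


# Usage with an identity pose and a toy intrinsics matrix.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
pixels = project_points([(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)], K, R, t)
```

In a consistency check of the kind the abstract describes, the projected locations would be compared against predictions made directly in image space. Here `reprojection_error` stands in for that comparison with a plain pixel distance, which is an assumption made only for this example.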
