Fugu-MT Paper Translation (Abstract): OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Paper overview: OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

arxiv url: http://arxiv.org/abs/2603.16099v1
Date: Tue, 17 Mar 2026 03:43:37 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.092114
Title: OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder
Title (reference translation): OneWorld: Taming Scene Generation with a 3D Unified Representation Autoencoder
Authors: Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian
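Abstract summary: We propose OneWorld, which performs diffusion directly within a coherent 3D representation space. OneWorld generates high-quality 3D scenes with better cross-view consistency than state-of-the-art 2D-based methods.
Reference score (independently computed attention score): 90.8453349494245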
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
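Abstract (reference translation): Existing diffusion-based 3D scene generation methods operate mainly in 2D image/video latent spaces, which makes cross-view appearance and geometric consistency inherently difficult to maintain. To bridge this gap, we propose OneWorld, which performs diffusion directly within a coherent 3D representation space. At the core of our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. We further introduce a token-level Cross-View-Correspondence (CVC) consistency loss that explicitly enforces structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments show that OneWorld generates high-quality 3D scenes with better cross-view consistency than state-of-the-art 2D-based methods. Our code will be released at https://github.com/SensenGao/OneWorld.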

Related papers