Fugu-MT Paper Translation (Abstract): Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

Paper summary: Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

arxiv url: http://arxiv.org/abs/2604.11331v1
Date: Mon, 13 Apr 2026 11:32:36 GMT
Status: Translation complete
System update date: 2026-04-14 20:13:16.503059
Title: Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
Authors: Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu
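Abstract summary: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This paper proposes generating 3D scenes directly within an implicit 3D latent space.
Reference score (attention measure computed independently by this site): 47.551405833477986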
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from an arbitrary number of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. Then we introduce the 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring the per-trajectory diffusion sampling pass that is common in 2D-based approaches.
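
The redundancy argument in (i) and the "fixed complexity" claim can be illustrated with a rough token count. The following is a minimal sketch, not the paper's method: the patch size of 16, the 256x256 view resolution, and the reading of "1K tokens" in the title as 1024 are all assumptions, and both function names are hypothetical.

```python
# Illustrative token-count comparison. Every numeric setting here is an
# assumption made for the example; none is taken from the paper.

FIXED_LATENT_TOKENS = 1024  # assumed reading of "1K tokens" in the title


def multiview_token_count(num_views, height, width, patch_size=16):
    """Tokens used when each view is split into patches independently,
    as a 2D multi-view representation would do (assumed patchification)."""
    tokens_per_view = (height // patch_size) * (width // patch_size)
    return num_views * tokens_per_view


def fixed_latent_token_count(num_views, height, width):
    """Tokens used by a view-decoupled latent of fixed size: the count
    does not depend on the number of views or on their resolution."""
    return FIXED_LATENT_TOKENS


if __name__ == "__main__":
    for num_views in (4, 16, 64):
        per_view = multiview_token_count(num_views, 256, 256)
        fixed = fixed_latent_token_count(num_views, 256, 256)
        print(f"{num_views:3d} views: 2D-style = {per_view:6d} tokens, "
              f"fixed latent = {fixed} tokens")
```

Under these assumed settings the per-view count grows linearly with the number of views, while the fixed-size latent stays constant, which is the kind of redundancy the abstract describes.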
