Fugu-MT Paper Translation (Abstract): GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Paper overview: GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

arxiv url: http://arxiv.org/abs/2605.12957v1
Date: Wed, 13 May 2026 03:43:02 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.793358
Title: GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
Title (reference translation): GTA: Advancing image-to-3D world generation through geometry-then-appearance video diffusion
Authors: Hanxin Zhu, Cong Wang, Peiyan Tu, Jiayi Luo, Tianyu He, Xin Jin, Zhibo Chen
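Abstract summary: GTA is a new image-to-3D world generation method that follows a Geometry-Then-Appearance paradigm. Given a single input image, GTA adopts a two-stage framework with two dedicated video diffusion models. Extensive experiments show that the proposed method consistently outperforms existing approaches in fidelity, visual quality, and geometric accuracy.
Reference score (independently computed attention measure): 29.999238067855245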
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.
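Abstract (reference translation): Recent advances in generative models and large-scale datasets have substantially advanced 3D world generation, benefiting a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. Despite this remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction while modeling the underlying geometry only to a limited extent, which leads to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, and motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method that follows a Geometry-Then-Appearance paradigm. Specifically, given a single input image, GTA improves the structural fidelity of synthesized 3D scenes by adopting a two-stage framework with two dedicated video diffusion models: the first generates coarse geometric structure from novel viewpoints, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. To further improve cross-view appearance consistency, we introduce a random latent shuffle strategy during training, together with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments show that the proposed method consistently outperforms existing approaches in fidelity, visual quality, and geometric accuracy. In addition, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, to support multiple downstream applications, and to exhibit favorable data efficiency during training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/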
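The two-stage control flow described in the abstract (geometry first, then appearance conditioned on that geometry) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `TwoStagePipeline`, the functions `toy_geometry_stage` and `toy_appearance_stage`, the string viewpoint labels, and the toy 2D-grid frames are all hypothetical stand-ins chosen for illustration. The real stages are video diffusion models whose interfaces the abstract does not specify, and the random latent shuffle and test-time scaling components are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for an image or geometry map: a 2D grid of floats.
Frame = List[List[float]]


@dataclass
class TwoStagePipeline:
    """Hypothetical wrapper showing only the geometry-then-appearance ordering."""

    # Stage 1 (assumed interface): input image + viewpoints -> one geometry frame per viewpoint.
    geometry_stage: Callable[[Frame, Sequence[str]], List[Frame]]
    # Stage 2 (assumed interface): input image + predicted geometry -> appearance frames.
    appearance_stage: Callable[[Frame, List[Frame]], List[Frame]]

    def generate(
        self, image: Frame, viewpoints: Sequence[str]
    ) -> Tuple[List[Frame], List[Frame]]:
        # Coarse geometry is produced first, one frame per requested viewpoint.
        geometry = self.geometry_stage(image, viewpoints)
        if len(geometry) != len(viewpoints):
            raise ValueError("geometry stage must return one frame per viewpoint")
        # Appearance is then synthesized conditioned on the predicted geometry.
        appearance = self.appearance_stage(image, geometry)
        return geometry, appearance


def toy_geometry_stage(image: Frame, viewpoints: Sequence[str]) -> List[Frame]:
    # Placeholder "geometry": a binary mask of bright pixels, repeated per viewpoint.
    mask = [[1.0 if px > 0.5 else 0.0 for px in row] for row in image]
    return [[row[:] for row in mask] for _ in viewpoints]


def toy_appearance_stage(image: Frame, geometry: List[Frame]) -> List[Frame]:
    # Placeholder "appearance": the input image masked by each geometry frame.
    return [
        [[px * g for px, g in zip(img_row, geo_row)]
         for img_row, geo_row in zip(image, geo)]
        for geo in geometry
    ]


pipeline = TwoStagePipeline(toy_geometry_stage, toy_appearance_stage)
geometry, appearance = pipeline.generate(
    [[0.2, 0.9], [0.7, 0.1]], ["left", "right", "up"]
)
```

The only design point the sketch is meant to convey is the dependency order: the appearance stage never runs without the geometry output, which is how the abstract describes appearance being "conditioned on the predicted geometry".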

Related papers