FuguReport

Map2World: Segment Map Conditioned Text to 3D World Generation

Authors: Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang, Jiaolong Yang, Kyoung Mu Lee
Affiliations: Seoul National University / Microsoft
Categories: Method / 3D Generation / 3D world generation from segment maps; Application / Immersive Content Creation / 3D immersive content and simulation; Application / Autonomous Driving Simulation / Simulation using 3D generated worlds
License: CC BY 4.0

Abstract Overview

Map2World is a text-conditioned 3D world generation framework that uses user-defined segment maps with arbitrary shapes and scales as spatial conditions, extending beyond the grid-based layouts of prior methods. The method builds on TRELLIS, a pretrained 3D asset generator, and introduces latent fusion across overlapping 3D windows in a shared latent space to enable world-scale generation with segment-specific text prompts. A detail enhancer network predicts finer latent representations for sub-cubes while conditioning on both the local coarse structure and adjacent regions to preserve continuity. The pipeline also includes scale-aware initial latent optimization via spectral parameterization and decoder fine-tuning for improved output quality. The framework is designed to achieve global-scale consistency and semantic alignment while generalizing across domains despite limited world-level training data.

Novelty

The paper's main novelty is enabling text-to-3D world generation conditioned on arbitrary-shaped segment maps, rather than the grid-constrained layouts used by prior approaches such as SynCity. It combines multi-window latent fusion in volumetric 3D with a latent-space detail enhancer that leverages pretrained asset-generation priors while preserving global scene coherence through conditioning on both truncated scene latents and adjacent cube latents.
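
To make "arbitrary-shaped segment map with per-segment prompts" concrete, here is a minimal sketch of one way such a condition could be represented. The map layout, the prompt strings, and the helper `window_prompt_weights` are all hypothetical and chosen for illustration only. This summary does not say how Map2World actually assigns prompts to generation windows, so the area-fraction weighting below is an assumption.

```python
from collections import Counter

# Hypothetical 2D segment map: each cell stores a segment id.
# The regions are deliberately non-rectangular (not a grid layout).
SEGMENT_MAP = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 2, 2, 1, 1, 1],
    [2, 2, 2, 2, 1, 1],
]

# Hypothetical per-segment text prompts.
PROMPTS = {
    0: "a pine forest",
    1: "a medieval village",
    2: "a calm lake",
}


def window_prompt_weights(seg_map, row0, col0, size):
    """Return {prompt: fraction of cells} for a square window of the map.

    Assumption: a window that straddles several segments weights each
    segment's prompt by the area it occupies inside the window.
    """
    counts = Counter()
    for r in range(row0, min(row0 + size, len(seg_map))):
        for c in range(col0, min(col0 + size, len(seg_map[0]))):
            counts[seg_map[r][c]] += 1
    total = sum(counts.values())
    return {PROMPTS[seg]: n / total for seg, n in counts.items()}


if __name__ == "__main__":
    # A window entirely inside the forest, then one on the forest/lake border.
    print(window_prompt_weights(SEGMENT_MAP, 0, 0, 2))
    print(window_prompt_weights(SEGMENT_MAP, 1, 0, 2))
```

A window that lies fully inside one region receives a single prompt, while a window on a boundary receives a mixture, which is the situation a grid-constrained layout never has to handle.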

Results

In quantitative evaluation using a GPT-based scoring protocol, Map2World outperforms SynCity on the proposed World Quality metric (7.76 vs. 7.25) and on average GPTscore across four criteria (7.93/10 vs. 7.48/10). Ablations demonstrate that spectral-domain parameterization stabilizes scale-control optimization, enabling convergence within five steps at high learning rates, and that the concatenation-based detail enhancer without CFG achieves the best PSNR, LPIPS, and FID scores among tested design variants.
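
The following sketch only illustrates what it means mechanically to optimize a noise signal through its spectral coefficients rather than its spatial values. It is not the paper's procedure: the 1D signal, the unitary DFT helpers `dft` and `idft`, the function `spectral_step`, and the quadratic loss in the usage example are all assumptions made for illustration, and the sketch does not reproduce the reported convergence behavior.

```python
import cmath
import math


def dft(x):
    """Unitary discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def idft(coeffs):
    """Inverse of dft() (also unitary)."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n))
        / math.sqrt(n)
        for t in range(n)
    ]


def spectral_step(coeffs, grad_fn, lr):
    """Take one gradient step on the spectral coefficients of a real signal.

    grad_fn maps the spatial signal to dLoss/dSignal (a list of floats).
    Because the transform is unitary, stepping each coefficient by the DFT
    of the spatial gradient with a single learning rate is equivalent to a
    plain spatial step. Per-frequency step sizes (not shown) are where the
    two parameterizations would start to differ.
    """
    signal = [v.real for v in idft(coeffs)]
    grad_spec = dft(grad_fn(signal))
    return [c - lr * g for c, g in zip(coeffs, grad_spec)]


if __name__ == "__main__":
    noise = [1.0, -2.0, 0.5, 3.0]
    coeffs = dft(noise)
    # Hypothetical loss: squared magnitude of the signal, gradient 2 * x.
    coeffs = spectral_step(coeffs, lambda x: [2.0 * v for v in x], lr=0.25)
    print([round(v.real, 6) for v in idft(coeffs)])
```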

Key Points

  1. Map2World conditions 3D world generation on user-defined segment maps with arbitrary region shapes, sizes, and per-segment text prompts, unlike prior grid-constrained methods.
  2. The method uses latent fusion over overlapping 3D windows with Gaussian-weighted velocity aggregation and spectral-domain parameterization of initial noise to improve global coherence and scale consistency.
  3. A latent-space detail enhancer, trained by fine-tuning only an MLP layer prepended to the frozen TRELLIS flow Transformer, improves local detail while conditioning on both coarse scene structure and adjacent cube latents to maintain inter-region continuity.
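
The Gaussian-weighted aggregation over overlapping windows in point 2 can be sketched as follows. This is a simplified 1D illustration under stated assumptions, not the authors' implementation: the real method operates on 3D latent windows, and the names `gaussian_weights`, `fuse_window_velocities`, `predict_velocity`, and the `sigma_frac` parameter are hypothetical.

```python
import math


def gaussian_weights(window_size, sigma_frac=0.25):
    """Gaussian weights that peak at the window centre (sigma_frac is assumed)."""
    center = (window_size - 1) / 2.0
    sigma = sigma_frac * window_size
    return [
        math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(window_size)
    ]


def fuse_window_velocities(length, window_size, stride, predict_velocity):
    """Blend per-window velocity predictions into one global velocity field.

    predict_velocity(start) stands in for the generator applied to the
    window [start, start + window_size); it returns one value per position.
    Overlapping predictions are averaged with Gaussian weights, so each
    window contributes most near its centre and least near its borders.
    """
    if window_size > length or (length - window_size) % stride != 0:
        raise ValueError("windows must tile the full length")
    weights = gaussian_weights(window_size)
    acc = [0.0] * length   # weighted sum of velocities per position
    norm = [0.0] * length  # sum of weights per position
    for start in range(0, length - window_size + 1, stride):
        velocity = predict_velocity(start)
        for i, (w, v) in enumerate(zip(weights, velocity)):
            acc[start + i] += w * v
            norm[start + i] += w
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    # Toy predictor: every window predicts its own start index everywhere,
    # so the fused field shows how neighbouring windows are blended.
    fused = fuse_window_velocities(8, 4, 2, lambda start: [float(start)] * 4)
    print([round(v, 3) for v in fused])
```

In a flow-based sampler, a fused field of this kind would be computed at every integration step, so all windows advance through one shared latent instead of being generated independently and stitched afterwards.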

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.