Map2World: Segment Map Conditioned Text to 3D World Generation
Abstract Overview
Map2World is a text-conditioned 3D world generation framework that uses user-defined segment maps with arbitrary shapes and scales as spatial conditions, extending beyond the grid-based layouts of prior methods. The method builds on TRELLIS, a pretrained 3D asset generator, and introduces latent fusion across overlapping 3D windows in a shared latent space to enable world-scale generation with segment-specific text prompts. A detail enhancer network predicts finer latent representations for sub-cubes while conditioning on both the local coarse structure and adjacent regions to preserve continuity. The pipeline also includes scale-aware initial latent optimization via spectral parameterization and decoder fine-tuning for improved output quality. The framework is designed to achieve global-scale consistency and semantic alignment while generalizing across domains despite limited world-level training data.
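To make the input concrete, the following is a minimal sketch of how a segment map with per-segment prompts might be represented and scanned by overlapping windows. It is illustrative only: the 2D label grid, the names `SEGMENT_MAP`, `PROMPTS`, `sliding_windows`, and `window_prompt_weights`, and the area-fraction weighting of prompts are assumptions, not the representation or routing rule used by Map2World.

```python
from collections import Counter

# Hypothetical 2D segment map: each cell stores a segment id.
SEGMENT_MAP = [
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 2, 2, 2],
]

# Hypothetical per-segment text prompts.
PROMPTS = {0: "dense pine forest", 1: "medieval village", 2: "calm lake"}


def sliding_windows(seg_map, size, stride):
    """Yield top-left corners of the size x size windows that fit in the map."""
    rows, cols = len(seg_map), len(seg_map[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            yield r, c


def window_prompt_weights(seg_map, prompts, row, col, size):
    """Return {prompt: area fraction} for the window whose corner is (row, col).

    Area-fraction weighting is an assumption made for illustration.
    """
    counts = Counter(
        seg_map[r][c]
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    total = sum(counts.values())
    return {prompts[seg]: n / total for seg, n in counts.items()}
```

Calling `sliding_windows(SEGMENT_MAP, size=4, stride=2)` yields two windows that overlap by two columns, and `window_prompt_weights` reports how much of each window every prompt covers. Because segments are stored per cell, their shapes are not tied to the window grid.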
Novelty
The paper's main novelty is text-to-3D world generation conditioned on arbitrarily shaped segment maps, rather than on the grid-constrained layouts used by prior approaches such as SynCity. It combines multi-window latent fusion in volumetric 3D with a latent-space detail enhancer. The enhancer leverages pretrained asset-generation priors and preserves global scene coherence by conditioning on both truncated scene latents and adjacent cube latents.
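The conditioning inputs of the detail enhancer can be pictured with a small indexing sketch. It assumes that the world is split into a regular grid of sub-cubes, that "truncated scene latents" means the coarse scene latent cropped to a cube's extent, and that "adjacent" means face-adjacent. All three are assumptions made for illustration, and `cube_conditioning` is a hypothetical helper, not part of Map2World or TRELLIS.

```python
def cube_conditioning(index, grid_shape, cube_size):
    """Describe what the sub-cube at `index` would be conditioned on.

    Returns the crop bounds of the coarse scene latent that cover this cube
    and the indices of its face-adjacent neighbor cubes inside the grid.
    """
    # Six face-adjacent offsets (an assumed notion of adjacency).
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    neighbors = []
    for off in offsets:
        cand = tuple(i + o for i, o in zip(index, off))
        # Keep only neighbors that fall inside the grid of sub-cubes.
        if all(0 <= c < n for c, n in zip(cand, grid_shape)):
            neighbors.append(cand)
    # Per-axis (start, stop) bounds of this cube in coarse-latent coordinates.
    crop = tuple((i * cube_size, (i + 1) * cube_size) for i in index)
    return {"coarse_crop": crop, "neighbors": neighbors}
```

A corner cube has three in-grid neighbors and an interior cube has six, so the amount of neighbor context varies with position in the world.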
Results
In a quantitative evaluation using a GPT-based scoring protocol, Map2World outperforms SynCity on the proposed World Quality metric (7.76 vs. 7.25) and on the average GPTscore across four criteria (7.93/10 vs. 7.48/10). Ablations show that spectral-domain parameterization stabilizes scale-control optimization, enabling convergence within five steps at high learning rates. They also show that the concatenation-based detail enhancer without classifier-free guidance (CFG) achieves the best PSNR, LPIPS, and FID scores among the tested design variants.
Key Points
- Map2World conditions 3D world generation on user-defined segment maps with arbitrary region shapes, sizes, and per-segment text prompts, unlike prior grid-constrained methods.
- The method uses latent fusion over overlapping 3D windows with Gaussian-weighted velocity aggregation and spectral-domain parameterization of initial noise to improve global coherence and scale consistency.
- A latent-space detail enhancer, trained by fine-tuning only an MLP layer prepended to the frozen TRELLIS flow Transformer, improves local detail while conditioning on both coarse scene structure and adjacent cube latents to maintain inter-region continuity.
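The Gaussian-weighted velocity aggregation mentioned in the key points can be sketched in one dimension. This is a simplified illustration, not the paper's implementation: the real method operates on overlapping 3D windows. Here `predict_velocity` is a hypothetical stand-in for the flow model, and the window size, stride, and `sigma_frac` are assumed values.

```python
import math


def gaussian_weights(size, sigma_frac=0.25):
    """Gaussian weights peaking at the window center (sigma_frac is assumed)."""
    center = (size - 1) / 2.0
    sigma = max(sigma_frac * size, 1e-6)
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(size)]


def fuse_velocities(length, window_size, stride, predict_velocity):
    """Blend per-window velocity predictions into one field of `length` values.

    predict_velocity(start, size) is a hypothetical callable returning `size`
    velocity values for the window that begins at position `start`.
    """
    if window_size > length:
        raise ValueError("window_size must not exceed length")
    weights = gaussian_weights(window_size)
    accum = [0.0] * length  # weighted sum of velocities per position
    norm = [0.0] * length   # sum of weights per position
    starts = list(range(0, length - window_size + 1, stride))
    if starts[-1] != length - window_size:
        starts.append(length - window_size)  # make sure the tail is covered
    for start in starts:
        velocity = predict_velocity(start, window_size)
        for i, (w, v) in enumerate(zip(weights, velocity)):
            accum[start + i] += w * v
            norm[start + i] += w
    # Normalizing by the weight sum turns the result into a weighted average.
    return [a / n for a, n in zip(accum, norm)]
```

Because each position is a normalized weighted average, windows that agree are reproduced unchanged. Where they disagree, the prediction from the window whose center is nearest dominates, which gives a smooth transition across window boundaries.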