Related papers: LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

URL: http://arxiv.org/abs/2509.05263v2
Date: Mon, 08 Sep 2025 17:05:47 GMT
Title: LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
Authors: Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Yenan Lin, Hao Jiang, Kang Chen, Shuang Qiu
Abstract summary: We propose a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction. LatticeWorld achieves over a $90\times$ increase in industrial production efficiency.
Score: 35.4193352348583
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a $90\times$ increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

Related papers

Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new paradigm of world model which simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z)
Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling [67.95038177144554]
We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
arXiv Detail & Related papers (2025-12-02T18:24:27Z)
WorldGen: From Text to Traversable and Interactive 3D Worlds [87.95088818329403]
We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into fully textured environments that can be immediately explored or edited within standard game engines. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.
arXiv Detail & Related papers (2025-11-20T22:13:18Z)
NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding [46.79724166827757]
We introduce NeoWorld, a framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments.
arXiv Detail & Related papers (2025-09-29T08:24:28Z)
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation [87.91642226587294]
Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data. We propose a self-distillation framework that distills the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation. Our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
arXiv Detail & Related papers (2025-09-23T17:58:01Z)
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels [30.986527559921335]
HunyuanWorld 1.0 is a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity.
arXiv Detail & Related papers (2025-07-29T13:43:35Z)
Generative AI Framework for 3D Object Generation in Augmented Reality [0.0]
This thesis integrates state-of-the-art generative AI models for real-time creation of 3D objects in augmented reality (AR) environments. The framework demonstrates applications across industries such as gaming, education, retail, and interior design. A significant contribution is democratizing 3D model creation, making advanced AI tools accessible to a broader audience.
arXiv Detail & Related papers (2025-02-21T17:01:48Z)
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI [37.47562766916571]
We introduce UnrealZoo, a collection of over 100 photo-realistic 3D virtual worlds built on Unreal Engine. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles for embodied AI research.
arXiv Detail & Related papers (2024-12-30T14:31:01Z)
GenEx: Generating an Explorable World [59.0666303068111]
We introduce GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image. GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation.
arXiv Detail & Related papers (2024-12-12T18:59:57Z)
3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
Self-supervised novel 2D view synthesis of large-scale scenes with efficient multi-scale voxel carving [77.07589573960436]
We introduce an efficient multi-scale voxel carving method to generate novel views of real scenes. Our final high-resolution output is efficiently self-trained on data automatically generated by the voxel carving module. We demonstrate the effectiveness of our method on highly complex and large-scale scenes in real environments.
arXiv Detail & Related papers (2023-06-26T13:57:05Z)
GINA-3D: Learning to Generate Implicit Neural Assets in the Wild [38.51391650845503]
GINA-3D is a generative model that uses real-world driving data from camera and LiDAR sensors to create 3D implicit neural assets of diverse vehicles and pedestrians. We construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians. We demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
arXiv Detail & Related papers (2023-04-04T23:41:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.