Code2Worlds: Empowering Coding LLMs for 4D World Generation
- URL: http://arxiv.org/abs/2602.11757v1
- Date: Thu, 12 Feb 2026 09:34:28 GMT
- Title: Code2Worlds: Empowering Coding LLMs for 4D World Generation
- Authors: Yi Zhang, Yunshuang Wang, Zeyu Zhang, Hao Tang
- Abstract summary: We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. We propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. We establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code.
- Score: 14.349376975089607
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.
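To make the closed loop concrete, here is a minimal Python sketch of the refinement cycle the abstract describes. Every helper below (`postprocess_agent`, `render_preview`, `vlm_motion_critic`) is a hypothetical stand-in for illustration, not the released Code2Worlds API:

```python
# Minimal sketch of the physics-aware closed loop: the agent scripts dynamics,
# the critic inspects rendered motion, and the code is refined until accepted.
# All three helpers are hypothetical stand-ins, not the released API.

def postprocess_agent(scene_code: str, feedback: str) -> str:
    """Stand-in for the PostProcess Agent: (re)scripts dynamics code."""
    # A real agent would prompt a coding LLM with the scene code and feedback.
    return scene_code + f"\n# dynamics revised per feedback: {feedback!r}"

def render_preview(sim_code: str) -> list[str]:
    """Stand-in renderer: executes simulation code and returns frames."""
    return ["frame_0", "frame_1"]  # placeholder frames

def vlm_motion_critic(frames: list[str]) -> tuple[bool, str]:
    """Stand-in for the VLM-Motion Critic: judges dynamic fidelity."""
    # A real critic would show the frames to a VLM and parse its verdict.
    return len(frames) > 1, "objects pass through the ground plane"

def refine_simulation(scene_code: str, max_rounds: int = 3) -> str:
    """Iterate agent -> render -> critic until the motion is accepted."""
    sim_code = postprocess_agent(scene_code, feedback="initial pass")
    for _ in range(max_rounds):
        accepted, feedback = vlm_motion_critic(render_preview(sim_code))
        if accepted:
            break
        sim_code = postprocess_agent(sim_code, feedback)
    return sim_code

print(refine_simulation("# scene setup code"))
```

The essential design point is that the critic's textual feedback is fed back into the agent's next code revision, which is what closes the loop between semantics and physics.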
Related papers
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens [89.05195827071582]
SceMoS is a scene-aware motion synthesis framework. It disentangles global planning from local execution using lightweight 2D cues. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark.
arXiv Detail & Related papers (2026-02-24T02:09:12Z)
- Code2World: A GUI World Model via Renderable Code Generation [37.96080847935199]
We propose Code2World, a vision-feedback coder that simulates the next visual state via renderable code generation. Code2World-8B achieves top-performing next-UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image.
arXiv Detail & Related papers (2026-02-10T14:56:19Z)
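The entry above describes next-state prediction via renderable code. As a rough illustration, here is a minimal Python sketch of one such world-model step; `code_llm` and `render_html` are hypothetical stubs, not Code2World's actual interface:

```python
# A minimal sketch of one world-model step under the entry's description:
# action -> next-state code -> rendered frame for vision feedback.
# `code_llm` and `render_html` are hypothetical stubs, not the paper's API.

def code_llm(ui_code: str, action: str) -> str:
    """Stand-in for the coder model: emit renderable code for the next state."""
    return ui_code.replace("<!-- state -->", f"<!-- after: {action} -->")

def render_html(html: str) -> bytes:
    """Stand-in renderer; a real pipeline might screenshot a headless browser."""
    return html.encode("utf-8")  # placeholder "pixels"

def predict_next_state(ui_code: str, action: str) -> tuple[str, bytes]:
    """One step of the GUI world model: simulate the next visual state."""
    next_code = code_llm(ui_code, action)
    return next_code, render_html(next_code)

code, frame = predict_next_state("<button>Save</button><!-- state -->", "click Save")
print(code)
```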
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning [11.93789125154006]
We propose a framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized 3D point clouds, using HDBSCAN clustering to generate segmentation proposals. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference.
arXiv Detail & Related papers (2025-12-18T12:27:06Z)
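The SNOW summary above mentions HDBSCAN clustering over point clouds to form segmentation proposals. A minimal, self-contained sketch of that step on synthetic data, using scikit-learn's HDBSCAN (assumes scikit-learn >= 1.3; the shapes and thresholds are illustrative):

```python
# Sketch of turning a point cloud into segmentation proposals with HDBSCAN,
# as the SNOW summary describes. Synthetic data; requires scikit-learn >= 1.3.
import numpy as np
from sklearn.cluster import HDBSCAN

rng = np.random.default_rng(0)
# Two synthetic "objects" plus background noise as a stand-in point cloud.
obj_a = rng.normal(loc=(0, 0, 0), scale=0.1, size=(200, 3))
obj_b = rng.normal(loc=(2, 0, 0), scale=0.1, size=(200, 3))
noise = rng.uniform(-3, 3, size=(50, 3))
points = np.vstack([obj_a, obj_b, noise])

labels = HDBSCAN(min_cluster_size=20).fit_predict(points)  # -1 marks noise
proposals = {int(k): points[labels == k] for k in set(labels) if k != -1}
print(f"{len(proposals)} segmentation proposals, "
      f"{int((labels == -1).sum())} points left as noise")
```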
- 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer [40.29321632546414]
4DVGGT is the first Transformer-based feed-forward unified framework for 4D language grounding. It integrates geometric perception and language alignment within a single architecture. It can be jointly trained across multiple dynamic scenes and directly applied during inference.
arXiv Detail & Related papers (2025-12-04T18:15:27Z)
- Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding [54.859943475818234]
We present Motion4D, a novel framework that integrates 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. Our method significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis.
arXiv Detail & Related papers (2025-12-03T09:32:56Z)
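The Motion4D summary above describes a two-part iterative optimization. A toy PyTorch sketch of that pattern follows; the "fields" and quadratic loss are placeholders, not the paper's actual model:

```python
# Sketch of the two-part optimization pattern described above: sequential
# per-field updates for local consistency, then joint global refinement.
import torch

motion = torch.zeros(100, 3, requires_grad=True)     # toy motion field
semantics = torch.zeros(100, 8, requires_grad=True)  # toy semantic field
target_m, target_s = torch.randn(100, 3), torch.randn(100, 8)

def loss_fn() -> torch.Tensor:
    return ((motion - target_m) ** 2).mean() + ((semantics - target_s) ** 2).mean()

def optimize(params: list, steps: int, lr: float) -> None:
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

# Stage 1: sequential optimization, updating one field at a time.
optimize([motion], steps=50, lr=0.1)
optimize([semantics], steps=50, lr=0.1)

# Stage 2: global optimization, jointly refining all attributes.
optimize([motion, semantics], steps=50, lr=0.01)
print(f"final loss: {loss_fn().item():.4f}")
```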
- WorldGrow: Generating Infinite 3D World [75.81531067447203]
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. We propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity.
arXiv Detail & Related papers (2025-10-24T17:39:52Z)
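WorldGrow's block-inpainting idea can be illustrated with a toy growth loop: each new block is generated conditioned on already-placed neighbors. In this sketch `inpaint_block` is a hypothetical stand-in for the generative model:

```python
# Sketch of context-aware block growth as in the WorldGrow summary: new scene
# blocks are "inpainted" conditioned on already-generated neighbors.

def inpaint_block(coord, neighbors):
    """Stand-in generator: a real model would synthesize geometry/texture here."""
    return f"block{coord}|ctx={len(neighbors)}"

world = {(0, 0): "seed block"}  # start from a seed block
frontier = [(0, 1), (1, 0), (0, -1), (-1, 0)]

for coord in frontier:  # extend outward, one block at a time
    x, y = coord
    neighbors = {c: world[c]
                 for c in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
                 if c in world}
    world[coord] = inpaint_block(coord, neighbors)

print(sorted(world))
```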
- Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation [61.60600246983274]
Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation.
arXiv Detail & Related papers (2025-09-28T12:06:54Z)
- Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
arXiv Detail & Related papers (2025-04-07T08:47:36Z)
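The masked-autoencoding setup the Uni4D summary builds on can be sketched as masking space-time point tokens and reconstructing the hidden ones. The shapes and the 75% masking ratio below are illustrative assumptions, not the paper's values:

```python
# Sketch of MAE-style pretraining on a point cloud video: mask a fraction of
# space-time point tokens, then train a model to reconstruct the hidden ones.
import numpy as np

rng = np.random.default_rng(0)
T, N = 4, 256                       # frames x points per frame (toy sizes)
video = rng.normal(size=(T, N, 3))  # toy point cloud video (xyz per point)

tokens = video.reshape(T * N, 3)    # flatten into space-time tokens
mask = rng.random(T * N) < 0.75     # mask ~75% of tokens (illustrative ratio)
visible, hidden = tokens[~mask], tokens[mask]

# An encoder would see only `visible`; a decoder reconstructs `hidden`.
print(f"visible tokens: {len(visible)}, tokens to reconstruct: {len(hidden)}")
```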
- Language Conditioned Traffic Generation [37.71751991840586]
LCTGen is a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps.
It produces an initial traffic distribution, as well as the dynamics of each vehicle.
LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity.
arXiv Detail & Related papers (2023-07-16T05:10:32Z)
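As a rough illustration of the pipeline the LCTGen entry describes (map retrieval from a language description, then initial vehicle states and dynamics), here is a heavily simplified Python sketch; `score_map` and the state layout are hypothetical stand-ins, not the paper's method:

```python
# Toy sketch: pick a likely map for a language description, then emit
# initial vehicle states. Keyword overlap stands in for the learned decoder.
import random

def score_map(description: str, map_meta: dict) -> float:
    """Stand-in for learned map retrieval: keyword overlap as a proxy score."""
    return sum(word in map_meta["tags"] for word in description.split())

maps = [{"id": 0, "tags": {"intersection", "urban"}},
        {"id": 1, "tags": {"highway", "rural"}}]
description = "busy urban intersection"

best = max(maps, key=lambda m: score_map(description, m))
random.seed(0)
vehicles = [{"pos": (random.uniform(-20, 20), random.uniform(-20, 20)),
             "speed": random.uniform(0, 15)} for _ in range(4)]
print(f"map {best['id']}: {vehicles[0]}")
```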
- Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues.
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z)