3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
- URL: http://arxiv.org/abs/2507.06484v1
- Date: Wed, 09 Jul 2025 02:00:17 GMT
- Title: 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
- Authors: Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber
- Abstract summary: We propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. Our proposed framework, 3D-Generalist, trains Vision-Language-Models to generate more prompt-aligned 3D environments. We demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data.
- Score: 23.329458437342684
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
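To make the sequential decision-making framing concrete, below is a minimal Python sketch, not the paper's implementation: a VLM acts as a policy that iteratively edits a scene given the prompt and a rendering of the current state, and a self-improvement round keeps only the best-scoring rollouts as candidate fine-tuning data. All names here (VLMPolicy, SceneState, render, self_improve, the score callback) are hypothetical stand-ins; the paper's actual action space, renderer, and fine-tuning recipe are not specified in the abstract.

```python
"""Illustrative sketch only: every class and function below is a hypothetical
stand-in for components described at a high level in the abstract."""
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneState:
    """Hypothetical container for an in-progress 3D environment."""
    actions_applied: List[str] = field(default_factory=list)

    def apply(self, action: str) -> None:
        # In a real system this would edit layout, materials, lighting, or assets.
        self.actions_applied.append(action)


class VLMPolicy:
    """Stand-in for a Vision-Language Model acting as the policy."""

    def propose_action(self, prompt: str, rendered_view: bytes) -> str:
        # A real VLM would condition on the prompt and a rendering of the scene.
        return f"add_asset(matching='{prompt}')"


def render(scene: SceneState) -> bytes:
    # Placeholder for an actual graphics renderer (e.g., a game engine).
    return repr(scene.actions_applied).encode()


def build_environment(policy: VLMPolicy, prompt: str, steps: int = 5) -> SceneState:
    """Roll out the policy to build a scene, one edit at a time."""
    scene = SceneState()
    for _ in range(steps):
        view = render(scene)
        action = policy.propose_action(prompt, view)
        scene.apply(action)
    return scene


def self_improve(policy: VLMPolicy,
                 prompts: List[str],
                 score: Callable[[str, SceneState], float],
                 keep_top_fraction: float = 0.2) -> List[Tuple[str, SceneState]]:
    """One self-improvement round: keep only the rollouts judged most
    prompt-aligned, to be used as fine-tuning data for the next policy."""
    rollouts = [(p, build_environment(policy, p)) for p in prompts]
    rollouts.sort(key=lambda pr: score(*pr), reverse=True)
    k = max(1, int(len(rollouts) * keep_top_fraction))
    return rollouts[:k]


if __name__ == "__main__":
    # Toy scorer for demonstration; a real scorer would judge prompt alignment.
    demo_score = lambda prompt, scene: float(len(scene.actions_applied))
    best = self_improve(VLMPolicy(), ["a cozy reading nook", "a robotics lab"], demo_score)
    print(best)
```

In the actual framework, the retained rollouts would be fed back into fine-tuning of the VLM policy itself, closing the self-improvement loop described above.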
Related papers
- R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation [78.26308457952636]
This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome limitations in autonomous driving simulation. It enables realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects, such as shadows and consistent lighting, in real time. We show that R3D2 significantly enhances the realism of inserted assets, enabling use cases like text-to-3D asset insertion and cross-scene/dataset object transfer.
arXiv Detail & Related papers (2025-06-09T14:50:19Z)
- Automating 3D Dataset Generation with Neural Radiance Fields [0.0]
Training performant detection models requires diverse, precisely annotated, and large-scale datasets. We propose a pipeline for the automatic generation of 3D datasets for arbitrary objects. Our pipeline is fast, easy to use, and has a high degree of automation.
arXiv Detail & Related papers (2025-03-20T10:01:32Z)
- TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models [69.0220314849478]
TripoSG is a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
arXiv Detail & Related papers (2025-02-10T16:07:54Z)
- Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework [1.1280113914145702]
This research aims to design and develop a comprehensive and efficient framework for 3D segmentation tasks. The framework integrates Grounding DINO and the Segment Anything Model, augmented by enhanced 2D image rendering via 3D mesh.
arXiv Detail & Related papers (2024-12-09T07:39:39Z)
- Diffusion Models in 3D Vision: A Survey [18.805222552728225]
3D vision has become a crucial field within computer vision, powering a range of applications such as autonomous driving, robotics, augmented reality, and medical imaging. We review the state-of-the-art methods that use diffusion models for 3D visual tasks, including but not limited to 3D object generation, shape completion, point-cloud reconstruction, and scene construction. We discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining for better generalization across 3D tasks.
arXiv Detail & Related papers (2024-10-07T04:12:23Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication [50.541882834405946]
We introduce Atlas3D, an automatic and easy-to-implement text-to-3D method.
Our approach combines a novel differentiable simulation-based loss function with physically inspired regularization.
We verify Atlas3D's efficacy through extensive generation tasks and validate the resulting 3D models in both simulated and real-world environments.
arXiv Detail & Related papers (2024-05-28T18:33:18Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- GINA-3D: Learning to Generate Implicit Neural Assets in the Wild [38.51391650845503]
GINA-3D is a generative model that uses real-world driving data from camera and LiDAR sensors to create 3D implicit neural assets of diverse vehicles and pedestrians.
We construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians.
We demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
arXiv Detail & Related papers (2023-04-04T23:41:20Z)