Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
- URL: http://arxiv.org/abs/2508.20470v1
- Date: Thu, 28 Aug 2025 06:39:41 GMT
- Title: Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
- Authors: Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan,
- Abstract summary: This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models.<n>We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input.
- Score: 44.64235988574981
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
Related papers
- VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator [69.72818094722186]
A text-to-video generator can be combined with a 3D reconstruction system as a "decoder"<n>We introduce VIST3A, a general framework that does just that.<n>We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models.
arXiv Detail & Related papers (2025-10-15T11:55:08Z) - Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors [11.156009461711639]
Generative Gaussian Splatting (GGS) is a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model.<n>We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+.
arXiv Detail & Related papers (2025-03-17T15:24:04Z) - You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale [42.67300636733286]
We present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation.<n>The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data.<n>Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities.
arXiv Detail & Related papers (2024-12-09T17:44:56Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models [20.084928490309313]
This paper presents a novel method for building scalable 3D generative models utilizing pre-trained video diffusion models.
By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model.
The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds.
arXiv Detail & Related papers (2024-03-18T17:59:12Z) - DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.