You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
- URL: http://arxiv.org/abs/2412.06699v2
- Date: Sat, 14 Dec 2024 15:42:05 GMT
- Title: You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
- Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
- Abstract summary: We present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation.
The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data.
Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities.
- Score: 42.67300636733286
- License:
- Abstract: Recent 3D generation models typically rely on limited-scale 3D 'gold labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d
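The abstract describes the pose-free "visual condition" as a purely 2D signal obtained by adding time-dependent noise to masked video frames. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the tensor shapes, the linear noise schedule, the masking strategy, and the `visual_condition` function name are assumptions made for illustration and are not taken from the See3D implementation.

```python
# Illustrative sketch (not the authors' code): corrupting masked video
# frames with time-dependent noise so the resulting condition carries
# 2D appearance cues but no camera-pose information.
import torch

def visual_condition(frames: torch.Tensor,
                     mask: torch.Tensor,
                     t: torch.Tensor,
                     num_steps: int = 1000) -> torch.Tensor:
    """
    frames: (B, N, C, H, W) multi-view frames sampled from a video clip
    mask:   (B, N, 1, H, W) binary mask; 1 keeps a pixel visible, 0 hides it
    t:      (B,) diffusion timestep per sample, in [0, num_steps)
    Returns a masked, noise-corrupted version of the frames.
    """
    # A simple linear alpha schedule; the paper's actual schedule may differ.
    alpha = 1.0 - t.float() / num_steps            # (B,)
    alpha = alpha.view(-1, 1, 1, 1, 1)             # broadcast over N, C, H, W

    noise = torch.randn_like(frames)
    # Time-dependent corruption: more noise at larger t.
    noisy = alpha.sqrt() * frames + (1.0 - alpha).sqrt() * noise

    # Hide the masked regions so the condition never exposes exact geometry.
    return noisy * mask

# Example usage with toy tensors.
frames = torch.rand(2, 8, 3, 64, 64)                # 2 clips, 8 views each
mask = (torch.rand(2, 8, 1, 64, 64) > 0.5).float()  # random visibility mask
t = torch.randint(0, 1000, (2,))
cond = visual_condition(frames, mask, t)
print(cond.shape)  # torch.Size([2, 8, 3, 64, 64])
```

Because the condition is built purely from corrupted 2D pixels, it can guide the multi-view diffusion model with appearance cues without encoding camera poses, which is the property the abstract emphasizes for training on pose-free web videos.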
Related papers
- Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes.
We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
- Animate3D: Animating Any 3D Model with Multi-view Video Diffusion [47.05131487114018]
Animate3D is a novel framework for animating any static 3D model.
We introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage the multi-view video diffusion priors for animating 3D objects.
arXiv Detail & Related papers (2024-07-16T05:35:57Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models [20.084928490309313]
This paper presents a novel method for building scalable 3D generative models utilizing pre-trained video diffusion models.
By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model.
The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds.
arXiv Detail & Related papers (2024-03-18T17:59:12Z)
- Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [57.986512832738704]
We present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model.
Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach.
These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model.
arXiv Detail & Related papers (2024-03-14T07:39:59Z)
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- Gait Recognition in the Wild with Dense 3D Representations and A Benchmark [86.68648536257588]
Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes.
This paper aims to explore dense 3D representations for gait recognition in the wild.
We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z)