Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation
- URL: http://arxiv.org/abs/2404.17419v1
- Date: Fri, 26 Apr 2024 13:55:39 GMT
- Title: Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation
- Authors: Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang
- Abstract summary: We build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt.
Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation.
- Score: 48.595946437886774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using images as prompts for 3D generation demonstrates particularly strong performance compared to using text prompts alone, since images provide more intuitive guidance for the 3D generation process. In this work, we delve into the potential of using multiple image prompts, instead of a single image prompt, for 3D generation. Specifically, we build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt. Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation according to various quantitative evaluation metrics and qualitative assessments. This improvement is achieved without fine-tuning the pre-trained ImageDream multi-view diffusion model.
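The central claim is that multiple image prompts help without any fine-tuning of the pre-trained ImageDream model. The snippet below is a minimal sketch of one plausible training-free route: encode each prompt image separately and pool the resulting conditioning tokens so the frozen diffusion model still receives a single cross-attention context of the expected shape. The encoder, token count, and mean pooling are illustrative assumptions, not ImageDream's actual interface.

```python
# Minimal sketch, not the authors' code: pool conditioning tokens from several
# image prompts so a frozen image-prompt multi-view diffusion model can consume
# them unchanged. Encoder, token count, and mean pooling are assumptions.
import torch
import torch.nn as nn


class PromptPooler(nn.Module):
    """Encodes N prompt images and pools their conditioning tokens."""

    def __init__(self, embed_dim: int = 768, tokens_per_image: int = 77):
        super().__init__()
        self.tokens_per_image = tokens_per_image
        self.embed_dim = embed_dim
        # Stand-in for a frozen image encoder (e.g. a CLIP-like backbone).
        self.encoder = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),                              # (3, H, W) -> (3, 8, 8)
            nn.Flatten(start_dim=1),                              # -> (N, 192)
            nn.Linear(3 * 8 * 8, tokens_per_image * embed_dim),   # -> (N, T*D)
        )

    @torch.no_grad()
    def forward(self, prompt_images: torch.Tensor) -> torch.Tensor:
        # prompt_images: (num_prompts, 3, H, W)
        tokens = self.encoder(prompt_images)
        tokens = tokens.view(-1, self.tokens_per_image, self.embed_dim)
        # Average over the prompt dimension; the frozen diffusion model then
        # sees a single (1, T, D) cross-attention context, as with one prompt.
        return tokens.mean(dim=0, keepdim=True)


if __name__ == "__main__":
    pooler = PromptPooler()
    prompts = torch.randn(4, 3, 256, 256)   # e.g. four views of the same object
    context = pooler(prompts)
    print(context.shape)                     # torch.Size([1, 77, 768])
```

Concatenating the per-prompt token sequences instead of averaging them is another option, but it changes the context length the frozen model sees, so pooling is the more conservative choice in this sketch.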
Related papers
- Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation [22.699173137070883]
Hunyuan3D-1.0 is a two-stage approach that supports text- and image-conditioned generation.
In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB in approximately 4 seconds.
In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset.
The framework incorporates the text-to-image model Hunyuan-DiT, unifying text- and image-conditioned 3D generation; a generic sketch of this two-stage pattern appears after the related-papers list.
arXiv Detail & Related papers (2024-11-04T17:21:42Z)
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
- Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z)
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data we use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z)
- 3D-aware Image Generation and Editing with Multi-modal Conditions [6.444512435220748]
3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision.
We propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs.
Our method can generate diverse images with distinct noises, edit the attribute through a text description and conduct style transfer by giving a reference RGB image.
arXiv Detail & Related papers (2024-03-11T07:10:37Z)
- LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation [51.19871052619077]
We introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images.
We maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
arXiv Detail & Related papers (2024-02-07T17:57:03Z)
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion [61.37481051263816]
Given a single image of a 3D object, this paper proposes a method, named ConsistNet, that can generate multiple images of the same object.
Our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-10-16T12:29:29Z)
- SyncDreamer: Generating Multiview-consistent Images from a Single-view Image [59.75474518708409]
A novel diffusion model called SyncDreamer generates multiview-consistent images from a single-view image.
Experiments show that SyncDreamer generates images with high consistency across different views.
arXiv Detail & Related papers (2023-09-07T02:28:04Z)
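Several of the papers above (e.g. Hunyuan3D-1.0 and LGM) share the same two-stage pattern: a multi-view diffusion model first synthesizes a small set of RGB views, and a feed-forward model then reconstructs a 3D asset from them. The sketch below only illustrates that interface; every class, shape, and return value is a placeholder assumption, not any paper's released API.

```python
# Illustrative placeholder only: the generate-then-reconstruct interface shared
# by several papers above (e.g. Hunyuan3D-1.0, LGM). No real model weights or
# published APIs are used; classes, shapes, and outputs are assumptions.
from dataclasses import dataclass
import torch


@dataclass
class Asset3D:
    vertices: torch.Tensor   # (V, 3) placeholder geometry
    faces: torch.Tensor      # (F, 3) placeholder connectivity


class MultiViewDiffusion:
    """Stage 1: synthesize N views conditioned on an image (or text) prompt."""

    def generate_views(self, image_prompt: torch.Tensor, num_views: int = 4) -> torch.Tensor:
        # Stand-in for iterative denoising; returns (num_views, 3, H, W).
        return image_prompt.unsqueeze(0).repeat(num_views, 1, 1, 1)


class FeedForwardReconstructor:
    """Stage 2: lift the generated views to a 3D asset in a single forward pass."""

    def reconstruct(self, views: torch.Tensor) -> Asset3D:
        # Stand-in for a network that regresses geometry from the views.
        return Asset3D(
            vertices=torch.zeros(8, 3),
            faces=torch.zeros(12, 3, dtype=torch.long),
        )


def image_to_3d(image_prompt: torch.Tensor) -> Asset3D:
    views = MultiViewDiffusion().generate_views(image_prompt, num_views=4)
    return FeedForwardReconstructor().reconstruct(views)


if __name__ == "__main__":
    asset = image_to_3d(torch.randn(3, 256, 256))
    print(asset.vertices.shape, asset.faces.shape)
```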