3D-aware Image Generation and Editing with Multi-modal Conditions
- URL: http://arxiv.org/abs/2403.06470v1
- Date: Mon, 11 Mar 2024 07:10:37 GMT
- Title: 3D-aware Image Generation and Editing with Multi-modal Conditions
- Authors: Bo Li, Yi-ke Li, Zhi-fen He, Bin Liu, and Yu-Kun Lai
- Abstract summary: 3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision.
We propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs.
Our method can generate diverse images from distinct noise inputs, edit attributes through a text description, and conduct style transfer given a reference RGB image.
- Score: 6.444512435220748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D-consistent image generation from a single 2D semantic label is an
important and challenging research topic in computer graphics and computer
vision. Although some related works have made great progress in this field,
most of the existing methods suffer from poor disentanglement performance of
shape and appearance, and lack multi-modal control. In this paper, we propose a
novel end-to-end 3D-aware image generation and editing model incorporating
multiple types of conditional inputs, including pure noise, text and reference
image. On the one hand, we dive into the latent space of 3D Generative
Adversarial Networks (GANs) and propose a novel disentanglement strategy to
separate appearance features from shape features during the generation process.
On the other hand, we propose a unified framework for flexible image generation
and editing tasks with multi-modal conditions. Our method can generate diverse
images from distinct noise inputs, edit attributes through a text description, and
conduct style transfer given a reference RGB image. Extensive experiments
demonstrate that the proposed method outperforms alternative approaches both
qualitatively and quantitatively on image generation and editing.
Related papers
- Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation [48.595946437886774]
We build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt.
Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation.
arXiv Detail & Related papers (2024-04-26T13:55:39Z)
- MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text [52.296914125558864]
The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications.
Previous studies required significant effort to realize the desired scene, owing to limited control conditions.
We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts.
arXiv Detail & Related papers (2024-03-30T12:50:25Z)
- Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation [12.693847842218604]
We introduce a novel 3D customization method, dubbed Make-Your-3D, that can personalize high-fidelity and consistent 3D content within 5 minutes.
Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject.
Our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in the subject image.
arXiv Detail & Related papers (2024-03-14T17:57:04Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
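The real/fake assignment above amounts to a standard GAN discriminator objective. A minimal numpy sketch, assuming the usual non-saturating loss (the discriminator network itself is omitted; `logits_real` and `logits_fake` are placeholder scores standing in for its outputs on synthesized multi-view images and on 3D-model renderings, respectively):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(logits_real, logits_fake):
    """Standard GAN discriminator loss. In IT3D's setup, `logits_real`
    would come from diffusion-synthesized multi-view images and
    `logits_fake` from renderings of the 3D model being optimized."""
    loss_real = -np.log(sigmoid(logits_real) + 1e-8).mean()
    loss_fake = -np.log(1.0 - sigmoid(logits_fake) + 1e-8).mean()
    return loss_real + loss_fake

rng = np.random.default_rng(1)
loss = discriminator_loss(rng.standard_normal(8), rng.standard_normal(8))
```

When the discriminator cleanly separates the two sets (large positive real logits, large negative fake logits), the loss approaches zero; the 3D model is then optimized to push its renderings back toward the "real" set.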
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z)
- Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD).
We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
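SVGD, which CSD builds on, updates a set of samples jointly: each particle is attracted by a kernel-smoothed score of the target and repelled from its neighbors, which is the mechanism behind the inter-sample consistency mentioned above. A 1-D toy sketch with an RBF kernel and a standard-normal target; CSD would replace `grad_logp` with a text-to-image diffusion model's score and the particles with images:

```python
import numpy as np

def svgd_step(x, grad_logp, h=1.0, lr=0.1):
    """One SVGD update on a set of 1-D particles with an RBF kernel."""
    diff = x[:, None] - x[None, :]            # diff[i, j] = x_i - x_j
    k = np.exp(-diff**2 / (2 * h**2))         # RBF kernel matrix
    drive = k @ grad_logp(x)                  # kernel-smoothed score (attraction)
    repulse = (k * diff).sum(axis=1) / h**2   # kernel gradient (repulsion)
    return x + lr * (drive + repulse) / len(x)

# Toy target: standard normal, so grad log p(x) = -x.
rng = np.random.default_rng(0)
x = rng.uniform(3, 5, size=50)                # particles start far from the mode
for _ in range(500):
    x = svgd_step(x, lambda p: -p)
```

The repulsion term keeps the particles spread out rather than collapsing to the mode, which is exactly the property that makes the joint update useful for keeping a set of edited views or frames mutually consistent yet diverse.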
arXiv Detail & Related papers (2023-07-04T17:31:50Z)
- StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation [103.88928334431786]
We present a novel method for generating high-quality, stylized 3D avatars.
We use pre-trained image-text diffusion models for data generation and a Generative Adversarial Network (GAN)-based 3D generation network for training.
Our approach demonstrates superior performance over current state-of-the-art methods in terms of visual quality and diversity of the produced avatars.
arXiv Detail & Related papers (2023-05-30T13:09:21Z)
- DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model [18.362036050304987]
3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes.
Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model trained on one domain into models for other domains with different styles.
Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models using text-to-image diffusion models.
arXiv Detail & Related papers (2022-11-29T16:54:34Z)
- Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis [48.33860286920389]
3D-aware image synthesis aims to generate images of objects from multiple views by learning a 3D representation.
Existing approaches lack geometry constraints, hence usually fail to generate multi-view consistent images.
We propose Multi-View Consistent Generative Adversarial Networks (MVCGAN) for high-quality 3D-aware image synthesis with geometry constraints.
arXiv Detail & Related papers (2022-04-13T11:23:09Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
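The common-embedding idea behind the visual-linguistic similarity module can be sketched in a few lines. The projection matrices here are random stand-ins, whereas TediGAN learns them so that matching image-text pairs score high; the dimensions are likewise hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature and embedding dimensions.
D_IMG, D_TXT, D_COMMON = 512, 300, 128
W_img = rng.standard_normal((D_IMG, D_COMMON)) / np.sqrt(D_IMG)
W_txt = rng.standard_normal((D_TXT, D_COMMON)) / np.sqrt(D_TXT)

def embed(x, W):
    """Project a feature vector into the common space and L2-normalize it."""
    v = x @ W
    return v / np.linalg.norm(v)

def similarity(img_feat, txt_feat):
    """Cosine similarity in the shared visual-linguistic space."""
    return float(embed(img_feat, W_img) @ embed(txt_feat, W_txt))

s = similarity(rng.standard_normal(D_IMG), rng.standard_normal(D_TXT))
```

Because both modalities are normalized in the same space, the score is a cosine similarity in [-1, 1]; training would push it toward 1 for matching pairs and down for mismatched ones.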
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.