3D-aware Image Generation and Editing with Multi-modal Conditions
- URL: http://arxiv.org/abs/2403.06470v1
- Date: Mon, 11 Mar 2024 07:10:37 GMT
- Title: 3D-aware Image Generation and Editing with Multi-modal Conditions
- Authors: Bo Li, Yi-ke Li, Zhi-fen He, Bin Liu, and Yu-Kun Lai
- Abstract summary: 3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision.
We propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs.
Our method can generate diverse images from distinct noise inputs, edit attributes through a text description, and conduct style transfer given a reference RGB image.
- Score: 6.444512435220748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D-consistent image generation from a single 2D semantic label is an
important and challenging research topic in computer graphics and computer
vision. Although some related works have made great progress in this field,
most existing methods suffer from poor disentanglement of shape and appearance
and lack multi-modal control. In this paper, we propose a
novel end-to-end 3D-aware image generation and editing model incorporating
multiple types of conditional inputs, including pure noise, text and reference
image. On the one hand, we dive into the latent space of 3D Generative
Adversarial Networks (GANs) and propose a novel disentanglement strategy to
separate appearance features from shape features during the generation process.
On the other hand, we propose a unified framework for flexible image generation
and editing tasks with multi-modal conditions. Our method can generate diverse
images from distinct noise inputs, edit attributes through a text description, and
conduct style transfer given a reference RGB image. Extensive experiments
demonstrate that the proposed method outperforms alternative approaches both
qualitatively and quantitatively on image generation and editing.
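To make the conditioning interface concrete, the sketch below shows one way a disentangled shape code (from the semantic label) and an appearance code (from noise, a text embedding, or a reference-image embedding) could be combined before being passed to a 3D-aware generator. This is a hypothetical illustration assuming PyTorch; the module and parameter names are invented here and do not reflect the authors' implementation.

```python
# Hypothetical sketch of a multi-modal conditioning interface for a
# 3D-aware generator with disentangled shape/appearance latents.
# Module names and dimensions are invented for illustration and do not
# correspond to the paper's released code.
import torch
import torch.nn as nn


class MultiModalConditioner(nn.Module):
    """Maps one of {noise, text embedding, reference image embedding} to an appearance code."""

    def __init__(self, dim: int = 256, text_dim: int = 512, img_dim: int = 512):
        super().__init__()
        self.from_noise = nn.Linear(dim, dim)
        self.from_text = nn.Linear(text_dim, dim)   # e.g. a CLIP-style text embedding
        self.from_image = nn.Linear(img_dim, dim)   # e.g. a CLIP-style image embedding

    def forward(self, noise=None, text_emb=None, image_emb=None):
        # Exactly one conditioning signal is expected; the appearance code stays
        # separate from the shape code derived from the 2D semantic label.
        if text_emb is not None:
            return self.from_text(text_emb)
        if image_emb is not None:
            return self.from_image(image_emb)
        return self.from_noise(noise)


# Usage: the shape code comes from the encoded semantic label, the appearance
# code from whichever modality the user provides; a 3D-aware GAN backbone would
# then render the image from a chosen camera pose.
cond = MultiModalConditioner()
shape_code = torch.randn(1, 256)                    # stand-in for an encoded semantic label
appearance = cond(noise=torch.randn(1, 256))        # or cond(text_emb=...), cond(image_emb=...)
latent = torch.cat([shape_code, appearance], dim=-1)
```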
Related papers
- GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space.
Our framework employs a Variational Autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information.
The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs.
arXiv Detail & Related papers (2024-11-12T18:59:32Z)
- Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation [48.595946437886774]
We build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt.
Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation.
arXiv Detail & Related papers (2024-04-26T13:55:39Z)
- MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text [52.296914125558864]
The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications.
Previous studies required significant effort to realize the desired scene, owing to limited control conditions.
We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts.
arXiv Detail & Related papers (2024-03-30T12:50:25Z)
- Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation [12.693847842218604]
We introduce a novel 3D customization method, dubbed Make-Your-3D, that can personalize high-fidelity and consistent 3D content within 5 minutes.
Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject.
Our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in the subject image.
arXiv Detail & Related papers (2024-03-14T17:57:04Z)
- ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach uses image-to-image pipelines, empowered by Latent Diffusion Models (LDMs), to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data (see the sketch at the end of this list).
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z)
- Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD).
We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
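For orientation, the sketch below shows a plain SVGD update, the rule that CSD generalizes so that multiple samples share score information through a kernel. It is a minimal NumPy illustration with an assumed RBF kernel and a toy score function, not the CSD algorithm itself.

```python
# Minimal NumPy sketch of a Stein Variational Gradient Descent (SVGD) step.
# The RBF kernel and the toy Gaussian target are assumptions made here for
# illustration only; CSD itself operates on diffusion-model scores.
import numpy as np


def rbf_kernel(x, h=1.0):
    # x: (n, d) particles. Returns kernel matrix K (n, n) and grad_{x_j} k(x_j, x_i).
    diff = x[:, None, :] - x[None, :, :]          # diff[j, i] = x_j - x_i, shape (n, n, d)
    sq_dist = (diff ** 2).sum(-1)                 # (n, n)
    K = np.exp(-sq_dist / (2 * h ** 2))
    grad_K = -diff / (h ** 2) * K[:, :, None]     # (n, n, d)
    return K, grad_K


def svgd_step(x, score_fn, step=0.1):
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    n = x.shape[0]
    K, grad_K = rbf_kernel(x)
    scores = score_fn(x)                          # (n, d), grad_x log p(x) per particle
    phi = (K[:, :, None] * scores[:, None, :] + grad_K).sum(0) / n
    return x + step * phi


# Toy usage: particles drift toward a standard Gaussian while the kernel term
# keeps them spread out (the repulsive force that encourages sample diversity).
particles = np.random.randn(8, 2) * 3.0
for _ in range(100):
    particles = svgd_step(particles, score_fn=lambda x: -x)
```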
arXiv Detail & Related papers (2023-07-04T17:31:50Z)
- Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis [48.33860286920389]
3D-aware image synthesis aims to generate images of objects from multiple views by learning a 3D representation.
Existing approaches lack geometry constraints, hence usually fail to generate multi-view consistent images.
We propose Multi-View Consistent Generative Adversarial Networks (MVCGAN) for high-quality 3D-aware image synthesis with geometry constraints.
arXiv Detail & Related papers (2022-04-13T11:23:09Z)
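As a concrete illustration of the real/fake assignment described in the IT3D entry above, the sketch below treats diffusion-synthesized multi-view images as real data and renderings of the 3D model being optimized as fake data. The patch-style discriminator and the hinge loss are assumptions made here for illustration, not the paper's actual training code.

```python
# Simplified sketch of an IT3D-style discriminator objective: synthesized
# multi-view images act as "real" data, renderings of the 3D model under
# optimization act as "fake" data. Architecture and loss are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(        # toy patch-style discriminator
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),
)

def d_loss(synthesized_views, rendered_views):
    # Hinge loss: push scores of diffusion-synthesized views up (real)
    # and scores of current 3D-model renderings down (fake).
    real_scores = discriminator(synthesized_views)
    fake_scores = discriminator(rendered_views.detach())
    return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

def g_loss(rendered_views):
    # The 3D model is updated so its renderings fool the discriminator.
    return -discriminator(rendered_views).mean()

# Toy usage with dummy 64x64 views:
synth = torch.rand(4, 3, 64, 64)      # images from the multi-view image-to-image pipeline
render = torch.rand(4, 3, 64, 64)     # renderings of the current 3D model
loss_d = d_loss(synth, render)
loss_g = g_loss(render)
```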