DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder
- URL: http://arxiv.org/abs/2206.00386v1
- Date: Wed, 1 Jun 2022 10:39:12 GMT
- Title: DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder
- Authors: Jie Shi, Chenfei Wu, Jian Liang, Xiang Liu, Nan Duan
- Abstract summary: We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and generates more photorealistic images specifically.
- Score: 73.1010640692609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently most successful image synthesis models are multi stage process to
combine the advantages of different methods, which always includes a VAE-like
model for faithfully reconstructing embedding to image and a prior model to
generate image embedding. At the same time, diffusion models have shown be
capacity to generate high-quality synthetic images. Our work proposes a VQ-VAE
architecture model with a diffusion decoder (DiVAE) to work as the
reconstructing component in image synthesis. We explore how to input image
embedding into diffusion model for excellent performance and find that simple
modification on diffusion's UNet can achieve it. Training on ImageNet, Our
model achieves state-of-the-art results and generates more photorealistic
images specifically. In addition, we apply the DiVAE with an Auto-regressive
generator on conditional synthesis tasks to perform more human-feeling and
detailed samples.
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with
Synthetic Images [37.29348016920314]
We present a new framework leveraging off-the-shelf generative models to generate synthetic training images.
We address class name ambiguity, lack of diversity in naive prompts, and domain shifts.
Our framework consistently enhances recognition model performance with more synthetic data.
arXiv Detail & Related papers (2023-12-04T18:35:27Z) - SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z) - Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional
Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - Semantic Image Synthesis with Semantically Coupled VQ-Model [42.19799555533789]
We conditionally synthesize the latent space from a vector quantized model (VQ-model) pre-trained to autoencode images.
We show that our model improves semantic image synthesis using autoregressive models on popular semantic image datasets ADE20k, Cityscapes and COCO-Stuff.
arXiv Detail & Related papers (2022-09-06T14:37:01Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the emphde facto Generative Adversarial Nets (GANs)
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from
Low-Dimensional Latents [26.17940552906923]
We present DiffuseVAE, a novel generative framework that integrates VAE within a diffusion model framework.
We show that the proposed model can generate high-resolution samples and exhibits quality comparable to state-of-the-art models on standard benchmarks.
arXiv Detail & Related papers (2022-01-02T06:44:23Z) - High-Resolution Image Synthesis with Latent Diffusion Models [14.786952412297808]
Training diffusion models on autoencoders allows for the first time to reach a near-optimal point between complexity reduction and detail preservation.
Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks.
arXiv Detail & Related papers (2021-12-20T18:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.