High Fidelity Image Synthesis With Deep VAEs In Latent Space
- URL: http://arxiv.org/abs/2303.13714v1
- Date: Thu, 23 Mar 2023 23:45:19 GMT
- Title: High Fidelity Image Synthesis With Deep VAEs In Latent Space
- Authors: Troy Luhman, Eric Luhman
- Abstract summary: We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs).
In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE.
We demonstrate the effectiveness of our two-stage approach, achieving an FID of 9.34 on the ImageNet-256 dataset, which is comparable to BigGAN.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present fast, realistic image generation on high-resolution, multimodal
datasets using hierarchical variational autoencoders (VAEs) trained on a
deterministic autoencoder's latent space. In this two-stage setup, the
autoencoder compresses the image into its semantic features, which are then
modeled with a deep VAE. With this method, the VAE avoids modeling the
fine-grained details that constitute the majority of the image's code length,
allowing it to focus on learning its structural components. We demonstrate the
effectiveness of our two-stage approach, achieving an FID of 9.34 on the
ImageNet-256 dataset, which is comparable to BigGAN. We make our implementation
available online.
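As a concrete illustration of the two-stage setup described in the abstract, the sketch below first trains a small deterministic autoencoder on pixels, then trains a single-level VAE on its frozen latents, and samples by decoding a prior draw through both decoders. All module shapes, loss weights, and names (Autoencoder, LatentVAE, train_two_stage) are assumptions made for illustration; the paper's autoencoder is larger and its second stage is a deep hierarchical VAE, not a one-level model.

```python
# Minimal sketch of the two-stage setup (assumed shapes and loss weights;
# the paper's stage-2 model is a deep *hierarchical* VAE, not one level).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Stage 1: deterministic autoencoder, 256x256 pixels <-> 32x32 latents."""
    def __init__(self, ch=64, z_ch=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, z_ch, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(z_ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

class LatentVAE(nn.Module):
    """Stage 2 stand-in: one-level VAE over the autoencoder's latent features."""
    def __init__(self, z_ch=4, h=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(z_ch, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 2 * h, 3, padding=1))
        self.dec = nn.Sequential(nn.Conv2d(h, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, z_ch, 3, padding=1))

    def forward(self, z):
        mu, logvar = self.enc(z).chunk(2, dim=1)
        zs = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return self.dec(zs), kl

def train_two_stage(batches):
    """batches: a re-iterable of (B, 3, 256, 256) image tensors."""
    ae, vae = Autoencoder(), LatentVAE()
    opt_ae = torch.optim.Adam(ae.parameters(), lr=1e-4)
    opt_vae = torch.optim.Adam(vae.parameters(), lr=1e-4)

    for x in batches:                                    # stage 1: pixels -> pixels
        loss = F.mse_loss(ae.decoder(ae.encoder(x)), x)  # (paper adds perceptual terms)
        opt_ae.zero_grad(); loss.backward(); opt_ae.step()

    ae.requires_grad_(False)                             # freeze the autoencoder
    for x in batches:                                    # stage 2: model the latents
        with torch.no_grad():
            z = ae.encoder(x)                            # semantic features only
        recon, kl = vae(z)
        loss = F.mse_loss(recon, z) + 0.1 * kl           # ELBO-style; weight assumed
        opt_vae.zero_grad(); loss.backward(); opt_vae.step()
    return ae, vae

@torch.no_grad()
def sample(ae, vae, n=4):
    z = vae.dec(torch.randn(n, 32, 32, 32))              # decode a prior draw to latents
    return ae.decoder(z)                                 # frozen decoder -> 256x256 images
```

The point of the split is visible in stage 2: the VAE never sees pixels, only the compressed features, so its capacity goes to global structure rather than the fine-grained detail that dominates the image's code length.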
Related papers
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection [13.840950434728533]
State-of-the-art Synthetic Image Detection (SID) research has produced strong evidence of the advantages of extracting features from foundation models.
We leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network.
Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement.
arXiv Detail & Related papers (2024-02-29T12:18:43Z)
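The entry above centers on reading out intermediate Transformer blocks rather than only the final embedding. A minimal, hedged sketch of that idea follows, with a generic nn.TransformerEncoder standing in for CLIP's frozen image encoder; the dimensions, mean-pooling, and classifier head are assumptions, not the paper's exact design.

```python
# Hedged sketch: pool features from every intermediate Transformer block via
# forward hooks and classify them with a lightweight trainable head.
import torch
import torch.nn as nn

d_model, num_layers = 512, 12
backbone = nn.TransformerEncoder(                        # stand-in for CLIP's ViT blocks
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=num_layers,
)
backbone.requires_grad_(False)                           # keep the backbone frozen

feats = []                                               # filled by the hooks below
for layer in backbone.layers:
    layer.register_forward_hook(lambda m, i, o: feats.append(o.mean(dim=1)))

head = nn.Sequential(                                    # lightweight classifier head
    nn.Linear(d_model * num_layers, 256), nn.ReLU(),
    nn.Linear(256, 1),                                   # real vs. synthetic logit
)

tokens = torch.randn(2, 197, d_model)                    # e.g., ViT patch tokens + CLS
feats.clear()
_ = backbone(tokens)                                     # hooks collect one pooled
logits = head(torch.cat(feats, dim=1))                   # vector per block, concatenated
print(logits.shape)                                      # -> torch.Size([2, 1])
```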
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- Soft-IntroVAE for Continuous Latent space Image Super-Resolution [12.344557879284219]
Inspired by the Variational AutoEncoder, we propose Soft-IntroVAE for continuous latent space image super-resolution (SVAE-SR).
arXiv Detail & Related papers (2023-07-18T06:54:42Z)
- A Model-data-driven Network Embedding Multidimensional Features for Tomographic SAR Imaging [5.489791364472879]
We propose a new model-data-driven network to achieve tomoSAR imaging based on multi-dimensional features.
We add two 2D processing modules, both convolutional encoder-decoder structures, to enhance multi-dimensional features of the imaging scene effectively.
Compared with the conventional CS-based FISTA method and the DL-based gamma-Net method, our proposed method performs better in terms of completeness while maintaining decent imaging accuracy.
arXiv Detail & Related papers (2022-11-28T02:01:43Z)
- Wider and Higher: Intensive Integration and Global Foreground Perception for Image Matting [44.51635913732913]
This paper reviews recent deep-learning-based matting research and motivates our wider-and-higher approach to image matting.
Image matting is essentially a pixel-wise regression, and the ideal situation is to perceive the maximum opacity from the input image.
We propose an Intensive Integration and Global Foreground Perception network (I2GFP) to integrate wider and higher feature streams.
arXiv Detail & Related papers (2022-10-13T11:34:46Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations throughout the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
- Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
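The MAE recipe summarized above reduces to three steps: mask a random 75% of patches, encode only the visible ones, and compute the reconstruction loss on the masked positions. The sketch below shows those steps with linear layers as placeholders for MAE's asymmetric ViT encoder/decoder pair.

```python
# Sketch of MAE-style random patch masking (simplified; the actual MAE uses
# an asymmetric ViT encoder/decoder and normalized pixel targets).
import torch
import torch.nn as nn

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches; return them, their indices, and a hide-mask."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    keep = torch.rand(B, N).argsort(dim=1)[:, :n_keep]   # visible patch indices
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, keep, False)                        # True = patch is hidden
    return visible, keep, mask

B, N, patch_dim = 4, 196, 16 * 16 * 3                    # 14x14 grid of 16x16 RGB patches
imgs = torch.randn(B, N, patch_dim)                      # images already patchified
visible, keep, mask = random_masking(imgs)

encoder = nn.Linear(patch_dim, 256)                      # placeholder ViT encoder
decoder = nn.Linear(256, patch_dim)                      # placeholder lightweight decoder

# Encode only the visible 25% of patches, then scatter them among mask tokens
# (zeros here) so the decoder predicts pixels at every position.
tokens = torch.zeros(B, N, 256)
tokens = tokens.scatter(1, keep.unsqueeze(-1).expand(-1, -1, 256), encoder(visible))
pred = decoder(tokens)
loss = ((pred - imgs) ** 2)[mask].mean()                 # loss on masked patches only
```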
- Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling [79.15521784128102]
We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs).
In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way.
We show that augmenting the decoder of a hierarchical VAE with spatial dependency layers considerably improves density estimation (a loose sketch follows below).
arXiv Detail & Related papers (2021-03-16T07:01:08Z)
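As a rough analogue of the "spatially coherent" computation in the SDN entry above, the layer below computes each row of a feature map conditioned on the row it has just produced, sweeping top to bottom. This is only an assumed illustration of spatial dependency; the paper's actual layer design differs.

```python
# Loose illustration of a spatial-dependency-style layer: a top-to-bottom row
# sweep where each output row depends on the previously computed row.
import torch
import torch.nn as nn

class RowDependencyLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Mixes the current input row with the previously computed output row.
        self.update = nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        prev = torch.zeros(B, C, W)                      # state carried down the map
        rows = []
        for h in range(H):                               # top-to-bottom sweep
            inp = torch.cat([x[:, :, h, :], prev], dim=1)
            prev = torch.tanh(self.update(inp))
            rows.append(prev)
        return torch.stack(rows, dim=2)                  # (B, C, H, W)

layer = RowDependencyLayer(channels=8)
out = layer(torch.randn(2, 8, 16, 16))
print(out.shape)                                         # torch.Size([2, 8, 16, 16])
```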
- Unsupervised Real Image Super-Resolution via Generative Variational AutoEncoder [47.53609520395504]
We revisit the classic example-based image super-resolution approaches and come up with a novel generative model for perceptual image super-resolution.
We propose a joint image denoising and super-resolution model via Variational AutoEncoder.
With the aid of the discriminator, an additional super-resolution subnetwork is attached to super-resolve the denoised image with photo-realistic visual quality.
arXiv Detail & Related papers (2020-04-27T13:49:36Z)
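A loose sketch of the last entry's arrangement, a super-resolution subnetwork attached after a denoiser and trained with the aid of a discriminator, is given below. The module shapes and the loss weighting are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch: denoise, super-resolve, and add an adversarial term from a
# small patch discriminator (all modules are illustrative stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

denoiser = nn.Conv2d(3, 3, 3, padding=1)                 # stand-in denoising decoder
sr_net = nn.Sequential(                                  # attached SR subnetwork (x2)
    nn.Conv2d(3, 48, 3, padding=1), nn.ReLU(),
    nn.Conv2d(48, 12, 3, padding=1), nn.PixelShuffle(2),
)
disc = nn.Sequential(                                    # patch discriminator
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
)

lr = torch.randn(2, 3, 32, 32)                           # noisy low-resolution input
hr = torch.randn(2, 3, 64, 64)                           # ground-truth high resolution

sr = sr_net(denoiser(lr))                                # denoise, then super-resolve
d_out = disc(sr)                                         # generator adversarial term
adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
loss = F.l1_loss(sr, hr) + 0.01 * adv                    # pixel + adversarial (weight assumed)
```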