High Fidelity Image Synthesis With Deep VAEs In Latent Space
- URL: http://arxiv.org/abs/2303.13714v1
- Date: Thu, 23 Mar 2023 23:45:19 GMT
- Title: High Fidelity Image Synthesis With Deep VAEs In Latent Space
- Authors: Troy Luhman, Eric Luhman
- Abstract summary: We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs).
In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE.
We demonstrate the effectiveness of our two-stage approach, achieving an FID of 9.34 on the ImageNet-256 dataset, comparable to BigGAN.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present fast, realistic image generation on high-resolution, multimodal
datasets using hierarchical variational autoencoders (VAEs) trained on a
deterministic autoencoder's latent space. In this two-stage setup, the
autoencoder compresses the image into its semantic features, which are then
modeled with a deep VAE. With this method, the VAE avoids modeling the
fine-grained details that constitute the majority of the image's code length,
allowing it to focus on learning its structural components. We demonstrate the
effectiveness of our two-stage approach, achieving an FID of 9.34 on the
ImageNet-256 dataset, comparable to BigGAN. We make our implementation
available online.
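The two-stage setup described in the abstract can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: a fixed linear map stands in for the trained deterministic encoder, and a one-layer Gaussian VAE is evaluated on the resulting latent codes (reparameterization trick plus an ELBO with reconstruction and KL terms). All dimensions and weight matrices here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Stage 1: deterministic autoencoder (toy stand-in) ----
# A fixed random linear map plays the role of a trained encoder that
# compresses a flattened image into its "semantic" latent code.
IMG_DIM, LATENT_DIM, VAE_DIM = 64, 16, 8
W_enc = rng.standard_normal((IMG_DIM, LATENT_DIM)) / np.sqrt(IMG_DIM)

def compress(images):
    """Map flattened images (N, IMG_DIM) to latent codes (N, LATENT_DIM)."""
    return images @ W_enc

# ---- Stage 2: Gaussian VAE over the latent codes ----
# Toy parameters for a one-layer VAE encoder/decoder on the latents.
W_mu  = rng.standard_normal((LATENT_DIM, VAE_DIM)) * 0.1
W_lv  = rng.standard_normal((LATENT_DIM, VAE_DIM)) * 0.1
W_dec = rng.standard_normal((VAE_DIM, LATENT_DIM)) * 0.1

def vae_neg_elbo(latents):
    """Negative ELBO = reconstruction MSE + KL(q(z|x) || N(0, I))."""
    mu, logvar = latents @ W_mu, latents @ W_lv
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps          # reparameterization trick
    recon = z @ W_dec
    recon_err = np.mean(np.sum((recon - latents) ** 2, axis=1))
    kl = np.mean(0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1))
    return recon_err + kl

images = rng.standard_normal((32, IMG_DIM))      # a fake flattened batch
codes = compress(images)                          # stage 1: compress
loss = vae_neg_elbo(codes)                        # stage 2: model latents
print(codes.shape, float(loss))
```

Because the VAE only ever sees the low-dimensional codes, it spends no capacity on pixel-level detail, which is the division of labor the abstract describes.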
Related papers
- Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection [13.840950434728533]
State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models.
We leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network.
Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement.
arXiv Detail & Related papers (2024-02-29T12:18:43Z) - I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
However, it still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z) - Matryoshka Diffusion Models [41.05745850547664]
Diffusion models are the de facto approach for generating high-quality images and videos.
We introduce Matryoshka Diffusion Models, an end-to-end framework for high-resolution image and video synthesis.
We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications.
arXiv Detail & Related papers (2023-10-23T17:20:01Z) - Soft-IntroVAE for Continuous Latent space Image Super-Resolution [12.344557879284219]
Inspired by Variational AutoEncoders, we propose a Soft-IntroVAE for continuous latent space image super-resolution (SVAE-SR).
arXiv Detail & Related papers (2023-07-18T06:54:42Z) - Embracing Compact and Robust Architectures for Multi-Exposure Image Fusion [50.598654017728045]
We propose a search-based paradigm, involving self-alignment and detail repletion modules for robust multi-exposure image fusion.
By utilizing scene relighting and deformable convolutions, the self-alignment module can accurately align images despite camera movement.
We achieve state-of-the-art performance in comparison to various competitive schemes, yielding 4.02% and 29.34% PSNR improvements in general and misaligned scenarios, respectively.
arXiv Detail & Related papers (2023-05-20T17:01:52Z) - A Model-data-driven Network Embedding Multidimensional Features for Tomographic SAR Imaging [5.489791364472879]
We propose a new model-data-driven network to achieve tomoSAR imaging based on multi-dimensional features.
We add two 2D processing modules, both convolutional encoder-decoder structures, to enhance multi-dimensional features of the imaging scene effectively.
Compared with the conventional CS-based FISTA method and the DL-based gamma-Net method, our proposed method achieves better completeness while maintaining decent imaging accuracy.
arXiv Detail & Related papers (2022-11-28T02:01:43Z) - Wider and Higher: Intensive Integration and Global Foreground Perception for Image Matting [44.51635913732913]
This paper reviews recent deep-learning-based matting research and conceives our wider and higher motivation for image matting.
Image matting is essentially a pixel-wise regression, and the ideal situation is to perceive the maximum opacity from the input image.
We propose an Intensive Integration and Global Foreground Perception network (I2GFP) to integrate wider and higher feature streams.
arXiv Detail & Related papers (2022-10-13T11:34:46Z) - Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper presents a holistic goal of maintaining spatially-precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z) - Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
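The masking scheme summarized above can be sketched as follows. This is an illustrative NumPy toy, not the MAE codebase: the image is split into non-overlapping patches, a random subset is masked, the visible patches would feed the encoder, and the masked patches would serve as reconstruction targets. The patch size and 75% mask ratio are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH, MASK_RATIO = 4, 0.75
img = rng.standard_normal((32, 32))               # toy single-channel image

# Split the image into non-overlapping PATCH x PATCH patches,
# flattened to one row per patch.
n = img.shape[0] // PATCH
patches = img.reshape(n, PATCH, n, PATCH).swapaxes(1, 2).reshape(n * n, -1)

# Randomly select patches to mask (hidden from the encoder).
num_mask = int(len(patches) * MASK_RATIO)
perm = rng.permutation(len(patches))
masked_idx, visible_idx = perm[:num_mask], perm[num_mask:]

visible = patches[visible_idx]    # what the encoder would see
targets = patches[masked_idx]     # what the decoder must reconstruct

print(len(visible), len(targets))
```

Processing only the small visible subset in the encoder is what makes the approach cheap enough to scale to large models.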
arXiv Detail & Related papers (2021-11-11T18:46:40Z) - Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling [79.15521784128102]
We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs).
In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way.
We show that augmenting the decoder of a hierarchical VAE by spatial dependency layers considerably improves density estimation.
arXiv Detail & Related papers (2021-03-16T07:01:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.