Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them
on Images
- URL: http://arxiv.org/abs/2011.10650v2
- Date: Tue, 16 Mar 2021 18:33:19 GMT
- Title: Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them
on Images
- Authors: Rewon Child
- Abstract summary: We present a hierarchical VAE that, for the first time, generates samples quickly while outperforming the PixelCNN in log-likelihood on all natural image benchmarks.
In theory, VAEs can represent autoregressive models, as well as faster, better models if they exist, when made sufficiently deep.
- Score: 9.667538864515285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a hierarchical VAE that, for the first time, generates samples
quickly while outperforming the PixelCNN in log-likelihood on all natural image
benchmarks. We begin by observing that, in theory, VAEs can actually represent
autoregressive models, as well as faster, better models if they exist, when
made sufficiently deep. Despite this, autoregressive models have historically
outperformed VAEs in log-likelihood. We test whether insufficient depth explains this
by scaling a VAE to greater stochastic depth than previously explored and
evaluating it on CIFAR-10, ImageNet, and FFHQ. In comparison to the PixelCNN,
these very deep VAEs achieve higher likelihoods, use fewer parameters, generate
samples thousands of times faster, and are more easily applied to
high-resolution images. Qualitative studies suggest this is because the VAE
learns efficient hierarchical visual representations. We release our source
code and models at https://github.com/openai/vdvae.
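The objective behind a very deep VAE is the hierarchical ELBO: the reconstruction log-likelihood minus a KL term for each latent group, where each group's prior is conditioned on the groups above it. A minimal numpy sketch of that computation is below; the layer count, latent width, and random Gaussian statistics are purely illustrative stand-ins for what the encoder and top-down prior networks would produce, not values from the paper:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
n_groups, dim = 4, 8  # illustrative stochastic depth and latent width

# Top-down pass: in a real model, each group's prior p(z_i | z_<i) and
# posterior q(z_i | x, z_<i) come from networks; random stats stand in here.
total_kl = 0.0
for _ in range(n_groups):
    mu_q, logvar_q = rng.normal(size=dim), 0.1 * rng.normal(size=dim)
    mu_p, logvar_p = rng.normal(size=dim), 0.1 * rng.normal(size=dim)
    total_kl += gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

recon_log_prob = -123.4  # stand-in for log p(x | z) from the decoder
elbo = recon_log_prob - total_kl  # -ELBO upper-bounds the NLL in nats
print(elbo)
```

Increasing `n_groups` (the stochastic depth) is the knob the paper scales: with enough conditioned latent groups, the top-down factorization can in principle express an autoregressive model over the pixels.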
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference fine-tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across a broad set of benchmarks, we show that we can not only reduce hallucinations but also improve performance on standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model.
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- Revisiting Sparse Convolutional Model for Visual Recognition [40.726494290922204]
This paper revisits the sparse convolutional modeling for image classification.
We show that such models have equally strong empirical performance on CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2022-10-24T04:29:21Z)
- Optimizing Hierarchical Image VAEs for Sample Quality [0.0]
Hierarchical variational autoencoders (VAEs) have achieved strong density estimation on image modeling tasks, yet samples from their prior tend to look less convincing.
We attribute this to learned representations that over-emphasize compressing imperceptible details of the image.
We introduce a KL-reweighting strategy to control the amount of information in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective.
arXiv Detail & Related papers (2022-10-18T23:10:58Z)
- Efficient-VDVAE: Less is more [0.0]
We present modifications to the Very Deep VAE to make it converge up to $2.6\times$ faster.
Our models achieve comparable or better negative log-likelihood performance than current state-of-the-art models.
We empirically demonstrate that roughly $3\%$ of the hierarchical VAE's latent space dimensions is sufficient to encode most of the image information.
arXiv Detail & Related papers (2022-03-25T16:29:46Z)
- Exponentially Tilted Gaussian Prior for Variational Autoencoder [3.52359746858894]
Recent studies show that probabilistic generative models can perform poorly on out-of-distribution detection.
We propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE).
We show that our model produces high-quality image samples that are crisper than those of a standard Gaussian VAE.
arXiv Detail & Related papers (2021-11-30T18:28:19Z)
- NVAE: A Deep Hierarchical Variational Autoencoder [102.29977384039805]
We propose a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization.
We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models.
To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as $256\times 256$ pixels.
arXiv Detail & Related papers (2020-07-08T04:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.