Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them
on Images
- URL: http://arxiv.org/abs/2011.10650v2
- Date: Tue, 16 Mar 2021 18:33:19 GMT
- Title: Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them
on Images
- Authors: Rewon Child
- Abstract summary: We present a hierarchical VAE that, for the first time, generates samples quickly while outperforming the PixelCNN in log-likelihood on all natural image benchmarks.
In theory, VAEs can represent autoregressive models, as well as faster, better models if they exist, when made sufficiently deep.
- Score: 9.667538864515285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a hierarchical VAE that, for the first time, generates samples
quickly while outperforming the PixelCNN in log-likelihood on all natural image
benchmarks. We begin by observing that, in theory, VAEs can actually represent
autoregressive models, as well as faster, better models if they exist, when
made sufficiently deep. Despite this, autoregressive models have historically
outperformed VAEs in log-likelihood. We test whether insufficient depth explains this
by scaling a VAE to greater stochastic depth than previously explored and
evaluating it on CIFAR-10, ImageNet, and FFHQ. In comparison to the PixelCNN,
these very deep VAEs achieve higher likelihoods, use fewer parameters, generate
samples thousands of times faster, and are more easily applied to
high-resolution images. Qualitative studies suggest this is because the VAE
learns efficient hierarchical visual representations. We release our source
code and models at https://github.com/openai/vdvae.
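The objective behind a very deep VAE is the hierarchical ELBO: the reconstruction log-likelihood minus a KL term for each latent group, where each group's prior is conditioned on the groups above it. A minimal numpy sketch of that computation is below; the layer count, latent width, and random Gaussian statistics are purely illustrative stand-ins for what the encoder and top-down prior networks would produce, not values from the paper:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
n_groups, dim = 4, 8  # illustrative stochastic depth and latent width

# Top-down pass: in a real model, each group's prior p(z_i | z_<i) and
# posterior q(z_i | x, z_<i) come from networks; random stats stand in here.
total_kl = 0.0
for _ in range(n_groups):
    mu_q, logvar_q = rng.normal(size=dim), 0.1 * rng.normal(size=dim)
    mu_p, logvar_p = rng.normal(size=dim), 0.1 * rng.normal(size=dim)
    total_kl += gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

recon_log_prob = -123.4  # stand-in for log p(x | z) from the decoder
elbo = recon_log_prob - total_kl  # -ELBO upper-bounds the NLL in nats
print(elbo)
```

Increasing `n_groups` (the stochastic depth) is the knob the paper scales: with enough conditioned latent groups, the top-down factorization can in principle express an autoregressive model over the pixels.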
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference fine-tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across a broad set of benchmarks, we show that we can not only reduce hallucinations but also improve performance on standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model.
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- Revisiting Sparse Convolutional Model for Visual Recognition [40.726494290922204]
This paper revisits the sparse convolutional modeling for image classification.
We show that such models have equally strong empirical performance on CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2022-10-24T04:29:21Z)
- Optimizing Hierarchical Image VAEs for Sample Quality [0.0]
Hierarchical variational autoencoders (VAEs) have achieved strong density estimation on image modeling tasks, yet samples from their prior tend to look less convincing.
We attribute this to learned representations that over-emphasize compressing imperceptible details of the image.
We introduce a KL-reweighting strategy to control the amount of information in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective.
arXiv Detail & Related papers (2022-10-18T23:10:58Z)
- Efficient-VDVAE: Less is more [0.0]
We present modifications to the Very Deep VAE to make it converge up to $2.6\times$ faster.
Our models achieve comparable or better negative log-likelihood performance than current state-of-the-art models.
We empirically demonstrate that roughly $3\%$ of the hierarchical VAE's latent space dimensions is sufficient to encode most of the image information.
arXiv Detail & Related papers (2022-03-25T16:29:46Z)
- Exponentially Tilted Gaussian Prior for Variational Autoencoder [3.52359746858894]
Recent studies show that probabilistic generative models can perform poorly on out-of-distribution detection.
We propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE).
We show that our model produces high-quality image samples that are crisper than those of a standard Gaussian VAE.
arXiv Detail & Related papers (2021-11-30T18:28:19Z)
- NVAE: A Deep Hierarchical Variational Autoencoder [102.29977384039805]
We propose a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization.
We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models.
To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as $256\times 256$ pixels.
arXiv Detail & Related papers (2020-07-08T04:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.