Megapixel Image Generation with Step-Unrolled Denoising Autoencoders
- URL: http://arxiv.org/abs/2206.12351v1
- Date: Fri, 24 Jun 2022 15:47:42 GMT
- Title: Megapixel Image Generation with Step-Unrolled Denoising Autoencoders
- Authors: Alex F. McKinney, Chris G. Willcocks
- Abstract summary: We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly (2-4 days).
- Score: 5.145313322824774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An ongoing trend in generative modelling research has been to push sample
resolutions higher whilst simultaneously reducing computational requirements
for training and sampling. We aim to push this trend further via the
combination of techniques - each component representing the current pinnacle of
efficiency in its respective area. These include vector-quantized GAN
(VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy -
but perceptually insignificant - compression; hourglass transformers, a highly
scalable self-attention model; and step-unrolled denoising autoencoders
(SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our
method highlights weaknesses in the original formulation of hourglass
transformers when applied to multidimensional data. In light of this, we
propose modifications to the resampling mechanism, applicable in any task
applying hierarchical transformers to multidimensional data. Additionally, we
demonstrate the scalability of SUNDAE to long sequence lengths - four times
longer than prior work. Our proposed framework scales to high resolutions
($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained
model produces diverse and realistic megapixel samples in approximately 2
seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is
flexible: supporting an arbitrary number of sampling steps, sample-wise
self-stopping, self-correction capabilities, conditional generation, and a NAR
formulation that allows for arbitrary inpainting masks. We obtain FID scores of
10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling
steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
Related papers
- FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
We propose FlowDCN, a purely convolution-based generative model that can efficiently generate high-quality images at arbitrary resolutions.
FlowDCN achieves the state-of-the-art 4.30 sFID on the $256 \times 256$ ImageNet benchmark and comparable resolution extrapolation results.
We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
arXiv Detail & Related papers (2024-10-30T02:48:50Z) - Parallel Sampling of Diffusion Models [76.3124029406809]
Diffusion models are powerful generative models but suffer from slow sampling.
We present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel.
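The "denoising multiple steps in parallel" idea rests on fixed-point (Picard) iteration over the whole sampling trajectory; a minimal numpy sketch follows, with a generic `drift` function standing in for the learned probability-flow drift.

```python
import numpy as np

def picard_sample(drift, x0, ts, iters=20):
    """Picard iteration over a whole ODE trajectory (illustrative sketch).

    Iterates x_i = x_0 + sum_{j<i} drift(x_j, t_j) * (t_{j+1} - t_j); every
    drift evaluation within one sweep is independent of the others, so in a
    ParaDiGMS-style sampler they become a single batched network call.
    """
    n, dt = len(ts), np.diff(ts)
    xs = np.tile(x0, (n, 1)).astype(float)       # initial guess: constant path
    for _ in range(iters):
        d = np.stack([drift(xs[j], ts[j]) for j in range(n - 1)])  # parallelisable
        xs[1:] = x0 + np.cumsum(d * dt[:, None], axis=0)
    return xs[-1]
```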
arXiv Detail & Related papers (2023-05-25T17:59:42Z) - Preconditioned Score-based Generative Models [49.88840603798831]
An intuitive acceleration method is to reduce the number of sampling iterations, which, however, causes severe performance degradation.
We propose a model-agnostic preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate this problem.
PDS alters the sampling process of a vanilla SGM at marginal extra computation cost, and without model retraining.
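The paper's specific preconditioner is not reproduced here, but the general mechanism - rescaling both the score and the injected noise with one fixed positive-definite matrix, which leaves the stationary distribution unchanged - can be sketched as a generic preconditioned Langevin update; `M` and `score` are illustrative placeholders.

```python
import numpy as np

def preconditioned_langevin_step(x, score, step_size, M):
    """One Langevin update with a fixed preconditioner M (generic sketch).

    A vanilla SGM step is the special case M = I; choosing M to equalise
    convergence across coordinates is the idea behind PDS-style sampling,
    at marginal extra cost and with no model retraining.
    """
    L = np.linalg.cholesky(M)          # M = L L^T, so L @ noise has covariance M
    noise = np.random.randn(*x.shape)
    return x + step_size * M @ score(x) + np.sqrt(2.0 * step_size) * L @ noise
```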
arXiv Detail & Related papers (2023-02-13T16:30:53Z) - Accelerating Large Language Model Decoding with Speculative Sampling [9.851546623666588]
Speculative sampling is an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call.
We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup.
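A minimal sketch of the acceptance loop follows, assuming hypothetical `draft_sample`, `draft_logp`, and `target_logp` callables that return per-position log-probabilities; resampling from the residual distribution on rejection is noted but omitted for brevity.

```python
import math
import random

def speculative_step(target_logp, draft_logp, draft_sample, prefix, k=4):
    """One round of speculative sampling (illustrative sketch).

    The cheap draft model proposes k tokens autoregressively; the large
    target model scores the whole block in a single call, and token i is
    accepted with probability min(1, p_target(i) / p_draft(i)), which
    preserves the target model's output distribution.
    """
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_sample(proposal))      # k cheap draft calls
    p = target_logp(proposal)                        # one expensive target call
    q = draft_logp(proposal)
    out = list(prefix)
    for i in range(len(prefix), len(proposal)):
        if random.random() < min(1.0, math.exp(p[i] - q[i])):
            out.append(proposal[i])
        else:
            break  # rejected: resample from the normalised residual (p - q)+ (omitted)
    return out
```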
arXiv Detail & Related papers (2023-02-02T18:44:11Z) - Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation, i.e., de-mixing an input signal into its constituent sources, without requiring additional gradient-based optimization or modification of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
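The "frequency counts over latent sums" likelihood admits a compact per-token sketch; `prior1`, `prior2`, and `counts` below are illustrative stand-ins for the AR priors and the count table, not the paper's API.

```python
import numpy as np

def separation_posterior(prior1, prior2, counts, m):
    """Posterior over a pair of source tokens given one mixture token (sketch).

    prior1, prior2: (V,) next-token probabilities from each source's AR prior.
    counts: (V, V, V) frequency table approximating p(mixture | z1, z2),
    built by counting which mixture token each latent sum produced.
    """
    lik = counts[:, :, m] / np.clip(counts.sum(axis=-1), 1, None)  # (V, V)
    post = prior1[:, None] * prior2[None, :] * lik                 # Bayes rule
    return post / post.sum()
```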
arXiv Detail & Related papers (2023-01-09T17:32:00Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model to directly predict clean data, avoiding the significant quality degradation that otherwise accompanies accelerated sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
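"Directly predicting clean data" corresponds to the x0-parameterisation of a diffusion step; the sketch below shows a generic deterministic (DDIM-style) update using that parameterisation, not ProDiff's exact system.

```python
import torch

@torch.no_grad()
def x0_parameterised_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic sampling step for an x0-predicting model (sketch).

    The network outputs an estimate of the clean sample x0 rather than the
    noise eps; the implied eps is recovered and reused in a DDIM-style
    update, which is what tolerates very aggressive step-count reduction.
    """
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_hat = model(x_t, t)                                   # clean-data prediction
    eps_hat = (x_t - a_t.sqrt() * x0_hat) / (1.0 - a_t).sqrt()
    return a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
```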
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes [15.881911863960774]
Recent Vector-Quantized image models have overcome the limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior.
We propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone.
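A minimal sketch of the parallel prediction loop, assuming a hypothetical bidirectional `model` over a masked token sequence:

```python
import torch

@torch.no_grad()
def absorbing_diffusion_sample(model, seq_len, vocab, mask_id, steps=50):
    """Parallel token prediction from a discrete absorbing prior (sketch).

    Generation starts fully masked; each step makes one bidirectional
    `model` call that predicts logits for every position at once, and a
    random share of the still-masked positions is committed, so the whole
    sequence needs `steps` network calls instead of one per token.
    """
    z = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for s in range(steps):
        logits = model(z)                                         # (1, L, V)
        probs = logits.softmax(dim=-1).view(-1, vocab)
        sampled = torch.multinomial(probs, 1).view(1, seq_len)
        still_masked = z == mask_id
        reveal = still_masked & (torch.rand(1, seq_len) < 1.0 / (steps - s))
        z[reveal] = sampled[reveal]
    z[z == mask_id] = sampled[z == mask_id]      # commit any remaining positions
    return z
```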
arXiv Detail & Related papers (2021-11-24T18:55:14Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to the Transformer to address the challenge of high-resolution generation.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z) - Anytime Sampling for Autoregressive Models via Ordered Autoencoding [88.01906682843618]
Autoregressive models are widely used for tasks such as image and audio generation.
The sampling process of these models does not allow interruptions and cannot adapt to real-time computational resources.
We propose a new family of autoregressive models that enables anytime sampling.
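The "ordered autoencoding" that enables anytime sampling can be sketched as nested dropout on the latent code; `encoder` and `decoder` below are illustrative placeholders, not the paper's API.

```python
import torch

def ordered_autoencoder_loss(encoder, decoder, x, code_dim):
    """Ordered (nested-dropout) autoencoder objective (illustrative sketch).

    Randomly truncating the code during training forces the earliest
    latent dimensions to carry the most information, so an AR prior over
    the code can later be stopped after any k steps ("anytime" sampling)
    and still decode to a coherent sample.
    """
    z = encoder(x)                                        # (B, code_dim)
    k = int(torch.randint(1, code_dim + 1, (1,)))         # random cut-off
    keep = (torch.arange(code_dim) < k).to(z.dtype)
    x_hat = decoder(z * keep)                             # decode first k dims only
    return ((x - x_hat) ** 2).mean()
```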
arXiv Detail & Related papers (2021-02-23T05:13:16Z)