Unleashing Transformers: Parallel Token Prediction with Discrete
Absorbing Diffusion for Fast High-Resolution Image Generation from
Vector-Quantized Codes
- URL: http://arxiv.org/abs/2111.12701v1
- Date: Wed, 24 Nov 2021 18:55:14 GMT
- Title: Unleashing Transformers: Parallel Token Prediction with Discrete
Absorbing Diffusion for Fast High-Resolution Image Generation from
Vector-Quantized Codes
- Authors: Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P. Breckon, Chris
G. Willcocks
- Abstract summary: Recent Vector-Quantized image models have overcome the limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior.
We propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone.
- Score: 15.881911863960774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whilst diffusion probabilistic models can generate high quality image
content, key limitations remain in terms of both generating high-resolution
imagery and their associated high computational requirements. Recent
Vector-Quantized image models have overcome this limitation of image resolution
but are prohibitively slow and unidirectional as they generate tokens via
element-wise autoregressive sampling from the prior. By contrast, in this paper
we propose a novel discrete diffusion probabilistic model prior which enables
parallel prediction of Vector-Quantized tokens by using an unconstrained
Transformer architecture as the backbone. During training, tokens are randomly
masked in an order-agnostic manner and the Transformer learns to predict the
original tokens. This parallelism of Vector-Quantized token prediction in turn
facilitates unconditional generation of globally consistent high-resolution and
diverse imagery at a fraction of the computational expense. In this manner, we
can generate image resolutions exceeding that of the original training set
samples whilst additionally provisioning per-image likelihood estimates (in a
departure from generative adversarial approaches). Our approach achieves
state-of-the-art results in terms of Density (LSUN Bedroom: 1.51; LSUN
Churches: 1.12; FFHQ: 1.20) and Coverage (LSUN Bedroom: 0.83; LSUN Churches:
0.73; FFHQ: 0.80), and performs competitively on FID (LSUN Bedroom: 3.64; LSUN
Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both
computation and reduced training set requirements.
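The training and sampling procedure described in the abstract lends itself to a short sketch: mask a random fraction of the VQ tokens, train the Transformer to predict the originals in parallel, then reverse the process at sampling time by progressively unmasking. The sketch below is a minimal toy version, not the authors' released implementation; the model interface, codebook size, grid size, loss weighting, and unmasking schedule are all assumptions.

```python
# Toy sketch of a discrete absorbing diffusion prior over VQ tokens.
# `model` is any bidirectional Transformer mapping (B, N) token ids to
# (B, N, V) logits; MASK is an extra "absorbed" id (assumptions).
import torch
import torch.nn.functional as F

V, MASK = 1024, 1024            # codebook size; mask id appended after it
N = 16 * 16                     # flattened latent grid size (assumed)

def absorbing_diffusion_loss(model, x0):
    """x0: (B, N) ground-truth VQ token ids."""
    B = x0.shape[0]
    t = torch.randint(1, N + 1, (B, 1))          # random timestep per image
    mask = torch.rand(B, N) < t.float() / N      # absorb ~t/N of the tokens
    xt = x0.masked_fill(mask, MASK)
    logits = model(xt)                           # parallel predictions
    # Cross-entropy on masked positions only, reweighted by 1/t
    # (a simplified ELBO-style weighting; an assumption here).
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    return (ce * mask).sum(1).div(t.squeeze(1).float()).mean()

@torch.no_grad()
def sample(model, B, steps=N):
    x = torch.full((B, N), MASK)                 # start fully absorbed
    for t in reversed(range(1, steps + 1)):
        masked = x == MASK
        probs = F.softmax(model(x), dim=-1)
        draws = torch.multinomial(probs.view(-1, V), 1).view(B, N)
        # Reveal each still-masked token with probability 1/t, so the
        # grid is fully unmasked by t = 1 (order-agnostic schedule).
        reveal = masked & (torch.rand(B, N) < 1.0 / t)
        x = torch.where(reveal, draws, x)
    return x                                     # VQ ids for the decoder
```

Because every masked position is predicted in one forward pass, each sampling step updates many tokens at once, which is where the speed-up over element-wise autoregressive sampling comes from.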
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
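As a rough illustration of the Jacobi-style idea, the sketch below drafts all remaining tokens at once, accepts each draft token with probability equal to its likelihood under the model, and resamples from the first rejection onward. The interface, names, and acceptance rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Toy sketch of probabilistic Jacobi-style parallel decoding.
# The model interface and the acceptance rule are assumptions.
import torch

@torch.no_grad()
def jacobi_decode(model, prompt, n_new, iters=32):
    """model(ids) -> (1, L, V) next-token logits; ids are (1, L) longs."""
    V = model(prompt).shape[-1]
    ids = torch.cat([prompt, torch.randint(0, V, (1, n_new))], dim=1)
    done = prompt.shape[1]                        # verified prefix length
    for _ in range(iters):
        if done == ids.shape[1]:
            break
        left = ids.shape[1] - done
        # One Jacobi iteration: score every unverified slot in parallel.
        probs = torch.softmax(model(ids)[0, done - 1 : -1], dim=-1)
        draft = ids[0, done:]                     # current parallel guesses
        # Probabilistic criterion: keep a guess with probability equal to
        # its model likelihood, preserving sampling randomness.
        keep = torch.rand(left) < probs[torch.arange(left), draft]
        n_ok = int(keep.long().cumprod(0).sum())  # longest accepted run
        # Resample the first rejected slot and refresh all later slots.
        fresh = torch.multinomial(probs, 1).squeeze(-1)
        ids[0, done + n_ok :] = fresh[n_ok:]
        done += n_ok + (1 if n_ok < left else 0)
    return ids
```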
arXiv Detail & Related papers (2024-10-02T16:05:27Z) - Traditional Classification Neural Networks are Good Generators: They are
Competitive with DDPMs and GANs [104.72108627191041]
We show that conventional neural network classifiers can generate high-quality images comparable to state-of-the-art generative models.
We propose a mask-based reconstruction module that makes gradients semantics-aware, enabling the synthesis of plausible images.
We show that our method also extends to text-to-image generation when built on image-text foundation models.
arXiv Detail & Related papers (2022-11-27T11:25:35Z) - Improved Masked Image Generation with Token-Critic [16.749458173904934]
We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
Guided by Token-Critic, a state-of-the-art generative transformer significantly improves its performance, outperforming recent diffusion models and GANs in the trade-off between generated image quality and diversity.
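The following is a minimal sketch of the kind of critic-guided loop this describes: the generator fills all masked slots in parallel, the critic scores each token, and the least plausible tokens are re-masked for the next round. The interfaces, names, and re-masking schedule are assumptions.

```python
# Toy sketch of critic-guided iterative decoding in the spirit of
# Token-Critic. Generator and critic interfaces are assumptions.
import torch

@torch.no_grad()
def critic_guided_sample(generator, critic, n_tokens, mask_id, steps=8):
    """generator(ids) -> (1, N, V) logits; critic(ids) -> (1, N) scores
    in [0, 1] estimating how 'real' each token looks."""
    ids = torch.full((1, n_tokens), mask_id)
    for step in range(steps):
        # Fill every masked slot in parallel with a sampled token.
        probs = torch.softmax(generator(ids), dim=-1)
        draws = torch.multinomial(probs[0], 1).squeeze(-1)
        ids = torch.where(ids == mask_id, draws.unsqueeze(0), ids)
        if step == steps - 1:
            break
        # Ask the critic which tokens look implausible and re-mask the
        # lowest-scoring fraction for another round of refinement.
        scores = critic(ids)[0]
        n_remask = int(n_tokens * (1 - (step + 1) / steps))
        worst = torch.topk(scores, n_remask, largest=False).indices
        ids[0, worst] = mask_id
    return ids
```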
arXiv Detail & Related papers (2022-09-09T17:57:21Z) - Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy, but perceptually insignificant, compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z) - Vector Quantized Diffusion Model for Text-to-Image Synthesis [47.09451151258849]
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results.
arXiv Detail & Related papers (2021-11-29T18:59:46Z) - High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z) - InfinityGAN: Towards Infinite-Resolution Image Synthesis [92.40782797030977]
We present InfinityGAN, a method to generate arbitrary-resolution images.
We show how it trains and infers patch-by-patch seamlessly with low computational resources.
arXiv Detail & Related papers (2021-04-08T17:59:30Z) - Spatially-Adaptive Pixelwise Networks for Fast Image Translation [57.359250882770525]
We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation.
We use pixel-wise networks; that is, each pixel is processed independently of others.
Our model is up to 18x faster than state-of-the-art baselines.
arXiv Detail & Related papers (2020-12-05T10:02:03Z) - RAIN: A Simple Approach for Robust and Accurate Image Classification
Networks [156.09526491791772]
It has been shown that the majority of existing adversarial defense methods achieve robustness at the cost of sacrificing prediction accuracy.
This paper proposes a novel preprocessing framework, which we term Robust and Accurate Image classificatioN (RAIN).
RAIN applies randomization over inputs to break the ties between the model forward prediction path and the backward gradient path, thus improving the model robustness.
We conduct extensive experiments on the STL10 and ImageNet datasets to verify the effectiveness of RAIN against various types of adversarial attacks.
arXiv Detail & Related papers (2020-04-24T02:03:56Z)