Related papers: Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

URL: http://arxiv.org/abs/2410.01699v1
Date: Wed, 2 Oct 2024 16:05:27 GMT
Title: Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Authors: Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu,
Abstract summary: We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
Score: 60.188309982690335
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate the auto-regressive generation and can be executed without training. However, the Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding which is crucial for visual quality and diversity in the current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate the token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing the visual quality.

Related papers

SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation [92.34355601866629]
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed.<n>We propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm.<n>SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps.
arXiv Detail & Related papers (2025-12-08T12:36:43Z)
BIGFix: Bidirectional Image Generation with Token Fixing [21.40682276355247]
We propose a method for self-correcting image generation by iteratively refining sampled tokens.<n>We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling.<n>We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
arXiv Detail & Related papers (2025-10-14T07:34:44Z)
Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation [110.28291466364784]
Speculative Jacobi-Denoising Decoding (SJD2) is a framework that incorporates the denoising process into Jacobi to enable parallel token generation in autoregressive models.<n>Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings.
arXiv Detail & Related papers (2025-10-10T04:30:45Z)
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
Fast Autoregressive Video Generation with Diagonal Decoding [34.90521536645348]
Diagonal Decoding (DiagD) is a training-free inference acceleration algorithm for autoregressively pre-trained models. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame. DiagD achieves up to $10times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
arXiv Detail & Related papers (2025-03-18T09:42:55Z)
Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (textbfFAR) paradigm and instantiate FAR with the continuous tokenizer. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation. Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z)
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens [5.949779668853557]
ResGen is an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. We validate the efficacy and generalizability of the proposed method on two challenging tasks: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. As we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
arXiv Detail & Related papers (2024-12-13T15:31:17Z)
Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation. RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss. We show that MMAR demonstrates much more superior performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by $mathbf1.75times$ and $mathbf1.76times$, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs) We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model. We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel. Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z)
Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models [52.29800567587504]
We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information. TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts. We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
arXiv Detail & Related papers (2023-04-04T03:52:49Z)
Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation. The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner. Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration. We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC) to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL) Our NAIC model achieves a performance comparable to state-of-the-art autoregressive models, while brings 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.