Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
- URL: http://arxiv.org/abs/2510.25739v1
- Date: Wed, 29 Oct 2025 17:43:31 GMT
- Title: Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
- Authors: Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
- Abstract summary: Speculative decoding has shown promise in accelerating text generation without compromising quality. We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
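Hawk's spatial guidance is not reproduced here, but the draft-then-verify scheme it accelerates can be illustrated with the standard speculative-sampling acceptance rule. The sketch below is a minimal, generic illustration (hypothetical toy distributions, not the paper's implementation): drafted tokens are checked against the target model's distribution, and the first rejected position is resampled from the residual distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, draft_tokens):
    """Standard speculative-sampling verification: walk the drafted
    tokens left to right, accept token x with probability
    min(1, p_target(x) / p_draft(x)); on the first rejection,
    resample from the residual max(p_target - p_draft, 0) and stop."""
    accepted = []
    for q, p, x in zip(draft_probs, target_probs, draft_tokens):
        q, p = np.asarray(q, float), np.asarray(p, float)
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(int(x))  # draft token verified by the target
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # discard the remaining draft tokens
    return accepted

# When draft and target agree exactly, every drafted token is accepted.
print(speculative_step([[0.5, 0.5]], [[0.5, 0.5]], [0]))  # -> [0]
```

The larger sampling space of image tokens (noted in the abstract) makes the ratio p_target/p_draft harder to keep close to 1, which is the alignment problem Hawk's spatial context is designed to mitigate.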
Related papers
- Enhancing Spatial Understanding in Image Generation via Reward Modeling [23.754373024995132]
We introduce a novel method that strengthens the spatial understanding of current image generation models. We build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation.
arXiv Detail & Related papers (2026-02-27T17:59:57Z) - Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z) - Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization [50.5332987313297]
We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
arXiv Detail & Related papers (2025-11-25T00:42:09Z) - Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling [14.372824543814602]
Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models thanks to their ability to generate tokens in parallel. We introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages. Experiments on ImageNet class-conditional and text-to-image generation demonstrate a 3.72x speedup on MAR-H while maintaining comparable quality.
arXiv Detail & Related papers (2025-10-20T05:22:10Z) - Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy [23.573364375818553]
This work revisits the sampling issues in current autoregressive (AR) image generation models. We identify that image tokens, unlike text tokens, exhibit lower information density and a non-uniform spatial distribution. We present an entropy-informed decoding strategy that achieves higher autoregressive generation quality with faster synthesis speed.
arXiv Detail & Related papers (2025-10-10T05:26:11Z) - Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [118.52589065972795]
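The entropy signal driving such a strategy is just the Shannon entropy of the next-token distribution. The sketch below shows a hypothetical gate (an illustration of the general idea, not the paper's exact rule): confident, low-entropy positions are decoded greedily, while high-entropy positions are deferred to stochastic sampling.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def decode_token(probs, threshold=1.0):
    """Hypothetical entropy gate: return the argmax token when the
    distribution is confident (entropy below threshold); return None
    to signal a fallback to stochastic sampling."""
    if entropy(probs) < threshold:
        return max(range(len(probs)), key=lambda i: probs[i])  # greedy
    return None  # high entropy: sample instead

print(decode_token([0.97, 0.01, 0.01, 0.01]))  # -> 0 (low entropy, greedy)
```

Because image tokens carry lower information density on average, more positions fall under the threshold, which is what makes entropy-aware decoding a source of speedup.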
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z) - One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models [65.96186414865747]
Text-to-Image (T2I) diffusion models face a trade-off between inference speed and image quality. We introduce the first Time-independent Unified Encoder (TiUE) for the student model's UNet architecture. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling.
arXiv Detail & Related papers (2025-05-28T04:23:22Z) - NFIG: Multi-Scale Autoregressive Image Generation via Frequency Ordering [47.442844594442455]
Next-Frequency Image Generation (NFIG) is a novel framework that decomposes the image generation process into multiple frequency-guided stages. NFIG aligns the generation process with the natural image structure. It does this by first generating low-frequency components, which efficiently capture global structure with significantly fewer tokens, and then progressively adding higher-frequency details.
arXiv Detail & Related papers (2025-03-10T08:59:10Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR) based on next-scale prediction. We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.57727062920458]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by 1.75x and 1.82x, as compared to greedy decoding and random sampling, respectively.
arXiv Detail & Related papers (2024-10-04T12:21:03Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improving image reconstruction quality. However, there exists a trade-off between reconstruction and generation quality with respect to token length. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD). SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness of sampling-based token decoding. Specifically, SJD enables the model to predict multiple tokens at each step and accepts tokens based on a probabilistic criterion.
arXiv Detail & Related papers (2024-10-02T16:05:27Z)
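The Jacobi-style iteration underlying such parallel decoders can be shown with a toy fixed-point loop (the "model" below is a stand-in for illustration, not the paper's sampler): a whole window of token guesses is refined in parallel each step, and iteration stops once the guesses no longer change, at which point they match what sequential decoding would have produced.

```python
def jacobi_decode(step_fn, init_tokens, max_iters=32):
    """Jacobi decoding sketch: refine all positions in parallel from
    the current guess; stop at a fixed point (guesses stop changing)."""
    tokens = list(init_tokens)
    for _ in range(max_iters):
        new_tokens = step_fn(tokens)
        if new_tokens == tokens:
            break  # fixed point: identical to sequential decoding
        tokens = new_tokens
    return tokens

# Toy greedy "model": each token is the previous token plus one.
def toy_step(tokens):
    return [0] + [t + 1 for t in tokens[:-1]]

print(jacobi_decode(toy_step, [9, 9, 9, 9]))  # -> [0, 1, 2, 3]
```

In the worst case the fixed point takes as many iterations as there are positions, but when early guesses are already correct (as speculative variants encourage), several tokens converge per model call, which is where the speedup comes from.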
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.