Plug-and-Play Context Feature Reuse for Efficient Masked Generation
- URL: http://arxiv.org/abs/2505.19089v1
- Date: Sun, 25 May 2025 10:57:35 GMT
- Title: Plug-and-Play Context Feature Reuse for Efficient Masked Generation
- Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
- Abstract summary: Masked generative models (MGMs) have emerged as a powerful framework for image synthesis. We introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs.
- Score: 36.563229330549284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps that reuse feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
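To make the mechanism concrete, the following is a minimal sketch of ReCAP-style interleaved decoding, not the authors' implementation. `full_encode`, `partial_encode`, and `to_logits` are hypothetical hooks standing in for the MGM backbone (full pass), a cheap pass that computes features only for masked positions against cached context, and the output head.

```python
# Minimal sketch of ReCAP-style interleaved decoding (hypothetical interface).
import torch

@torch.no_grad()
def recap_generate(full_encode, partial_encode, to_logits,
                   tokens, mask_id, steps=16, full_every=4):
    """Iteratively unmask `tokens`, interleaving full backbone passes with
    lightweight passes that reuse cached features of decoded context tokens."""
    cached = None  # context features from the most recent full evaluation
    for step in range(steps):
        masked = tokens.eq(mask_id)                    # (B, L) bool
        if masked.sum() == 0:
            break
        if cached is None or step % full_every == 0:
            cached = full_encode(tokens)               # full evaluation, (B, L, D)
            feats = cached
        else:
            # Lightweight step: keep cached features for decoded context
            # tokens; only still-masked positions get fresh features.
            feats = cached.clone()
            feats[masked] = partial_encode(tokens, cached, masked)
        probs = to_logits(feats).softmax(-1)           # (B, L, V)
        conf, pred = probs.max(-1)
        conf = conf.masked_fill(~masked, -1.0)         # never re-decode context
        k = max(1, masked[0].sum().item() // (steps - step))
        keep = conf.topk(k, dim=-1).indices            # most confident positions
        tokens.scatter_(1, keep, pred.gather(1, keep))
    return tokens

# Toy stand-ins just to exercise the control flow (random features/logits).
if __name__ == "__main__":
    B, L, D, V, MASK = 2, 64, 32, 512, 511
    full = lambda t: torch.randn(t.shape[0], t.shape[1], D)
    part = lambda t, c, m: torch.randn(int(m.sum()), D)
    head = lambda f: torch.randn(f.shape[0], f.shape[1], V)
    print(recap_generate(full, part, head, torch.full((B, L), MASK), MASK).shape)
```

In a real model the lightweight step is where the savings come from: context tokens keep their cached features, and only the masked queries are pushed through the network.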
Related papers
- Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis [79.98107530577576]
DisCon is a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
arXiv Detail & Related papers (2025-07-02T14:33:52Z)
- MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention [10.077033449956806]
Masked autoregressive (MAR) models unify the strengths of masked and autoregressive generation by predicting tokens in a fixed order, using bidirectional attention for image generation. While effective, MAR models suffer from significant computational overhead, as they recompute attention and feed-forward representations for all tokens at every decoding step. We propose MARché, a training-free generation framework that addresses this inefficiency through two key components: cache-aware attention and selective KV refresh (see the sketch after this entry).
arXiv Detail & Related papers (2025-05-22T23:26:56Z)
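A rough sketch of cache-aware attention with selective KV refresh in the spirit of MARché follows; it is not the paper's code, and `q_proj`/`kv_proj` are hypothetical projection callables.

```python
# Cache-aware attention: recompute keys/values only for "refreshed" positions.
import torch

@torch.no_grad()
def cache_aware_attention(q_proj, kv_proj, x, cache, refresh):
    """x: (B, L, D) hidden states; cache: dict with 'k' and 'v' of shape
    (B, L, Dh); refresh: (L,) bool mask of positions whose cached keys and
    values are stale (e.g. newly decoded tokens) and must be recomputed."""
    if refresh.any():
        k_new, v_new = kv_proj(x[:, refresh])        # recompute only stale KV
        cache["k"][:, refresh] = k_new
        cache["v"][:, refresh] = v_new
    q = q_proj(x)                                    # queries are always fresh
    scale = cache["k"].shape[-1] ** -0.5
    attn = torch.softmax(q @ cache["k"].transpose(-2, -1) * scale, dim=-1)
    return attn @ cache["v"]                         # (B, L, Dh)
```

At a full step every position would be refreshed; at cheap steps only the tokens decoded since the last refresh, which is where the computational savings come from.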
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Fast Autoregressive Models for Continuous Latent Generation [49.079819389916764]
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP. A recent model, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head. We propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head.
arXiv Detail & Related papers (2025-04-24T13:57:08Z)
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [62.77721499671665]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z)
- Efficient Generative Modeling with Residual Vector Quantization-Based Tokens [5.949779668853557]
ResGen is an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. We validate the efficacy and generalizability of the proposed method on two challenging tasks: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. As we scale the depth of the RVQ (see the sketch after this entry), our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
arXiv Detail & Related papers (2024-12-13T15:31:17Z)
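For context, here is a compact sketch of residual vector quantization (RVQ), the token scheme ResGen builds on; the random codebooks below are stand-ins, not trained ones.

```python
# RVQ: a stack of codebooks, each quantizing the residual left by the last.
import torch

def rvq_encode(x, codebooks):
    """x: (N, D) vectors; codebooks: list of (K, D) tensors, one per level.
    Returns per-level code indices and the quantized reconstruction."""
    residual, codes = x.clone(), []
    for cb in codebooks:                    # each level quantizes what the
        dist = torch.cdist(residual, cb)    # previous levels left over
        idx = dist.argmin(dim=-1)           # nearest codeword per vector
        codes.append(idx)
        residual = residual - cb[idx]
    return torch.stack(codes, dim=-1), x - residual

codebooks = [torch.randn(256, 16) for _ in range(4)]      # depth-4 RVQ
codes, x_hat = rvq_encode(torch.randn(8, 16), codebooks)  # codes: (8, 4)
```

Deeper stacks shrink the residual, trading more tokens per item for finer reconstructions, which is the depth/fidelity knob the abstract refers to.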
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by 1.75× and 1.82× compared to greedy decoding and random sampling, respectively.
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format (see the sketch after this entry).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
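A small sketch of the Dense-and-Sparse idea follows: keep the largest-magnitude outlier weights in a full-precision sparse matrix and quantize the dense remainder to few bits. The uniform quantizer below is a simplification; SqueezeLLM itself uses sensitivity-based non-uniform codebooks derived from second-order information.

```python
# Dense-and-Sparse decomposition: W ≈ quantized dense part + sparse outliers.
import torch

def dense_and_sparse(W, outlier_frac=0.005, bits=3):
    k = int(W.numel() * (1 - outlier_frac))
    thresh = W.abs().flatten().kthvalue(k).values    # outlier cutoff magnitude
    outliers = W.abs() > thresh
    sparse = (W * outliers).to_sparse()              # few full-precision outliers
    dense = W * ~outliers                            # small-range remainder
    scale = dense.abs().max() / (2 ** (bits - 1) - 1)
    q = (dense / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.to(torch.int8), scale, sparse

W = torch.randn(512, 512)
q, scale, sparse = dense_and_sparse(W)
W_hat = q.float() * scale + sparse.to_dense()        # dequantized approximation
```

Removing the outliers shrinks the dynamic range of the dense part, which is what makes ultra-low-bit quantization of the remainder tolerable.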
- Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering [15.056672221375104]
Deep learning-based image compression has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC. However, many leading learned schemes cannot maintain a good trade-off between performance and complexity. We propose an efficient and effective image coding framework, which achieves similar R-D performance with lower complexity than the state of the art.
arXiv Detail & Related papers (2022-06-21T09:34:29Z)