LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
- URL: http://arxiv.org/abs/2503.12450v1
- Date: Sun, 16 Mar 2025 10:54:59 GMT
- Title: LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
- Authors: Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang
- Abstract summary: Masked Autoregressive (MAR) models have emerged as a promising approach in image generation. We propose LazyMAR, which introduces two caching mechanisms to exploit token and condition redundancy. Our method achieves 2.83 times acceleration with almost no drop in generation quality.
- Score: 33.024044212891326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in adjacent decoding steps, which allows us to cache them in earlier steps and reuse them in later steps. Condition Redundancy indicates that the difference between the conditional and unconditional outputs in classifier-free guidance takes very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our code will be released at https://github.com/feihongyan1/LazyMAR.
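To make the two redundancies concrete, below is a minimal, hypothetical sketch of how such caches could be wired into a decoding loop. The toy backbone and all names (`forward_with_token_cache`, `cfg_with_condition_cache`, `sim_thresh`, `refresh_every`) are invented for illustration; this is not the authors' released LazyMAR implementation, and a real MAR would recompute only the changed tokens inside attention rather than running the full backbone as done here.

```python
# A minimal, hypothetical sketch of the two caches described in the abstract.
# All names and the toy backbone are invented for illustration; this is not
# the authors' released LazyMAR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBackbone(nn.Module):
    """Stand-in for a MAR transformer block with bidirectional self-attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, x):          # x: [batch, tokens, dim]
        return self.layer(x)


def forward_with_token_cache(backbone, x, cache, sim_thresh=0.95):
    """Token Redundancy: tokens whose inputs barely changed since the previous
    decoding step reuse their cached output features. (A real implementation
    would restrict the attention computation itself to the changed tokens;
    here only the cache bookkeeping is illustrated.)"""
    if "prev_in" not in cache:
        out = backbone(x)
    else:
        sim = F.cosine_similarity(x, cache["prev_in"], dim=-1)   # [batch, tokens]
        stale = sim < sim_thresh                                 # tokens to recompute
        out = cache["prev_out"].clone()
        if stale.any():
            out[stale] = backbone(x)[stale]  # full pass for simplicity; keep stale outputs
    cache["prev_in"], cache["prev_out"] = x.detach(), out.detach()
    return out


def cfg_with_condition_cache(backbone, x, step, cache, refresh_every=4, guidance=3.0):
    """Condition Redundancy: the (conditional - unconditional) gap in
    classifier-free guidance drifts slowly across adjacent steps, so it is
    recomputed only every `refresh_every` steps and reused in between.
    Conditioning inputs are omitted from this toy backbone."""
    cond_out = backbone(x)                                # conditional pass, always run
    if step % refresh_every == 0 or "residual" not in cache:
        uncond_out = backbone(torch.zeros_like(x))        # stand-in unconditional pass
        cache["residual"] = (cond_out - uncond_out).detach()
    return cond_out + (guidance - 1.0) * cache["residual"]


if __name__ == "__main__":
    torch.manual_seed(0)
    backbone, cache = ToyBackbone(), {}
    x = torch.randn(2, 16, 64)
    with torch.inference_mode():
        for step in range(8):                             # mimic 8 MAR decoding steps
            x = x + 0.01 * torch.randn_like(x)            # tokens change only slightly
            feats = forward_with_token_cache(backbone, x, cache)
            guided = cfg_with_condition_cache(backbone, feats, step, cache)
    print(guided.shape)                                   # torch.Size([2, 16, 64])
```

The intended saving in a sketch like this is that the unconditional pass and the unchanged tokens are mostly served from the cache, while the conditional pass still runs at every decoding step.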
Related papers
- Fast Autoregressive Models for Continuous Latent Generation [49.079819389916764]
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP.
Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head.
We propose Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head.
arXiv Detail & Related papers (2025-04-24T13:57:08Z) - Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model [2.580765958706854]
Diffusion models are hindered by their high computational cost and slow inference. One acceleration approach focuses on reducing the number of tokens fed into self-attention, known as token merging (ToMe).
arXiv Detail & Related papers (2025-01-01T20:16:27Z) - Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation. Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. LANTERN increases speed-ups by 1.75x and 1.82x, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition [80.22784377150465]
Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding.
This paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER.
NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD).
arXiv Detail & Related papers (2024-07-16T04:52:39Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Self-Supervised Bernoulli Autoencoders for Semi-Supervised Hashing [1.8899300124593648]
This paper investigates the robustness of hashing methods based on variational autoencoders to the lack of supervision.
We propose a novel supervision method in which the model uses its label distribution predictions to implement the pairwise objective.
Our experiments show that both methods can significantly increase the hash codes' quality.
arXiv Detail & Related papers (2020-07-17T07:47:10Z) - LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)