Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation
- URL: http://arxiv.org/abs/2506.18226v1
- Date: Mon, 23 Jun 2025 01:27:06 GMT
- Title: Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation
- Authors: Xunzhi Xiang, Qi Fan
- Abstract summary: We propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). ADSA identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. We also introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50\%$.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, the excessively long contexts encountered during inference incur significant KV-cache memory overhead and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50\%$. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
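The abstract does not spell out ADSA's selection rule, but it implies a two-part criterion: a local window of recent tokens (texture consistency) plus a set of globally salient historical tokens (semantic coherence), with unselected entries evicted from the KV-cache. The following is a minimal, hypothetical sketch under that reading; the function name adsa_step and the window/topk parameters are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of one ADSA-style decoding step (not the paper's code).
import torch

def adsa_step(q, k_cache, v_cache, window=64, topk=64):
    """Attend to a sparse subset of the KV-cache and evict the rest.

    q:       (heads, 1, d)  query for the token being generated
    k_cache: (heads, T, d)  cached keys for T historical tokens
    v_cache: (heads, T, d)  cached values
    """
    heads, T, d = k_cache.shape
    scores = (q @ k_cache.transpose(-2, -1)) / d ** 0.5    # (heads, 1, T)

    # Assumed "local texture" set: the most recent tokens.
    local = torch.arange(max(0, T - window), T, device=q.device)
    # Assumed "global semantics" set: tokens with the highest
    # mean attention score across heads.
    glob = scores.mean(dim=0).squeeze(0).topk(min(topk, T)).indices

    keep = torch.unique(torch.cat([local, glob]))          # sparse index set
    attn = scores[:, :, keep].softmax(dim=-1)              # re-normalize
    out = attn @ v_cache[:, keep]                          # (heads, 1, d)

    # Dynamic KV-cache update: drop everything outside `keep`,
    # shrinking the cache (and its GPU memory) to |keep| entries.
    return out, k_cache[:, keep], v_cache[:, keep]
```

Note that this sketch still scores every cached token once in order to rank them; the savings come from the cache shrinking for subsequent steps. Under this reading, evicting the unselected keys and values at each step is what would plausibly account for the reported roughly $50\%$ reduction in KV-cache memory when about half the historical tokens are retained.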
Related papers
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models.
Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality.
It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective [47.87649021414188]
We present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity.
Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency.
arXiv Detail & Related papers (2025-07-02T12:27:06Z)
- Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of "memory utilization".
This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
arXiv Detail & Related papers (2025-04-28T08:12:30Z)
- Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking [8.040709469401257]
We propose a differentiable dynamic attention mechanism that adaptively adjusts channel attention weights by analyzing spatial attention weights.
A lightweight gating network autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios.
arXiv Detail & Related papers (2025-03-21T00:48:31Z)
- Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.
We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.
We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
- Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR) in the form of next-scale prediction.
We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z)
- Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images [4.3565203412433195]
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images.
We propose Querent, a query-aware long contextual dynamic modeling framework.
Our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations.
arXiv Detail & Related papers (2025-01-31T09:29:21Z)
- StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture [29.178246094092202]
Style transfer aims to generate a new image that preserves the content but carries the artistic representation of the style source.
Most existing methods are based on Transformers or diffusion models; however, they suffer from quadratic computational complexity and high inference time.
We present StyleRWKV, a novel framework that achieves high-quality style transfer with limited memory usage and linear time complexity.
arXiv Detail & Related papers (2024-12-27T09:01:15Z)
- CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes [31.783117836434403]
Current methods for novel view synthesis (NVS) in dynamic scenes encounter significant challenges in managing memory consumption, model complexity, training efficiency, and rendering fidelity.
We propose continual dynamic neural graphics primitives (CD-NGP) to address these issues.
Our approach leverages a continual learning framework to reduce memory overhead, and it also integrates features from distinct temporal and spatial hash encodings for high rendering quality.
arXiv Detail & Related papers (2024-09-08T17:35:48Z)
- DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image Segmentation Incorporating Feature Similarity and Spatial Continuity [0.5755004576310334]
We introduce DynaSeg, an innovative unsupervised image segmentation approach.
Unlike traditional methods, DynaSeg employs a dynamic weighting scheme that adapts flexibly to image characteristics.
DynaSeg prevents undersegmentation failures where the number of predicted clusters might converge to one.
arXiv Detail & Related papers (2024-05-09T00:30:45Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables when paired with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)