Accelerating Controllable Generation via Hybrid-grained Cache
- URL: http://arxiv.org/abs/2511.11031v1
- Date: Fri, 14 Nov 2025 07:35:50 GMT
- Title: Accelerating Controllable Generation via Hybrid-grained Cache
- Authors: Lin Liu, Huixia Ben, Shuo Wang, Jinda Lu, Junxiang Qiu, Shengeng Tang, Yanbin Hao
- Abstract summary: Controllable generative models have been widely used to improve the realism of synthetic visual content. We propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality.
- Score: 32.75230327622271
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle the computational demands of both control conditions and content generation, which generally lowers generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained, block-level cache based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between consecutive inference steps, and (2) we design a fine-grained, prompt-level cache that acts within a module: it reuses cross-attention maps computed at one inference step and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into every computational stage of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantified performance degradation) within 1.5%.
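To make the two granularities concrete, below is a minimal Python sketch of the general pattern rather than the authors' implementation: a block-level cache that bypasses an encoder-decoder block when its input has barely drifted since the cached step, and a prompt-level cache that reuses the previous step's cross-attention map. The threshold TAU, the alternate-step reuse rule, and all tensor shapes are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)
W = 0.1 * np.random.randn(16, 16)

def block(x):
    """Stand-in for one encoder/decoder block of the generator."""
    return np.tanh(x @ W)

def cross_attention(q, k, v, cached_map=None):
    """Fine-grained (prompt-level) cache: optionally reuse a stored attention map."""
    if cached_map is None:
        s = q @ k.T / np.sqrt(q.shape[-1])
        cached_map = np.exp(s - s.max(-1, keepdims=True))
        cached_map /= cached_map.sum(-1, keepdims=True)
    return cached_map @ v, cached_map

x = np.random.randn(8, 16)                                # latent tokens
tk, tv = np.random.randn(4, 16), np.random.randn(4, 16)  # text keys/values
blk_in, blk_out, attn_map = None, None, None
TAU = 0.05                                                # hypothetical reuse threshold

for step in range(10):
    # coarse-grained (block-level): bypass the block if its input barely moved
    if blk_in is not None and np.linalg.norm(x - blk_in) < TAU * np.linalg.norm(x):
        h = blk_out
    else:
        h = block(x)
        blk_in, blk_out = x.copy(), h
    # fine-grained (prompt-level): reuse the previous attention map on odd steps
    out, attn_map = cross_attention(h, tk, tv, attn_map if step % 2 else None)
    x = x + 0.1 * out                                     # toy stand-in for one denoising update
```

The coarse cache skips whole-block compute between steps, while the fine cache skips only the softmax-map computation inside cross-attention; HGC's point is to apply each at the stage where it is cheapest.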
Related papers
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching [75.02865981328509]
Caching reduces computation by reusing previously computed model outputs across timesteps. We propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. SenCache achieves better visual quality than existing caching methods under similar computational budgets.
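The abstract does not spell out the sensitivity criterion, so the sketch below substitutes a simple drift test: a step is recomputed only when the state has moved far enough since the cache was last filled, which naturally yields different cached timesteps per sample. The threshold tau, the residual-replay rule, and the toy model_step are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(x, t):
    """Toy stand-in for one denoising step; changes shrink as t grows."""
    return x + rng.normal(scale=1.0 / (t + 1), size=x.shape)

def run_with_adaptive_cache(x, n_steps=20, tau=0.15):
    """Hypothetical sensitivity-aware policy (paper-inspired, not its algorithm):
    recompute only when the state has drifted past tau since the cache was
    filled, so each sample ends up with its own set of cached timesteps."""
    cached_x, cached_res, recomputed = None, None, 0
    for t in range(n_steps):
        drift = (np.linalg.norm(x - cached_x) / (np.linalg.norm(cached_x) + 1e-8)
                 if cached_x is not None else np.inf)
        if drift > tau:                        # sensitive step: pay full compute
            out = model_step(x, t)
            cached_x, cached_res = x.copy(), out - x
            recomputed += 1
        else:                                  # insensitive step: replay residual
            out = x + cached_res
        x = out
    return x, recomputed

_, n = run_with_adaptive_cache(rng.normal(size=(8, 16)))
print(f"recomputed {n} of 20 steps")
```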
arXiv Detail & Related papers (2026-02-27T17:36:09Z)
- H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models [7.8812023976358425]
H2-cache is a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. Experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline.
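A minimal sketch of the dual-stage idea, assuming a drift-based reuse test like the one in the SenCache sketch above; the halfway split point and both thresholds are invented knobs, not the paper's settings.

```python
def dual_stage_threshold(t, n_steps, tau_structure=0.02, tau_detail=0.10):
    """Hypothetical dual-threshold rule in the spirit of H2-cache: reuse caches
    cautiously while global structure is forming, then far more aggressively
    once the process is only refining details."""
    structure_stage = t < n_steps // 2
    return tau_structure if structure_stage else tau_detail
```

Plugged into a caching loop (e.g., replacing the fixed tau above with dual_stage_threshold(t, n_steps)), this trades almost no quality in the structure stage for large savings in the detail stage.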
arXiv Detail & Related papers (2025-10-31T04:47:14Z)
- Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation [67.20779609022108]
Flow matching models generate high-fidelity molecular geometries but incur significant computational costs during inference. This work discusses a training-free caching strategy that accelerates molecular geometry generation. Experiments on the GEOM-Drugs dataset demonstrate that caching achieves a twofold reduction in wall-clock inference time.
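One plausible reading of "predictive" caching, sketched under that assumption: evaluate the network only on every k-th step and fill the gaps by first-order extrapolation of the two most recent cached features. With compute_every=2 this halves network evaluations, loosely matching the reported twofold reduction; the extrapolation rule is our guess, not the paper's method.

```python
import numpy as np

def rollout_with_prediction(f, x, n_steps, compute_every=2):
    """Hypothetical predictive cache: evaluate `f` only every k-th step and
    fill the gaps by linearly extrapolating the two newest cached features."""
    feats = []
    for t in range(n_steps):
        if t % compute_every == 0 or len(feats) < 2:
            feats.append(f(x, t))                    # genuine evaluation, goes in cache
        else:
            feats.append(2 * feats[-1] - feats[-2])  # predicted, never computed
    return feats

# usage with a toy step function
feats = rollout_with_prediction(lambda x, t: x * (1 - t / 10), np.ones((4, 3)), 10)
```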
arXiv Detail & Related papers (2025-10-06T09:49:14Z)
- iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation [49.8026360054331]
iFlame is a novel transformer-based network architecture for mesh generation. We propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance.
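A hedged sketch of what interleaving full and linear attention can look like; the elu(x)+1 feature map is the standard choice for linear attention, while the one-full-layer-per-three ratio and the residual wiring are assumptions rather than iFlame's actual architecture.

```python
import numpy as np

def full_attention(q, k, v):
    """O(n^2) softmax attention."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(s - s.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ v

def linear_attention(q, k, v, eps=1e-6):
    """O(n) kernelized attention with the common phi(x) = elu(x) + 1 feature map."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    kv = phi(k).T @ v                          # (d, d_v) running summary
    z = phi(q) @ phi(k).sum(0) + eps           # per-query normalizer
    return (phi(q) @ kv) / z[:, None]

def interleaved_stack(x, n_layers=6, full_every=3):
    """Hypothetical interleaving pattern: mostly linear-attention layers, with
    a full-attention layer every `full_every` layers (ratio is an assumption)."""
    for i in range(n_layers):
        attn = full_attention if (i + 1) % full_every == 0 else linear_attention
        x = x + attn(x, x, x)                  # residual self-attention layer
    return x

out = interleaved_stack(np.random.randn(32, 16))
```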
arXiv Detail & Related papers (2025-03-20T19:10:37Z)
- Accelerating Diffusion Transformer via Gradient-Optimized Cache [18.32157920050325]
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. This is challenging because progressive error accumulation from cached blocks significantly degrades generation quality. We propose the Gradient-Optimized Cache (GOC) with two key innovations. GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs.
arXiv Detail & Related papers (2025-03-07T05:31:47Z)
- CacheQuant: Comprehensively Accelerated Diffusion Models [3.78219736760145]
CacheQuant is a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Experimental results show that CacheQuant achieves a 5.18x speedup and 4x compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score.
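The quantization half of such a pipeline is easy to illustrate: storing cached features in 8 bits shrinks each cache entry about 4x before any caching-schedule savings. A generic uniform-quantization sketch, not CacheQuant's joint optimization:

```python
import numpy as np

def quantize_u8(x):
    """Uniform 8-bit quantization of a cached feature tensor."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1e-12         # guard against constant tensors
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_u8(q, lo, scale):
    return q.astype(np.float32) * scale + lo

feat = np.random.randn(64, 128).astype(np.float32)
q, lo, scale = quantize_u8(feat)               # ~4x smaller cache entry
restored = dequantize_u8(q, lo, scale)         # reused on later steps
```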
arXiv Detail & Related papers (2025-03-03T09:04:51Z)
- XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference [9.65524177141491]
Large Language Model (LLM) inference generates output tokens one-by-one, leading to many redundant computations. The KV-Cache framework makes a compromise between time and space complexities. Existing studies reduce memory consumption by evicting cached data that have less impact on inference accuracy. We show that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction.
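A sketch of the allocation idea under stated assumptions: given some per-layer importance scores (how they are measured is the paper's contribution and is not reproduced here), split one global KV budget proportionally instead of uniformly.

```python
import numpy as np

def personalize_kv_budgets(layer_scores, total_budget):
    """Hypothetical XKV-flavored allocator: give each layer a KV-cache size
    proportional to a measured importance score, rather than a uniform size."""
    scores = np.asarray(layer_scores, dtype=float)
    raw = scores / scores.sum() * total_budget
    sizes = np.floor(raw).astype(int)
    # hand the rounding leftovers to the layers closest to the next integer
    for i in np.argsort(raw - sizes)[::-1][: total_budget - sizes.sum()]:
        sizes[i] += 1
    return sizes

print(personalize_kv_budgets([3.0, 1.0, 0.5, 2.5], total_budget=1024))
# -> [439 146  73 366]: important layers keep more cached tokens
```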
arXiv Detail & Related papers (2024-12-08T11:32:08Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
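A hedged sketch of query-driven key-channel pruning; the per-channel score below (mean |q| times mean |k|) is an illustrative proxy for the paper's attention-weight-loss criterion.

```python
import numpy as np

def prune_key_channels(q, k, keep_ratio=0.6):
    """Hypothetical query-driven channel pruning: score each key channel by the
    magnitude of its contribution to the q.k logits and drop the weakest ones."""
    contrib = np.abs(q).mean(axis=0) * np.abs(k).mean(axis=0)
    keep = np.argsort(contrib)[::-1][: int(k.shape[1] * keep_ratio)]
    return k[:, keep], keep            # queries must use q[:, keep] to match

q = np.random.randn(16, 64)            # (queries, channels)
k = np.random.randn(128, 64)           # cached keys: (tokens, channels)
k_thin, kept = prune_key_channels(q, k)
logits = q[:, kept] @ k_thin.T         # attention logits from the thinner cache
```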
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods, at the same inference speed.
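A minimal sketch of the routing interface such a learned scheme implies; the logits here are zero-initialized placeholders, whereas L2C learns them, and the class layout is our assumption.

```python
import numpy as np

class LearnedCacheRouter:
    """Hypothetical L2C-style router: one logit per (layer, timestep) pair
    decides whether a layer recomputes or reuses its cached output."""
    def __init__(self, n_layers, n_steps):
        self.logits = np.zeros((n_layers, n_steps))   # learned offline in the paper

    def should_recompute(self, layer, step):
        p = 1.0 / (1.0 + np.exp(-self.logits[layer, step]))
        return p > 0.5

router = LearnedCacheRouter(n_layers=28, n_steps=50)
cache = {}

def run_layer(i, step, x, layer_fn):
    if router.should_recompute(i, step) or i not in cache:
        cache[i] = layer_fn(x)         # fresh compute, refresh cache
    return cache[i]                    # otherwise reuse the last output
```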
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Rethinking and Accelerating Graph Condensation: A Training-Free Approach with Class Partition [49.41718583061147]
Graph condensation is a data-centric solution that replaces a large graph with a small yet informative condensed graph. Existing GC methods suffer from intricate optimization processes, necessitating excessive computing resources and training time. We propose a training-free GC framework termed Class-partitioned Graph Condensation (CGC). CGC condenses the Ogbn-products graph within 30 seconds, achieving a speedup ranging from $10^2\times$ to $10^4\times$ and increasing accuracy by up to 4.2%.
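A training-free sketch in the class-partition spirit, with the caveat that CGC's actual partitioning and aggregation rules are richer: split nodes by class, average each partition into a prototype, and skip gradient-based optimization entirely.

```python
import numpy as np

def class_partition_condense(features, labels, per_class=1):
    """Hypothetical training-free condensation: partition nodes by class and
    replace each partition with its mean-feature prototype."""
    protos, proto_labels = [], []
    for c in np.unique(labels):
        cls_feats = features[labels == c]
        for chunk in np.array_split(cls_feats, per_class):
            protos.append(chunk.mean(axis=0))
            proto_labels.append(c)
    return np.stack(protos), np.array(proto_labels)

X = np.random.randn(1000, 32)
y = np.random.randint(0, 10, size=1000)
Xc, yc = class_partition_condense(X, y, per_class=4)   # 1000 nodes -> 40
```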
arXiv Detail & Related papers (2024-05-22T14:57:09Z)
- Artificial Intelligence Assisted Collaborative Edge Caching in Small Cell Networks [19.605382256630538]
This paper considers heterogeneous content preferences of users together with heterogeneous caching models at the edge nodes.
We propose a modified particle swarm optimization (M-PSO) algorithm that efficiently solves this complex constrained optimization problem in a reasonable time.
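For orientation, a plain PSO sketch applied to cache placement; the paper's modifications (the "M" in M-PSO) and its exact constraint handling are not reproduced, and the capacity penalty below is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_cache_placement(cost, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Plain PSO (not the paper's modified variant): search over continuous
    cache-placement vectors in [0, 1]^dim to minimize a user-supplied cost."""
    x = rng.random((n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_cost = x.copy(), np.array([cost(p) for p in x])
    g = pbest[pbest_cost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)
        costs = np.array([cost(p) for p in x])
        better = costs < pbest_cost
        pbest[better], pbest_cost[better] = x[better], costs[better]
        g = pbest[pbest_cost.argmin()].copy()
    return g

# toy usage: cache the most popular items under a soft capacity penalty
pop = rng.random(50)
cost = lambda p: -(pop * p).sum() + 5.0 * max(0.0, p.sum() - 10)
placement = pso_cache_placement(cost, dim=50)
```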
arXiv Detail & Related papers (2020-05-16T10:39:46Z)