Efficient Unified Caching for Accelerating Heterogeneous AI Workloads
- URL: http://arxiv.org/abs/2506.12370v1
- Date: Sat, 14 Jun 2025 06:36:54 GMT
- Title: Efficient Unified Caching for Accelerating Heterogeneous AI Workloads
- Authors: Tianze Wang, Yifei Liu, Chen Chen, Pengfei Zuo, Jiawei Zhang, Qizhen Weng, Yin Chen, Zhenhua Han, Jieru Zhao, Quan Chen, Minyi Guo
- Abstract summary: Modern AI clusters host diverse workloads like data pre-processing, training and inference. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. We propose IGTCache, a unified, high-efficacy cache for modern AI clusters.
- Score: 20.192896983810606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern AI clusters, which host diverse workloads like data pre-processing, training and inference, often store the large-volume data in cloud storage and employ caching frameworks to facilitate remote data access. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. However, existing cache management strategies, designed for specific workloads, struggle to handle the heterogeneous AI workloads in a cluster -- which usually exhibit heterogeneous access patterns and item storage granularities. In this paper, we propose IGTCache, a unified, high-efficacy cache for modern AI clusters. IGTCache leverages a hierarchical access abstraction, AccessStreamTree, to organize the recent data accesses in a tree structure, facilitating access pattern detection at various granularities. Using this abstraction, IGTCache applies hypothesis testing to categorize data access patterns as sequential, random, or skewed. Based on these detected access patterns and granularities, IGTCache tailors optimal cache management strategies including prefetching, eviction, and space allocation accordingly. Experimental results show that IGTCache increases the cache hit ratio by 55.6% over state-of-the-art caching frameworks, reducing the overall job completion time by 52.2%.
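As an illustration of the pattern-classification step described in the abstract, the following is a minimal Python sketch, not IGTCache's implementation. The abstract does not say which hypothesis tests are used, so the function name `classify_accesses`, the thresholds `seq_threshold` and `z_crit`, the adjacent-step check for sequential access, and the chi-square goodness-of-fit test against a uniform null are all assumptions made for this example. It also works on a single flat window of integer item indices rather than on an AccessStreamTree.

```python
import math
from collections import Counter


def classify_accesses(accesses, universe_size, seq_threshold=0.8, z_crit=2.326):
    """Label a window of item indices as 'sequential', 'skewed', or 'random'.

    Hypothetical helper: thresholds and tests are illustrative assumptions.
    accesses      -- list of integer item indices in [0, universe_size)
    universe_size -- number of distinct items that could be accessed
    """
    if len(accesses) < 2 or universe_size < 2:
        return "random"  # too little evidence; keep the uniform null hypothesis

    # Sequential check: share of accesses that directly follow their predecessor.
    steps = sum(1 for a, b in zip(accesses, accesses[1:]) if b == a + 1)
    if steps / (len(accesses) - 1) >= seq_threshold:
        return "sequential"

    # Skew check: chi-square statistic against a uniform access distribution.
    expected = len(accesses) / universe_size
    counts = Counter(accesses)
    chi2 = sum((counts.get(i, 0) - expected) ** 2 / expected
               for i in range(universe_size))

    # Normal approximation to the chi-square distribution (rough for small df);
    # z_crit = 2.326 is the one-sided 1% critical value of the standard normal.
    df = universe_size - 1
    z = (chi2 - df) / math.sqrt(2 * df)
    return "skewed" if z > z_crit else "random"


print(classify_accesses(list(range(100)), 100))                    # sequential
print(classify_accesses([0] * 80 + [3] * 10 + [7] * 10, 10))       # skewed
print(classify_accesses([(i * 3) % 10 for i in range(100)], 10))   # random
```

In this sketch a detected label would then select a policy, for example prefetching for sequential windows; how IGTCache maps patterns to prefetching, eviction, and space allocation is not detailed in the abstract.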
Related papers
- LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models. LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z)
- QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation. DiTs come with significant drawbacks, including increased computational and memory costs. We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z)
- InstCache: A Predictive Cache for LLM Serving [6.076957323090607]
Caching techniques offer opportunities to optimize the performance of Large Language Model inference engines. High variability in the content and length of instructions makes it rare for identical instructions to recur within a short time window. We propose InstCache, a predictive caching mechanism for LLM serving systems.
arXiv Detail & Related papers (2024-11-21T03:52:41Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
The Key-Value (KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z)
- A Learning-Based Caching Mechanism for Edge Content Delivery [2.412158290827225]
5G networks and the rise of the Internet of Things (IoT) are increasingly extending into the network edge.
This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge.
We introduce HR-Cache, a learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering.
arXiv Detail & Related papers (2024-02-05T08:06:03Z)
- Systematic Evaluation of Randomized Cache Designs against Cache Occupancy [11.018866935621045]
This work fills a crucial gap in the current literature on randomized caches. Most randomized cache designs defend only against contention-based attacks and leave out considerations of cache occupancy. Our results establish the need to also consider the cache occupancy side channel in randomized cache design.
arXiv Detail & Related papers (2023-10-08T14:06:06Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.