Related papers: Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

URL: http://arxiv.org/abs/2603.00188v1
Date: Fri, 27 Feb 2026 01:27:20 GMT
Title: Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
Authors: Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang,
Abstract summary: We propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents. With only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines.
Score: 29.993062853291622
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
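To make the "10-20% cache budget" concrete, here is a minimal sketch of budget-constrained KV cache eviction in general. It is not ST-Lite's CSS/TSG scoring policy, which the abstract does not specify in detail: the function name `compress_kv`, the `budget_ratio` parameter, and the random per-token `scores` are all assumptions made for illustration. The sketch only shows the shared mechanic of such methods, which is to rank cached tokens by some importance score and keep the top fraction.

```python
import numpy as np

def compress_kv(keys, values, scores, budget_ratio=0.2):
    """Generic top-k KV eviction (illustrative, not the paper's method).

    keys, values: arrays of shape (num_tokens, head_dim)
    scores: array of shape (num_tokens,), higher means more important
    budget_ratio: fraction of cached tokens to retain (hypothetical name)
    """
    num_tokens = keys.shape[0]
    # Number of tokens to keep, clamped to [1, num_tokens].
    keep = min(num_tokens, max(1, int(round(num_tokens * budget_ratio))))
    # argpartition places the `keep` largest scores in the last `keep` slots.
    top = np.argpartition(scores, num_tokens - keep)[num_tokens - keep:]
    # Sort the surviving indices so the original token order is preserved.
    kept_idx = np.sort(top)
    return keys[kept_idx], values[kept_idx], kept_idx

# Toy cache: 100 tokens, head dimension 8, placeholder importance scores.
rng = np.random.default_rng(0)
keys = rng.standard_normal((100, 8))
values = rng.standard_normal((100, 8))
scores = rng.random(100)

k, v, idx = compress_kv(keys, values, scores, budget_ratio=0.2)
print(k.shape, v.shape)
```

In a real system the scores would come from the model's attention statistics rather than random numbers, and eviction would typically run per layer and per head.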

Related papers

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models [8.944739362562494]
Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens. We propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds.
arXiv Detail & Related papers (2026-02-02T15:01:44Z)
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference [14.17979669446161]
We propose HeteroCache, a training-free dynamic compression framework. We show that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context.
arXiv Detail & Related papers (2026-01-20T07:35:06Z)
GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness [75.00019285120878]
Key-value (KV) caching can mitigate this, but storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. We introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining.
arXiv Detail & Related papers (2025-10-01T05:37:54Z)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule [54.37983890753086]
We introduce OjaKV, a framework that integrates a strategic hybrid storage policy with online subspace adaptation. OjaKV preserves crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. It applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis.
arXiv Detail & Related papers (2025-09-25T21:42:27Z)
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference [11.73134417321505]
We propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. We show that our method achieves comparable performance to the full cache while retaining only 10% of the visual KV cache.
arXiv Detail & Related papers (2025-03-31T11:13:18Z)
TreeKV: Smooth Key-Value Cache Compression with Tree Structures [19.06842704338332]
TreeKV is a training-free method that employs a tree structure for smooth cache compression. It consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2.
arXiv Detail & Related papers (2025-01-09T06:00:27Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [23.431794605498084]
We propose LayerKV, a simple yet effective plug-in method that reduces TTFT without requiring additional hardware or compromising output performance. LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory. Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 69x and reduces SLO violation rates by 28.7%.
arXiv Detail & Related papers (2024-10-01T06:23:17Z)
CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [89.79139531731637]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks. We propose a joint compression method for ViTs that achieves a harmonious blend of high accuracy, fast inference speed, and favorable transferability to downstream tasks.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.