Related papers: CacheFormer: High Attention-Based Segment Caching

URL: http://arxiv.org/abs/2504.13981v1
Date: Fri, 18 Apr 2025 06:34:57 GMT
Title: CacheFormer: High Attention-Based Segment Caching
Authors: Sushant Singh, Ausif Mahmood
Abstract summary: We show how to efficiently handle long contexts in transformer-based language models with low perplexity. Our enhancements result in an architecture that outperforms existing SOTA architectures with an average perplexity improvement of 8.5% over similar model sizes.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the loss in quality due to the effective compression of the long context. Inspired by the cache and virtual memory principle in computers, where on a cache miss not only the needed data but also the adjacent data is retrieved from memory, we apply this concept to handling long contexts by dividing them into small segments. In our design, we retrieve the nearby segments in uncompressed form when high segment-level attention occurs at the compressed level. Our enhancements for handling long context include aggregating four attention mechanisms: short sliding window attention, long compressed segmented attention, dynamic retrieval of the top-k high-attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements result in an architecture that outperforms existing SOTA architectures with an average perplexity improvement of 8.5% over similar model sizes.
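
The cache-style retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`select_segments`, `token_spans`), the `neighbors` parameter, and the example scores are all hypothetical, and the abstract does not state how many adjacent segments are fetched or how segment-level scores are computed. The sketch only shows the general idea of picking the top-k segments by compressed-level attention and also fetching their neighbors, much like a cache line fill.

```python
from typing import List, Tuple


def select_segments(segment_scores: List[float], k: int, neighbors: int = 1) -> List[int]:
    """Pick the k highest-scoring segments plus `neighbors` segments on each side.

    `segment_scores` stands in for segment-level attention at the compressed
    level (an assumption for illustration; one score per segment).
    """
    n = len(segment_scores)
    # Rank segment indices by score, highest first (ties go to the lower index).
    ranked = sorted(range(n), key=lambda i: (-segment_scores[i], i))
    chosen = set()
    for idx in ranked[:k]:
        lo = max(0, idx - neighbors)
        hi = min(n - 1, idx + neighbors)
        chosen.update(range(lo, hi + 1))  # the hit itself plus adjacent segments
    return sorted(chosen)


def token_spans(segments: List[int], segment_len: int, seq_len: int) -> List[Tuple[int, int]]:
    """Map sorted segment indices to half-open [start, end) token ranges,
    merging ranges that touch so contiguous segments are fetched once."""
    spans: List[Tuple[int, int]] = []
    for s in segments:
        start = s * segment_len
        end = min((s + 1) * segment_len, seq_len)
        if spans and spans[-1][1] == start:
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans


# Hypothetical scores for 8 segments of 4 tokens each (sequence length 32).
scores = [0.05, 0.10, 0.60, 0.05, 0.02, 0.03, 0.15, 0.00]
picked = select_segments(scores, k=2, neighbors=1)
spans = token_spans(picked, segment_len=4, seq_len=32)
print(picked)  # [1, 2, 3, 5, 6, 7]
print(spans)   # [(4, 16), (20, 32)]
```

In this toy run, segments 2 and 6 score highest, so segments 1 to 3 and 5 to 7 would be read in uncompressed form, while the remaining segments would be seen only through the compressed attention path.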

Related papers

GTA: Grouped-head latenT Attention [44.19575886935378]
A critical bottleneck arises as KV cache and attention computations scale rapidly with text length. We propose Grouped-Head LatenT Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA cuts attention FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention.
arXiv Detail & Related papers (2025-06-15T07:19:33Z)
Compact Recurrent Transformer with Persistent Memory [16.48606806238812]
The Transformer architecture has shown significant success in many language processing and visual tasks. We propose a novel and efficient Compact Recurrent Transformer (CRT). CRT combines shallow Transformer models that process short local segments with recurrent neural networks to compress and manage a single persistent memory vector. We evaluate CRT on WordPTB and WikiText-103 for next-token-prediction tasks, as well as on the Toyota Smarthome video dataset for classification.
arXiv Detail & Related papers (2025-05-02T00:11:44Z)
Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
Inference-Friendly Models With MixAttention [7.103010772135246]
MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks.
arXiv Detail & Related papers (2024-09-23T13:37:25Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models [4.497551890206997]
The self-attention mechanism scales quadratically with sequence length. LongLoRA proposed shifted sparse attention (S²-Attn), effectively enabling context extension. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention.
arXiv Detail & Related papers (2024-06-09T07:23:34Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
Feature boosting with efficient attention for scene parsing [6.752935599738123]
This paper presents a novel feature-boosting network that gathers context from multiple levels of feature extraction. It computes the attention weights for each level of representation to generate the final class labels. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
arXiv Detail & Related papers (2024-02-29T15:22:21Z)
Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation [52.11279360934703]
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features. We propose a unified VOS framework, coined JointFormer, for joint modeling of three elements: feature, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z)
CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
This work incorporates transformers into a hierarchical framework for shape classification and part and scene segmentation. We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for real-time semantic segmentation of high-resolution images and videos. The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism. Results on multiple datasets demonstrate better accuracy and speed compared to existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.