Related papers: Trellis: Learning to Compress Key-Value Memory in Attention Models

URL: http://arxiv.org/abs/2512.23852v1
Date: Mon, 29 Dec 2025 20:32:10 GMT
Title: Trellis: Learning to Compress Key-Value Memory in Attention Models
Authors: Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni
Abstract summary: This paper introduces Trellis, a novel Transformer architecture with bounded memory. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. Experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines.
Score: 48.12167339402521
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
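
Illustrative sketch (not from the paper): the abstract describes a fixed-size memory that is updated recursively by online gradient descent with a forget gate. The Python below shows one generic way such an update can look, assuming a linear memory matrix `S`, a squared reconstruction loss, and scalar hyperparameters `lr` and `forget`. All of these names and choices are assumptions made for illustration; the abstract does not specify Trellis's actual update rule or its two-pass mechanism.

```python
import numpy as np

def update_memory(S, k, v, lr=0.5, forget=0.95):
    """One online gradient step on 0.5 * ||S @ k - v||^2, with a forget gate.

    S is a fixed-size (d_v, d_k) memory; k and v are one incoming key/value pair.
    The loss, lr, and forget are illustrative assumptions, not the paper's rule.
    """
    err = S @ k - v              # how badly the memory currently recalls v from k
    grad = np.outer(err, k)      # gradient of the squared loss with respect to S
    return forget * S - lr * grad

def read_memory(S, q):
    """Retrieve a value estimate for query q from the compressed memory."""
    return S @ q

rng = np.random.default_rng(0)
d_k, d_v, num_tokens = 8, 4, 1000

S = np.zeros((d_v, d_k))
for _ in range(num_tokens):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)       # unit-norm keys keep this toy update stable
    v = rng.standard_normal(d_v)
    S = update_memory(S, k, v)

# The memory footprint is independent of how many tokens were processed.
memory_shape = S.shape
```

Unlike a KV cache, which stores one key and one value per token, `S` keeps the same shape after any number of updates; the forget gate decays older associations so that newer ones can be written.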

Related papers

Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states). We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z)
Towards Compressive and Scalable Recurrent Memory [16.831420033939548]
Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window. We introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation.
arXiv Detail & Related papers (2026-02-11T07:21:49Z)
Fast-weight Product Key Memory [4.223740794663811]
We propose Fast-weight Product Key Memory (FwPKM) to transform the sparse Product Key Memory (PKM) into a dynamic, "fast-weight" episodic memory. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules.
arXiv Detail & Related papers (2026-01-02T12:37:53Z)
Lattice: Learning to Efficiently Compress the Memory [13.765057453744427]
This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that efficiently compresses the cache into a fixed number of memory slots. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths.
arXiv Detail & Related papers (2025-04-08T03:48:43Z)
CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation. In this work, we focus on developing an efficient compression technique for the KV cache. We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
Recurrent Action Transformer with Memory [39.58317527488534]
This paper proposes a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention. We conduct experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments. The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments.
arXiv Detail & Related papers (2023-06-15T19:29:08Z)
Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling. Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.