Lattice: Learning to Efficiently Compress the Memory
- URL: http://arxiv.org/abs/2504.05646v1
- Date: Tue, 08 Apr 2025 03:48:43 GMT
- Title: Lattice: Learning to Efficiently Compress the Memory
- Authors: Mahdi Karami, Vahab Mirrokni
- Abstract summary: This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that efficiently compresses the cache into a fixed number of memory slots. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths.
- Score: 13.765057453744427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state, so that only novel, non-redundant data is incorporated and interference with previously stored information is minimized. Experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths, with the performance improvement becoming more pronounced as the context length increases.
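A minimal sketch of the kind of update the abstract describes is given below; it is not the paper's implementation. The slot count, the softmax gate, the squared-error reconstruction objective, and the step size eta are all illustrative assumptions; the sketch only shows how a single gradient-descent step can be projected orthogonally to each slot's current state before writing.

```python
import numpy as np

def orthogonal_memory_update(M, k, v, eta=0.1):
    """Illustrative update of an m-slot memory M (shape [m, d]).

    Hypothetical sketch: one gradient-descent step on a squared-error
    reconstruction objective, where each slot only absorbs the component
    of its gradient that is orthogonal to its current state.
    """
    # State- and input-dependent gate: relevance of each slot to the new key.
    gate = np.exp(M @ k)                      # (m,)
    gate = gate / gate.sum()

    # Reconstruct the new value from memory; err is the gradient of
    # 0.5 * ||read - v||^2 with respect to the read vector.
    read = gate @ M                           # (d,)
    err = read - v

    # One gradient step per slot (treating the gate as constant here).
    grad = np.outer(gate, err)                # (m, d)

    # Orthogonal update: remove from each slot's step the component
    # parallel to that slot's current state, so only novel,
    # non-redundant information is written.
    norms = (M * M).sum(axis=1, keepdims=True) + 1e-8
    parallel = ((grad * M).sum(axis=1, keepdims=True) / norms) * M
    step = grad - parallel

    return M - eta * step

# Toy usage on a stream of key-value pairs.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 64)) * 0.1        # 8 memory slots, dimension 64
for _ in range(16):
    k, v = rng.standard_normal(64), rng.standard_normal(64)
    M = orthogonal_memory_update(M, k, v)
```

Under this form of update, a slot whose state already spans the incoming information receives a near-zero step, which is the sense in which only novel, non-redundant data is written and interference with stored content is limited.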
Related papers
- MoM: Linear Sequence Modeling with Mixture-of-Memories [9.665802842933209]
We introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques.
arXiv Detail & Related papers (2025-02-19T12:53:55Z) - CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z) - MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning [7.262751938473306]
Pruning is a well-established technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation.
We develop a new pruning algorithm, MPruner, that leverages mutual information through vector similarity.
MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
arXiv Detail & Related papers (2024-08-24T05:54:47Z) - Topology-aware Embedding Memory for Continual Learning on Expanding Networks [63.35819388164267]
We present a framework to tackle the memory explosion problem using memory replay techniques.
PDGNNs with Topology-aware Embedding Memory (TEM) significantly outperform state-of-the-art techniques.
arXiv Detail & Related papers (2024-01-24T03:03:17Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z) - HiPPO: Recurrent Memory with Optimal Polynomial Projections [93.3537706398653]
We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto bases.
Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem.
This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale.
arXiv Detail & Related papers (2020-08-17T23:39:33Z) - Incremental Training of a Recurrent Neural Network Exploiting a Multi-Scale Dynamic Memory [79.42778415729475]
We propose a novel incrementally trained recurrent architecture targeting explicitly multi-scale learning.
We show how to extend the architecture of a simple RNN by separating its hidden state into different modules.
We discuss a training algorithm where new modules are iteratively added to the model to learn progressively longer dependencies.
arXiv Detail & Related papers (2020-06-29T08:35:49Z) - Ordering Chaos: Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices [10.876317610988059]
We present a memory-aware compiler, dubbed SERENITY, that finds a schedule with optimal memory footprint.
Our solution also includes a graph rewriting technique that allows further reduction beyond the optimum.
arXiv Detail & Related papers (2020-03-04T23:38:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.