Language Modeling With Factorization Memory
- URL: http://arxiv.org/abs/2511.00315v1
- Date: Fri, 31 Oct 2025 23:27:11 GMT
- Title: Language Modeling With Factorization Memory
- Authors: Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai, et al.
- Abstract summary: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks. We develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This work provides a systematic empirical analysis of Factorization Memory in comparison to Transformer and Mamba-2 architectures.
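The abstract does not give the update equations, so the following is only a toy PyTorch sketch of what "updating a subset of recurrent states at each step" can look like. The slot bank `h`, the projections `router` and `inp`, and the scalar `decay` are hypothetical names, not the paper's; the actual Factorization Memory update is built on Mamba-2 and is not reproduced here.

```python
import torch

def sparse_memory_step(h, x, router, inp, decay=0.95, k=4):
    """One toy recurrent step that updates only the top-k of N state slots.

    h: (N, d) bank of recurrent state slots; x: (d,) current input.
    `router` (N, d) scores slots, `inp` (d, d) projects the input; both
    names are hypothetical. The real Factorization Memory update follows
    Mamba-2 and is not reproduced here.
    """
    scores = router @ x                         # (N,) relevance of each slot
    top = torch.topk(scores, k).indices         # slots selected for writing
    gate = torch.sigmoid(scores[top])[:, None]  # (k, 1) write strength
    h = h.clone()
    # Gated, EMA-style write to the selected slots only; every other slot
    # is carried over untouched, so per-step cost is independent of T.
    h[top] = gate * torch.tanh(x @ inp) + (1 - gate) * (decay * h[top])
    return h
```

Since N, d, and k are fixed, each step costs the same however long the sequence gets, which matches the constant computational and memory complexity the abstract claims for inference.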
Related papers
- Memory Caching: RNNs with Growing Memory
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states). We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
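A minimal sketch of the caching idea as summarized above, assuming a gated attention readout over checkpointed hidden states; all projection names are hypothetical, and the paper's four MC variants are only approximated in spirit:

```python
import torch

def gated_cache_readout(h_t, cache, W_q, W_k):
    """Toy gated aggregation over cached hidden-state checkpoints.

    cache: (C, d) snapshots of past hidden states; h_t: (d,) current state;
    W_q, W_k: (d, d) hypothetical projections.
    """
    q = W_q @ h_t                                        # query from current state
    weights = torch.softmax(cache @ (W_k.T @ q), dim=0)  # one weight per checkpoint
    context = weights @ cache                            # (d,) blended past memory
    gate = torch.sigmoid((h_t * context).sum())          # scalar mixing gate
    return gate * context + (1 - gate) * h_t
```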
arXiv Detail & Related papers (2026-02-27T18:53:41Z)
- HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling
HyMem is a hybrid memory architecture that enables dynamic on-demand scheduling through multi-granular memory representations. We show that HyMem achieves strong performance on both the LOCOMO and LongMemEval benchmarks, outperforming full-context baselines while reducing computational cost by 92.6%.
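Purely as an illustration of "dynamic on-demand scheduling through multi-granular memory representations", here is a toy two-pass retriever that scores coarse session summaries first and fetches fine-grained turns only for the top sessions; every name below is hypothetical, not HyMem's actual interface:

```python
def schedule_retrieval(query_emb, sessions, sim, budget=3, per_session=2):
    """Toy two-pass, on-demand retrieval over multi-granular memory.

    sessions: dicts with a coarse 'summary_emb' and fine 'turn_embs';
    sim: a similarity function. Fine-grained turns are scored only for
    the top-ranked sessions, which is the cost-saving idea.
    """
    ranked = sorted(sessions, key=lambda s: -sim(query_emb, s["summary_emb"]))
    hits = []
    for s in ranked[:budget]:    # coarse pass: pick the most relevant sessions
        hits += sorted(s["turn_embs"],
                       key=lambda t: -sim(query_emb, t))[:per_session]
    return hits                  # fine pass: best turns within those sessions
```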
arXiv Detail & Related papers (2026-02-15T00:06:19Z)
- Parallelizable memory recurrent units
We introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
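The parallelizability claim rests on the fact that linear recurrences h_t = a_t * h_{t-1} + b_t compose associatively, so they can be evaluated with a scan in O(log T) parallel sweeps instead of a sequential loop. A generic Hillis-Steele-style sketch (not the paper's implementation):

```python
import torch

def parallel_linear_scan(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t (with h_{-1} = 0) in O(log T) sweeps.

    Affine steps compose associatively:
    (a2, b2) o (a1, b1) = (a2 * a1, a2 * b1 + b2),
    which is why linear recurrences train in parallel while generic
    nonlinear RNN updates do not. a, b: (T, d) tensors.
    """
    h, coef = b.clone(), a.clone()
    step, T = 1, a.shape[0]
    while step < T:
        # fold in the prefix that ends `step` positions earlier
        h[step:] = h[step:] + coef[step:] * h[:-step]
        coef[step:] = coef[step:] * coef[:-step]
        step *= 2
    return h
```

A plain sequential loop over t produces the same h, which makes an easy sanity check.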
arXiv Detail & Related papers (2026-01-14T14:01:11Z)
- MemMamba: Rethinking Memory Patterns in State Space Model
We show that selective state-space models such as Mamba have high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks.
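The exponential-decay observation can be seen directly in a scalar linear recurrence h_t = a * h_{t-1} + x_t, where the contribution of the first token after t steps is exactly a**t; a quick demonstration:

```python
# In a linear recurrence h_t = a * h_{t-1} + x_t, token x_0 contributes
# exactly a**t to the state after t steps, so memory fades geometrically.
for a in (0.999, 0.99, 0.9):
    for t in (100, 1000, 10000):
        print(f"a={a}: weight of x_0 after {t:>5} steps = {a**t:.2e}")
```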
arXiv Detail & Related papers (2025-09-28T14:40:58Z)
- ATLAS: Learning to Optimally Memorize the Context at Test Time
ATLAS is a long-term memory module with high capacity that learns to memorize the context. We present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture.
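A bare-bones caricature of "learning to memorize the context at test time": fit a small memory (here just a linear map, a simplification) to reconstruct the current context's key-value pairs, then read from it. ATLAS's actual memory module, objective, and optimizer all differ.

```python
import torch

def memorize_context(keys, values, steps=20, lr=0.1):
    """Fit a linear memory M at test time so that M @ k_i ~ v_i over the
    current context, then answer queries via M @ q.

    keys: (n, d_k), values: (n, d_v).
    """
    M = torch.zeros(values.shape[1], keys.shape[1], requires_grad=True)
    opt = torch.optim.SGD([M], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((keys @ M.T - values) ** 2).mean()  # reconstruct the context
        loss.backward()
        opt.step()
    return M.detach()
```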
arXiv Detail & Related papers (2025-05-29T17:57:16Z)
- Quantifying Memory Utilization with Effective State-Size
We develop a measure of "memory utilization". This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
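As a toy stand-in for such a metric, one can look at the spectrum of the block of a causal linear operator that maps past inputs to future outputs; the entropy-based effective rank below is an illustrative proxy, not the paper's effective-state-size definition:

```python
import torch

def effective_rank(T_op, t):
    """Entropy-based effective rank of the block of a causal linear
    operator that maps inputs before time t to outputs from t onward.

    T_op: (L, L), assumes 0 < t < L.
    """
    block = T_op[t:, :t]                    # how the future sees the past
    s = torch.linalg.svdvals(block)
    p = s / s.sum()                         # normalized singular spectrum
    return torch.exp(-(p * torch.log(p + 1e-12)).sum())
```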
arXiv Detail & Related papers (2025-04-28T08:12:30Z)
- Memory Layers at Scale
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-expert models when matched for both compute and parameters. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, compared to base models with up to 8B parameters.
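A memory layer in this line of work is essentially a sparse top-k lookup into a large table of learned key-value pairs; a minimal sketch (product keys, which make very large N tractable, are omitted):

```python
import torch

def memory_layer(x, keys, values, k=4):
    """Sparse memory-layer read: score N learned keys, keep the top-k,
    and return a softmax-weighted sum of their values.

    x: (d,); keys, values: (N, d). Only k value rows are touched per
    token, which is what keeps very large memories cheap.
    """
    scores = keys @ x                        # (N,) similarity to every key
    top = torch.topk(scores, k)
    w = torch.softmax(top.values, dim=0)     # weights over selected slots
    return w @ values[top.indices]           # sparse read, shape (d,)
```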
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
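A toy split of the two memory types as the summary describes them, with an exact bounded cache for eidetic memory and an exponentially decaying state for fading memory; how B'MOJO actually composes them inside one module is not reproduced here:

```python
import torch

def hybrid_memory_step(state, cache, x, decay=0.95, cache_size=16):
    """One step of a toy eidetic-plus-fading memory.

    `state` is fading memory (an EMA whose old content decays away);
    `cache` is eidetic memory (exact recent inputs, bounded length).
    """
    state = decay * state + (1 - decay) * x   # fading: lossy, unbounded span
    cache.append(x.clone())                   # eidetic: lossless, short span
    if len(cache) > cache_size:
        cache.pop(0)                          # oldest exact memory falls off
    return state, cache
```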
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- Estimation of Energy-dissipation Lower-bounds for Neuromorphic Learning-in-memory
Ideal neuromorphic architectures rely on local but parallel parameter updates to solve problems that range from quadratic programming to Ising machines. The analysis presented in this paper captures the out-of-equilibrium thermodynamics of learning, and the resulting energy-efficiency estimates are model-agnostic. To show the practical applicability of our results, we apply our analysis to estimate lower bounds on energy-to-solution metrics for large-scale AI workloads.
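For background scale only (this is not the paper's analysis), the textbook floor for any irreversible one-bit operation is Landauer's k_B * T * ln 2:

```python
import math

# Background arithmetic: Landauer's principle lower-bounds any
# irreversible one-bit operation by k_B * T * ln 2 joules.
k_B, T = 1.380649e-23, 300.0                 # J/K and room temperature (K)
per_bit = k_B * T * math.log(2)              # ~2.87e-21 J per bit erased
n_updates = 1e15                             # hypothetical one-bit update count
print(f"{per_bit:.3e} J/bit -> {per_bit * n_updates:.3e} J for {n_updates:.0e} updates")
```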
arXiv Detail & Related papers (2024-02-21T21:02:11Z)
- Recurrent Action Transformer with Memory
This paper proposes a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention.
We conduct experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments.
The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments.
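A toy version of segment-level recurrence for a transformer agent, where memory embeddings are prepended to each segment and carried forward; all names are hypothetical, and RAT's actual retention mechanism is richer than this sketch:

```python
import torch

def run_with_memory(segments, transformer, mem):
    """Toy segment-level recurrence: memory embeddings are prepended to
    each segment and their transformed versions roll forward.

    segments: list of (L, d) tensors; mem: (M, d). `transformer` is any
    callable over (M + L, d) inputs.
    """
    outputs, M = [], mem.shape[0]
    for seg in segments:
        h = transformer(torch.cat([mem, seg], dim=0))
        mem = h[:M].detach()        # updated memory carries to next segment
        outputs.append(h[M:])       # per-token outputs for this segment
    return outputs, mem
```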
arXiv Detail & Related papers (2023-06-15T19:29:08Z)
- RWKV: Reinventing RNNs for the Transformer Era
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
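The core trick that gives RWKV RNN-style inference is rewriting attention as decayed running sums that can be updated in O(1) per token. A simplified sketch (real RWKV uses channel-wise decays plus a bonus term for the current token; only the core recurrence is kept):

```python
import math
import torch

def linear_attention_rnn(ks, vs, w=0.1):
    """Attention rewritten as two decayed running sums, so each inference
    step is O(1) in time and memory.

    ks: (T,) scalar keys; vs: (T, d) values.
    """
    decay = math.exp(-w)
    num = torch.zeros(vs.shape[1])   # decayed sum of exp(k_i) * v_i
    den = torch.zeros(())            # decayed sum of exp(k_i)
    outs = []
    for k, v in zip(ks, vs):
        num = decay * num + torch.exp(k) * v
        den = decay * den + torch.exp(k)
        outs.append(num / den)
    return torch.stack(outs)
```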
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
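A minimal sketch of a fixed-size slot memory in the Memformer spirit: tokens read from M slots and slots are slowly rewritten from the segment, so memory cost is O(M) regardless of sequence length. Projections and the paper's exact read/write rules are omitted:

```python
import torch

def memory_readwrite(mem, h, write_rate=0.1):
    """Toy fixed-size slot memory: tokens read the slots with attention,
    and slots are slowly rewritten from the segment.

    mem: (M, d) slots; h: (L, d) segment states.
    """
    read = torch.softmax(h @ mem.T, dim=-1) @ mem   # each token reads slots
    write = torch.softmax(mem @ h.T, dim=-1) @ h    # each slot summarizes h
    return h + read, (1 - write_rate) * mem + write_rate * write
```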
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- Memory Transformer
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance.
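The memory-token idea can be stated in one line: prepend trainable [mem] embeddings to the input so self-attention can use them as general-purpose storage; a sketch:

```python
import torch

def prepend_memory_tokens(token_emb, mem_tokens):
    """Concatenate trainable [mem] embeddings before the input tokens so
    self-attention can use them as general-purpose storage; no other
    architectural change is required.

    token_emb: (L, d) input embeddings; mem_tokens: (M, d) parameters.
    """
    return torch.cat([mem_tokens, token_emb], dim=0)  # (M + L, d)
```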
arXiv Detail & Related papers (2020-06-20T09:06:27Z)