Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- URL: http://arxiv.org/abs/2601.07372v1
- Date: Mon, 12 Jan 2026 09:54:49 GMT
- Title: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- Authors: Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, Wenfeng Liang
- Abstract summary: We introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. We scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
- Score: 42.816060150754645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
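The abstract's idea of an $N$-gram embedding with O(1) lookup and deterministic addressing can be illustrated with a toy sketch. This is not the Engram implementation: the class name `HashedNgramMemory`, the polynomial hash, the table size, the left-padding with -1, and the random (untrained) table are all assumptions made for illustration only.

```python
import random


class HashedNgramMemory:
    """Toy N-gram embedding table addressed by a deterministic hash (illustrative only)."""

    def __init__(self, num_slots, dim, n=2, seed=0):
        rng = random.Random(seed)
        self.num_slots = num_slots
        self.dim = dim
        self.n = n
        # Static table with one vector per slot; a real model would learn these values.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_slots)
        ]

    def slot(self, ngram):
        # Hypothetical polynomial hash over token IDs. It depends only on the
        # input tokens, never on hidden states, so it is fully deterministic.
        h = 0
        for tok in ngram:
            h = (h * 1000003 + tok + 1) % self.num_slots
        return h

    def addresses(self, token_ids):
        # One address per position, from the n tokens ending at that position.
        # Positions near the start are left-padded with -1 (an assumption).
        padded = [-1] * (self.n - 1) + list(token_ids)
        return [
            self.slot(tuple(padded[i:i + self.n])) for i in range(len(token_ids))
        ]

    def lookup(self, token_ids):
        # Each position costs one hash and one table read, independent of table size.
        return [self.table[a] for a in self.addresses(token_ids)]


memory = HashedNgramMemory(num_slots=1024, dim=4, n=2)
tokens = [5, 17, 5, 17]
addrs = memory.addresses(tokens)
vectors = memory.lookup(tokens)
```

Because `addresses` needs only the token IDs, all addresses for a sequence can be computed before any layer runs. In this sketch that is the property that would let a runtime fetch the required rows from slower memory ahead of time, in the spirit of the prefetching the abstract describes.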
Related papers
- ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory [21.4675019810992]
Concept-level memory consists of reusable, modular abstractions distilled from solution traces and stored in natural language. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales.
Date: 2025-09-04T17:54:19Z
- Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of "memory utilization". This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
Date: 2025-04-28T08:12:30Z
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-experts models when matched for both compute and parameters. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
Date: 2024-12-12T23:56:57Z
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping than via early exit.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
Date: 2024-10-26T00:44:11Z
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
Date: 2024-07-08T18:41:01Z
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
Date: 2024-04-16T06:34:08Z
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
Date: 2023-07-13T17:59:21Z
- Re2G: Retrieve, Rerank, Generate [14.848179433828252]
We propose Re2G, which combines neural initial retrieval and reranking into a BART-based sequence-to-sequence generation.
To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker, and generation using only ground truth on the target sequence output.
We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact-checking, and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard.
Date: 2022-07-13T15:51:40Z
- Memory-Guided Semantic Learning Network for Temporal Sentence Grounding [55.31041933103645]
We propose a memory-augmented network that learns and memorizes the rarely appearing content in temporal sentence grounding (TSG) tasks.
MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module.
Date: 2022-01-03T02:32:06Z
This list is automatically generated from the titles and abstracts of the papers in this site.