MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers
- URL: http://arxiv.org/abs/2602.00398v1
- Date: Fri, 30 Jan 2026 23:25:20 GMT
- Title: MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers
- Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Arnav Kundu, Mehrdad Farajtabar, Minsik Cho,
- Abstract summary: MemoryLLM aims to decouple feed-forward modules (FFNs) from self-attention, training them in isolation from self-attention directly using the token embeddings. A Flex-MemoryLLM variant bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
- Score: 22.540490024630316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention directly using the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage, additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioning it between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
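The central mechanism in the abstract, an FFN that sees only the context-free token embedding and can therefore be tabulated ahead of time as a token-wise lookup, can be illustrated with a short sketch. The module names, sizes, and single-layer setting below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a context-free, precomputable FFN ("token-wise lookup"),
# assuming the FFN was trained on raw token embeddings rather than on
# contextual hidden states. Names and shapes are illustrative only.
import torch
import torch.nn as nn

class ContextFreeFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

vocab_size, d_model, d_hidden = 32_000, 512, 2048
tok_emb = nn.Embedding(vocab_size, d_model)
ffn = ContextFreeFFN(d_model, d_hidden)

# Because the FFN sees only the (context-free) token embedding, its output
# depends solely on the token id, so it can be tabulated once, offline.
with torch.no_grad():
    lookup_table = ffn(tok_emb.weight)           # [vocab_size, d_model]

# At inference the per-layer FFN call becomes an embedding-style lookup; the
# table can live on storage/CPU and be paged into VRAM on demand.
token_ids = torch.tensor([[42, 7, 1001]])
ffn_out_via_table = lookup_table[token_ids]      # [1, 3, d_model]
ffn_out_direct = ffn(tok_emb(token_ids))
assert torch.allclose(ffn_out_via_table, ffn_out_direct, atol=1e-6)
```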
Related papers
- Flash Multi-Head Feed-Forward Network [51.82159978122374]
Multi-Head FFN (MH-FFN) is motivated by the structural similarity between single-head attention and the FFN.
MH-FFN faces two challenges: memory consumption that scales with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension.
We propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel that computes outputs online, akin to FlashAttention, and a design using dynamically weighted parallel sub-networks.
arXiv Detail & Related papers (2025-12-07T20:50:20Z)
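As a rough illustration of what the summary describes, the sketch below renders a multi-head FFN whose parallel sub-networks are combined with input-dependent weights. It is a plain PyTorch caricature under assumed shapes; the I/O-aware fused kernel that defines FlashMHF is not reproduced here.

```python
# Illustrative multi-head FFN: the hidden dimension is split across parallel
# sub-networks ("heads"), whose outputs are combined with per-token dynamic
# weights. A naive rendering of the idea only -- the actual FlashMHF kernel
# fuses this computation in an I/O-aware way, similar in spirit to FlashAttention.
import torch
import torch.nn as nn

class MultiHeadFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_heads: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden // n_heads),
                          nn.GELU(),
                          nn.Linear(d_hidden // n_heads, d_model))
            for _ in range(n_heads)
        )
        self.gate = nn.Linear(d_model, n_heads)   # dynamic per-token head weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)            # [..., n_heads]
        outs = torch.stack([h(x) for h in self.heads], dim=-1)   # [..., d_model, n_heads]
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)

x = torch.randn(2, 16, 512)
print(MultiHeadFFN(512, 2048, 8)(x).shape)   # torch.Size([2, 16, 512])
```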
- Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks.
Mixture-of-Channels (MoC) substantially reduces activation memory during pre-training.
MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
arXiv Detail & Related papers (2025-11-12T13:30:57Z)
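The sparse-FFN idea the summary points at can be sketched generically: each token activates only a few intermediate channels, so only those activations are computed and stored. The top-k gate-magnitude criterion below is an assumption for illustration, not necessarily MoC's exact selection rule.

```python
# Generic sketch of a channel-sparse FFN step: per token, only the top-k
# intermediate channels (chosen here by gate magnitude -- an assumption, not
# necessarily MoC's rule) are computed, activated, and projected back down.
import torch

def sparse_channel_ffn(x, w_gate, w_up, w_down, k):
    gate = torch.nn.functional.silu(x @ w_gate.T)        # [T, d_hidden] gate scores
    topk_val, topk_idx = gate.topk(k, dim=-1)            # keep k channels per token
    # Compute only the selected rows of the up-projection for each token...
    up_sel = (w_up[topk_idx] * x.unsqueeze(1)).sum(-1)   # [T, k]
    act = topk_val * up_sel
    # ...and read back only the matching columns of the down-projection.
    return torch.einsum('dtk,tk->td', w_down[:, topk_idx], act)  # [T, d_model]

d_model, d_hidden, T, k = 256, 1024, 8, 128
x = torch.randn(T, d_model)
w_gate = torch.randn(d_hidden, d_model)
w_up = torch.randn(d_hidden, d_model)
w_down = torch.randn(d_model, d_hidden)
print(sparse_channel_ffn(x, w_gate, w_up, w_down, k).shape)   # torch.Size([8, 256])
```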
- Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning [53.72709564555407]
Memo is a transformer-based architecture and training recipe for reinforcement learning.
It incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training.
We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings.
arXiv Detail & Related papers (2025-10-22T16:24:47Z)
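The interleaving mechanism mentioned above is simple to picture; the toy function below inserts a reserved summarization token into an input stream at a fixed period. The token id and the period are made-up placeholders, not Memo's configuration.

```python
# Toy sketch of interleaving periodic summarization tokens with an input
# stream, the mechanism the summary describes. The special id and the period
# are illustrative placeholders only.
SUMMARY_TOKEN_ID = -1   # hypothetical id reserved for "write a memory summary here"

def interleave_summary_tokens(token_ids, period=8):
    out = []
    for i, tok in enumerate(token_ids, start=1):
        out.append(tok)
        if i % period == 0:               # every `period` observations, prompt the
            out.append(SUMMARY_TOKEN_ID)  # model to emit/update its compressed memory
    return out

stream = list(range(20))
print(interleave_summary_tokens(stream, period=8))
# [0..7, -1, 8..15, -1, 16, 17, 18, 19]
```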
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding.
Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage.
We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
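Dynamic cache eviction in general can be sketched as below: cached entries are scored and only the highest-scoring ones are retained under a memory budget. The cumulative-attention score used here is a common heuristic and an assumption, not Sparse-dLLM's exact criterion.

```python
# Generic illustration of dynamic KV-cache eviction: keep only the cached
# entries that receive the most attention mass, evict the rest. The scoring
# rule is an assumption, not necessarily Sparse-dLLM's exact criterion.
import torch

def evict_kv(keys, values, attn_weights, budget):
    # attn_weights: [num_queries, cache_len]; score each cached entry by the
    # total attention it has received, then keep the top `budget` entries.
    scores = attn_weights.sum(dim=0)                   # [cache_len]
    keep = scores.topk(budget).indices.sort().values   # preserve original order
    return keys[keep], values[keep]

cache_len, d = 1024, 64
keys, values = torch.randn(cache_len, d), torch.randn(cache_len, d)
attn = torch.softmax(torch.randn(16, cache_len), dim=-1)
k2, v2 = evict_kv(keys, values, attn, budget=256)
print(k2.shape, v2.shape)   # torch.Size([256, 64]) torch.Size([256, 64])
```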
- MemOS: A Memory OS for AI System [116.87568350346537]
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI).
Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.
MemOS is a memory operating system that treats memory as a manageable system resource.
arXiv Detail & Related papers (2025-07-04T17:21:46Z)
- EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices [3.739419555718102]
Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices.
We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs.
arXiv Detail & Related papers (2025-03-28T07:26:37Z)
- FFNet: MetaMixer-based Efficient Convolutional Mixer Design [6.8410780175245165]
We present a family of Fast-Forward Networks (FFNet).
Despite being composed of only simple operators, FFNet outperforms sophisticated and highly specialized methods in each domain.
We propose MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework.
arXiv Detail & Related papers (2024-06-04T07:00:14Z)
- LookupFFN: Making Transformers Compute-lite for CPU inference [23.61144705380663]
GPU clusters are the de facto choice for training large deep neural network (DNN) models today.
Several reasons, including ease of workflow, security, and cost, have led to efforts investigating whether CPUs may be viable for inference in routine use across many sectors of industry.
We study a module that is a workhorse within modern architectures, the GEMM-based Feed-Forward Network (FFN), and assess the extent to which it can be made compute- (or FLOP-) lite.
arXiv Detail & Related papers (2024-03-12T00:26:16Z)
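The "compute-lite" goal in the summary amounts to trading multiply-accumulates for memory lookups. The sketch below is a caricature of that trade: the input is hashed to a few table rows that are summed in place of a dense GEMM. The sign-based random-projection hash and all sizes are illustrative stand-ins, not LookupFFN's actual construction.

```python
# Caricature of a compute-lite, lookup-based FFN layer: instead of a dense
# GEMM, the input is hashed into a handful of bucket indices and the output
# is the sum of pre-stored table rows. The sign-based random-projection hash
# below is an illustrative stand-in, not LookupFFN's learned hashing.
import torch

d_model, n_tables, bits = 256, 16, 8   # 16 tables, 2^8 rows each (assumed sizes)
proj = torch.randn(n_tables, bits, d_model)          # hashing projections
tables = torch.randn(n_tables, 2 ** bits, d_model)   # learned lookup tables
powers = 2 ** torch.arange(bits)

def lookup_ffn(x):                      # x: [T, d_model]
    # Hash: sign pattern of `bits` projections -> one integer row index per table.
    signs = (torch.einsum('hbd,td->thb', proj, x) > 0).long()    # [T, n_tables, bits]
    idx = (signs * powers).sum(-1)                               # [T, n_tables]
    # Lookup-and-sum replaces the matrix multiply.
    rows = tables[torch.arange(n_tables), idx]                   # [T, n_tables, d_model]
    return rows.sum(dim=1)                                       # [T, d_model]

print(lookup_ffn(torch.randn(4, d_model)).shape)   # torch.Size([4, 256])
```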
- One Wide Feedforward is All You Need [3.043080042012617]
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN).
In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant.
We are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder.
arXiv Detail & Related papers (2023-09-04T21:30:21Z)
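The sharing scheme just described is easy to render in code: every encoder layer points at one and the same FFN module, while decoder layers would simply omit theirs. The sketch below uses illustrative dimensions and a vanilla post-norm layer, not the paper's exact setup.

```python
# Minimal sketch of the sharing scheme described above: one (wider) FFN module
# is reused by every encoder layer. Dimensions and layer counts are illustrative.
import torch
import torch.nn as nn

d_model, n_layers = 512, 6
shared_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))

class EncoderLayer(nn.Module):
    def __init__(self, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ffn = ffn                  # same object in every layer -> shared weights
        self.n1, self.n2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.n1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.n2(x + self.ffn(x))

encoder = nn.ModuleList(EncoderLayer(shared_ffn) for _ in range(n_layers))
ffn_params = sum(p.numel() for p in shared_ffn.parameters())   # counted once, not 6x
x = torch.randn(2, 10, d_model)
for layer in encoder:
    x = layer(x)
print(x.shape, ffn_params)
```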
- Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method, Avg-K, that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
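One way to read the Avg-K rule above is sketched below: each memory block (expert) is scored by the dot product between the token's hidden state and the mean of that block's key vectors, and only the top-k blocks contribute to the FFN output. This reading, and all shapes, are assumptions for illustration rather than the paper's exact recipe.

```python
# Sketch of block-wise (expert-wise) FFN selection in the Avg-K spirit, under
# the assumption that each memory block is scored by the dot product between
# the token's hidden state and the mean of that block's key vectors.
import torch

d_model, n_blocks, block_size, top_k = 256, 32, 128, 4
W_keys = torch.randn(n_blocks, block_size, d_model)   # first FFN layer, split into blocks
W_vals = torch.randn(n_blocks, block_size, d_model)   # second FFN layer, same split

def sffn_avg_k(x):                                     # x: [T, d_model]
    block_means = W_keys.mean(dim=1)                   # [n_blocks, d_model]
    scores = x @ block_means.T                         # [T, n_blocks]
    chosen = scores.topk(top_k, dim=-1).indices        # [T, top_k]
    k_sel, v_sel = W_keys[chosen], W_vals[chosen]      # [T, top_k, block_size, d_model]
    act = torch.relu(torch.einsum('td,tkbd->tkb', x, k_sel))
    return torch.einsum('tkb,tkbd->td', act, v_sel)    # [T, d_model]

print(sffn_avg_k(torch.randn(8, d_model)).shape)       # torch.Size([8, 256])
```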
- Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an In-network Feature Flow estimation module for video object detection.
arXiv Detail & Related papers (2020-09-21T07:55:50Z)