Related papers: MIDUS: Memory-Infused Depth Up-Scaling

MIDUS: Memory-Infused Depth Up-Scaling

URL: http://arxiv.org/abs/2512.13751v1
Date: Mon, 15 Dec 2025 05:50:45 GMT
Title: MIDUS: Memory-Infused Depth Up-Scaling
Authors: Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song,
Abstract summary: Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT)<n>We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory layer.<n>Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling.
Score: 20.802982614533615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory (HML) layer. Motivated by observations that attention heads have distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval and injecting information into subsequent layers while preserving head-wise functional structure. This design combines sparse memory access with head-wise representations and incorporates an efficient per-head value factorization module, thereby relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS exhibits robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling by leveraging its head-wise memory design.

Related papers

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.<n> MM-Mem memory structures hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic.<n>Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z)
Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states)<n>We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules.<n>The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z)
MemFly: On-the-Fly Memory Optimization via Information Bottleneck [35.420309099411874]
Long-term memory enables large language model agents to tackle complex tasks through historical interactions.<n>Existing frameworks encounter a dilemma between compressing redundant information efficiently and maintaining precise retrieval for downstream tasks.<n>MemFly is a framework grounded in information bottleneck principles that facilitates on-the-fly memory evolution for LLMs.<n>MemFly substantially outperforms state-of-the-art baselines in memory coherence, response fidelity, and accuracy.
arXiv Detail & Related papers (2026-02-08T09:37:25Z)
Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction.<n>MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence.<n>This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z)
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents [57.38404718635204]
Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows.<n>Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components.<n>We propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy.
arXiv Detail & Related papers (2026-01-05T08:24:16Z)
Flash Multi-Head Feed-Forward Network [51.82159978122374]
Multi-Head FFN (MH-FFN) is motivated by the structural similarity between single-head attention and FFN.<n>MH-FFN faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension.<n>We propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in akin to FlashAttention, and a design using dynamically weighted parallel sub-networks.
arXiv Detail & Related papers (2025-12-07T20:50:20Z)
Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks.<n>MoC substantially reduces activation memory during pre-training.<n>MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
arXiv Detail & Related papers (2025-11-12T13:30:57Z)
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension [55.29309306566238]
Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents.<n>This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents.<n>We draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation.
arXiv Detail & Related papers (2025-10-07T02:16:30Z)
SEDM: Scalable Self-Evolving Distributed Memory for Agents [23.182291416527764]
SEDM is a verifiable and adaptive framework that transforms memory from a passive repository into an active, self-optimizing component.<n>We show that SEDM improves reasoning accuracy while reducing token overhead compared with strong memory baselines.<n>Results highlight SEDM as a scalable and sustainable memory mechanism for open-ended multi-agent collaboration.
arXiv Detail & Related papers (2025-09-11T14:37:37Z)
SAS: Simulated Attention Score [75.1409882298863]
We introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head.<n>Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method.
arXiv Detail & Related papers (2025-07-10T12:16:16Z)
Dynamic Memory-enhanced Transformer for Hyperspectral Image Classification [3.5093938502961763]
Hyperspectral image (HSI) classification remains a challenging task due to the intricate spatial-spectral correlations.<n>Existing transformer models excel in capturing long-range dependencies but often suffer from information redundancy and attention inefficiencies.<n>MemFormer introduces a memory-enhanced multi-head attention mechanism that iteratively refines a dynamic memory module.<n>A dynamic memory enrichment strategy progressively captures complex spatial and spectral dependencies, leading to more expressive feature representations.
arXiv Detail & Related papers (2025-04-17T17:43:34Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
FedMef: Towards Memory-efficient Federated Dynamic Pruning [42.07105095641134]
Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. Its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. We propose FedMef, a novel and memory-efficient federated dynamic pruning framework.
arXiv Detail & Related papers (2024-03-21T13:54:36Z)
Contractive error feedback for gradient compression [60.05809370598166]
We propose a communication efficient method called contractive error feedback (ConEF) As opposed to SGD with error-feedback (EFSGD) that inefficiently manages memory, ConEF obtains the sweet spot of convergence and memory usage. We empirically validate ConEF on various learning tasks that include image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2023-12-13T21:54:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.