Related papers: Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

URL: http://arxiv.org/abs/2602.05085v1
Date: Wed, 04 Feb 2026 22:09:40 GMT
Title: Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories
Authors: Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, Dong Yu
Abstract summary: Locas is a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers. We show that proper initialization of such low-rank sideway-FFN-style memories is essential for fast convergence, improved generalization, and catastrophic forgetting prevention.
Score: 44.46300411842271
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.
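The abstract's claim that a sideway FFN-style memory can be "merged into model parameters" can be illustrated with a small sketch. It is not the authors' implementation; it only shows a general property of GLU-FFN blocks: a narrow side block running in parallel with the base block gives the same output as one wider block whose hidden units are the concatenation of both. All names here (`glu_ffn`, `merge`, `d_model`, `d_hidden`, `r`) and the SiLU gate are assumptions made for illustration, and the random weights are placeholders that do not reproduce the paper's initialization from model parameters, activations, or gradients.

```python
import math
import random


def silu(z):
    # SiLU gate activation; assumed here, the abstract does not name one.
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def glu_ffn(x, gate, up, down):
    # gate, up: (hidden x d_model) matrices; down: (d_model x hidden).
    g = matvec(gate, x)
    u = matvec(up, x)
    h = [silu(gi) * ui for gi, ui in zip(g, u)]
    return matvec(down, h)


def rand_matrix(rows, cols, rng, scale=0.1):
    # Placeholder random weights, NOT the paper's principled initialization.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def merge(base, mem):
    # Concatenate hidden units: extra rows for gate/up, extra columns for down.
    return {
        "gate": base["gate"] + mem["gate"],
        "up": base["up"] + mem["up"],
        "down": [b + m for b, m in zip(base["down"], mem["down"])],
    }


rng = random.Random(0)
d_model, d_hidden, r = 8, 32, 2  # hypothetical sizes; r is the memory width

base = {
    "gate": rand_matrix(d_hidden, d_model, rng),
    "up": rand_matrix(d_hidden, d_model, rng),
    "down": rand_matrix(d_model, d_hidden, rng),
}
mem = {
    "gate": rand_matrix(r, d_model, rng),
    "up": rand_matrix(r, d_model, rng),
    "down": rand_matrix(d_model, r, rng),
}
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

# Side-by-side: base FFN output plus the memory's output.
side = [a + b for a, b in zip(glu_ffn(x, **base), glu_ffn(x, **mem))]

# Merged: one FFN with d_hidden + r hidden units.
merged_params = merge(base, mem)
merged = glu_ffn(x, **merged_params)

max_diff = max(abs(a - b) for a, b in zip(side, merged))
print("max abs difference:", max_diff)
```

Under these assumptions the two outputs agree up to floating-point rounding. The memory adds `3 * r * d_model` weights on top of the base block's `3 * d_hidden * d_model`, which is why a small `r` keeps the added parameter count low.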

Related papers

MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models [0.9222161299777548]
We introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks the rigid structural coupling of model capacity to computational cost. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation.
arXiv Detail & Related papers (2026-01-30T12:07:23Z)
Hyperparameter Transfer with Mixture-of-Expert Layers [51.03005470884366]
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks. We propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size.
arXiv Detail & Related papers (2026-01-28T03:02:30Z)
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks [17.067788440109137]
Mixture-of-Experts (MoE) models are now standard in state-of-the-art systems. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills.
arXiv Detail & Related papers (2025-08-26T04:31:28Z)
LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
Self-Updatable Large Language Models by Integrating Context into Model Parameters [21.742149718161716]
Small-scale experiences, such as interactions with surrounding objects, require frequent integration in large language models. Current methods embed experiences within model parameters using continual learning, model editing, or knowledge distillation techniques. We propose SELF-PARAM, which embeds experiences directly into model parameters and ensures near-optimal efficacy and long-term retention.
arXiv Detail & Related papers (2024-10-01T08:18:17Z)
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction. SMILE allows for the upscaling of source models into an MoE model without extra data or further training. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective [106.92016199403042]
We empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. We employ sensitivity-based techniques to extract and align knowledge-specific parameters between different large language models. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer.
arXiv Detail & Related papers (2023-10-17T17:58:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
