Constructing Efficient Fact-Storing MLPs for Transformers
- URL: http://arxiv.org/abs/2512.00207v1
- Date: Fri, 28 Nov 2025 21:18:35 GMT
- Title: Constructing Efficient Fact-Storing MLPs for Transformers
- Authors: Owen Dugan, Roberto Garcia, Ronny Junkins, Jerry Liu, Dylan Zinsley, Sabri Eyuboglu, Atri Rudra, Chris Ré
- Abstract summary: We build explicit weight constructions for fact-storing MLPs in large language models. We demonstrate a proof-of-concept application of fact-storing MLPs: modular fact editing on one-layer Transformers by replacing entire MLPs at once.
- Score: 9.371973249870207
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The success of large language models (LLMs) can be attributed in part to their ability to efficiently store factual knowledge as key-value mappings within their MLP parameters. Recent work has proposed explicit weight constructions to build such fact-storing MLPs, providing an improved understanding of LLM fact storage mechanisms. In this paper, we introduce an MLP construction framework that improves over previous constructions in three areas: it 1) works for all but a measure-zero set of feasible input-output pairs, 2) achieves asymptotically optimal parameter efficiency matching information-theoretic bounds for some embeddings, and 3) maintains usability within Transformers for factual recall. Through our improvements, we 1) discover a metric on value embeddings that characterizes facts-per-parameter scaling for both constructed and gradient-descent-trained MLPs, 2) identify a simple encoder-decoder mechanism that empirically matches gradient-descent MLP facts-per-parameter asymptotics across all the inputs and outputs we test, and 3) uncover a fundamental tradeoff between an MLP's fact-storage capacity and its usability within Transformers. Finally, we demonstrate a proof-of-concept application of fact-storing MLPs: modular fact editing on one-layer Transformers by replacing entire MLPs at once.
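As a rough illustration of the key-value view of MLP fact storage described above, the sketch below writes down the weights of a two-layer ReLU MLP directly, with no training. It uses one hidden unit per fact, which is far less parameter-efficient than the constructions studied in the paper; the dimensions, the threshold `tau`, and the random embeddings are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not the paper's): N facts, each a (key, value) pair of
# d-dimensional embeddings. Random unit-norm keys are nearly orthogonal when d >> log N.
N, d = 200, 512
keys = rng.standard_normal((N, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.standard_normal((N, d))

# Explicit (gradient-free) weight construction with one hidden unit per fact:
# hidden unit i fires only when the input aligns with key i.
tau = 0.5                      # assumed activation margin
W1 = keys                      # (N, d): first-layer rows are the keys
b1 = -tau * np.ones(N)         # bias suppresses non-matching keys
W2 = values.T / (1.0 - tau)    # (d, N): second layer reads out the stored value

def mlp(x):
    """Two-layer ReLU MLP acting as a key -> value store."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h

# Querying with a stored key recovers its value up to small interference.
i = 17
err = np.linalg.norm(mlp(keys[i]) - values[i]) / np.linalg.norm(values[i])
print(f"relative recall error for fact {i}: {err:.3f}")
```

One-unit-per-fact storage scales poorly (roughly 2d parameters per fact); the paper's point is precisely that much better facts-per-parameter scaling is achievable, so this sketch should be read only as the baseline picture of key-value storage in MLP weights.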
Related papers
- Equivalence of Context and Parameter Updates in Modern Transformer Blocks [8.364690240329411]
Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its weights. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches. We then generalize this result, providing a constructive proof and algorithm for multi-layer models.
arXiv Detail & Related papers (2025-11-22T01:17:15Z) - PraxiMLP: A Threshold-based Framework for Efficient Three-Party MLP with Practical Security [3.0489147795290683]
PraxiMLP is a highly efficient three-party framework for privacy-preserving machine learning (PPML). PraxiMLP operates entirely within the arithmetic domain, thus avoiding expensive cross-domain conversions. By supporting floating-point numbers, PraxiMLP precisely handles non-linear functions, dramatically improving both efficiency and precision.
arXiv Detail & Related papers (2025-11-08T18:56:26Z) - Understanding Factual Recall in Transformers via Associative Memories [55.93756571457904]
We show that shallow transformers can use a combination of associative memories to obtain near-optimal storage capacity. We show that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on a factual recall task.
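A minimal sketch of the linear associative-memory idea mentioned above, under assumed random near-orthogonal embeddings (not the paper's construction): facts are superposed as a sum of outer products, and querying with a stored key approximately reads out the associated value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: N facts with d-dimensional random key/value embeddings.
N, d = 100, 1024
keys = rng.standard_normal((N, d)) / np.sqrt(d)   # near-orthonormal random keys
values = rng.standard_normal((N, d))

# Superpose all facts in one matrix: W = sum_i v_i k_i^T.
W = values.T @ keys            # (d, d)

# Reading out fact j: W @ k_j = v_j + cross-talk from the other stored facts.
j = 3
readout = W @ keys[j]
cos = readout @ values[j] / (np.linalg.norm(readout) * np.linalg.norm(values[j]))
print(f"cosine similarity between readout and stored value: {cos:.3f}")
```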
arXiv Detail & Related papers (2024-12-09T14:48:14Z) - MLPs Learn In-Context on Regression and Classification Tasks [28.13046236900491]
In-context learning (ICL) is often assumed to be a unique hallmark of Transformer models. We demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. The results highlight the unexpected competence of MLPs in a synthetic setting.
arXiv Detail & Related papers (2024-05-24T15:04:36Z) - Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models [68.83330172211315]
We study the mechanisms employed by Transformer-based large language models (LLMs) for factual recall tasks.
We propose a novel analytic method aimed at decomposing the MLP outputs into components understandable by humans.
Leveraging our interpretation, we mitigate the model's suppression of correct predictions and improve factual recall confidence.
arXiv Detail & Related papers (2024-03-28T15:54:59Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that, in machine learning models with Transformer architectures, the activation maps are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
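A small sketch of why activation sparsity translates into FLOP savings in an MLP block: when only a few hidden units are nonzero after the nonlinearity, the second matrix multiply only needs the corresponding columns. The fixed threshold used here to induce sparsity is an artificial stand-in for the sparsity the paper reports emerging in trained Transformers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy MLP block dimensions (assumed for illustration).
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_ff)
x = rng.standard_normal(d_model)

# Subtracting a threshold before the ReLU mimics the sparse activations
# observed in trained models (here roughly 2% of units fire).
h = np.maximum(W1 @ x - 2.0, 0.0)
active = np.flatnonzero(h)                  # indices of nonzero activations

dense_out = W2 @ h                          # full matmul: d_model * d_ff MACs
sparse_out = W2[:, active] @ h[active]      # only d_model * len(active) MACs
print(f"{len(active)}/{d_ff} units active; "
      f"max abs difference: {np.abs(dense_out - sparse_out).max():.2e}")
```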
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - AS-MLP: An Axial Shifted MLP Architecture for Vision [50.11765148947432]
An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper.
By axially shifting channels of the feature map, AS-MLP obtains information flow from different directions.
With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset.
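A simplified sketch of the axial-shift idea (an assumption-laden illustration, not the official AS-MLP implementation): channel groups of the feature map are rolled by different offsets along one spatial axis, so a subsequent channel-mixing MLP sees features from neighboring positions along that axis.

```python
import torch

def axial_shift(x: torch.Tensor, shift: int = 1, dim: int = 2) -> torch.Tensor:
    """Roll channel groups of a (B, C, H, W) feature map along one spatial axis.

    Splitting the channels into 2*shift+1 groups and rolling each group by a
    different offset lets a following per-position MLP mix information from
    neighboring rows (dim=2) or columns (dim=3).
    """
    groups = x.chunk(2 * shift + 1, dim=1)
    shifted = [torch.roll(g, shifts=offset, dims=dim)
               for g, offset in zip(groups, range(-shift, shift + 1))]
    return torch.cat(shifted, dim=1)

x = torch.randn(1, 12, 8, 8)
vertical = axial_shift(x, shift=1, dim=2)     # shift along height
horizontal = axial_shift(x, shift=1, dim=3)   # shift along width
print(vertical.shape, horizontal.shape)       # both (1, 12, 8, 8)
```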
arXiv Detail & Related papers (2021-07-18T08:56:34Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)