Equivalence of Context and Parameter Updates in Modern Transformer Blocks
- URL: http://arxiv.org/abs/2511.17864v1
- Date: Sat, 22 Nov 2025 01:17:15 GMT
- Title: Equivalence of Context and Parameter Updates in Modern Transformer Blocks
- Authors: Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin
- Abstract summary: Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its weights. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches. We then generalize this result, providing a constructive proof and algorithm for multi-layer models.
- Score: 8.364690240329411
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.
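To make the rank-1 claim concrete, below is a minimal numpy sketch of the identity this line of work relies on: if attention produces activation a_ctx for the query token when the context is present and a_noctx when it is absent, a single rank-1 update to the first MLP weight matrix makes the context-free forward pass reproduce the context-full one. This is a self-contained illustration of the algebra, not the paper's algorithm; the variable names and attention stand-ins are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

W = rng.normal(size=(d_ff, d_model))       # first MLP weight matrix

# Stand-ins for the attention output on the query token:
a_ctx = rng.normal(size=d_model)           # with the context prepended
a_noctx = rng.normal(size=d_model)         # without the context

# Rank-1 patch: (W @ (a_ctx - a_noctx)) outer (a_noctx / ||a_noctx||^2)
delta = np.outer(W @ (a_ctx - a_noctx), a_noctx) / np.dot(a_noctx, a_noctx)

# The patched weights acting on the context-free activation reproduce
# the original weights acting on the context-full activation.
lhs = (W + delta) @ a_noctx
rhs = W @ a_ctx
assert np.allclose(lhs, rhs)
print("max abs diff:", np.abs(lhs - rhs).max())
```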
Related papers
- Improving Recursive Transformers with Mixture of LoRAs [2.672804414228544]
Mixture of LoRAs (MoL) inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters. ModernALBERT achieves state-of-the-art performance among compact models.
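A minimal sketch of the MoL idea as summarized above, assuming a softmax router and per-token mixing of LoRA experts over a shared up-projection; module names, shapes, and the gating rule are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLFFN(nn.Module):
    """Hypothetical sketch: a shared FFN whose up-projection is modulated
    by token-routed LoRA experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4, rank=8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)       # shared (tied) backbone
        self.down = nn.Linear(d_ff, d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_ff))

    def forward(self, x):                        # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)            # (T, E)
        # Per-expert LoRA path x @ A_e @ B_e, mixed by the token's gate.
        lora = torch.einsum('td,edr,erf->tef', x, self.A, self.B)
        h = self.up(x) + torch.einsum('te,tef->tf', gates, lora)
        return self.down(F.gelu(h))

x = torch.randn(5, 64)
print(MoLFFN()(x).shape)    # torch.Size([5, 64])
```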
arXiv Detail & Related papers (2025-12-14T23:39:30Z)
- MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts [0.0]
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks. This paper introduces MLPMoE (MLP-Experts), a training-free transformation that restructures the dense MLPs in transformer blocks into a static, high-cardinality mixture of experts.
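A toy illustration of the training-free restructuring described above: the dense FFN's hidden neurons are partitioned into static expert slices and only a few slices are evaluated per token. The contiguous split and the activation-mass router below are simplifying assumptions for the sketch; the actual method's partitioning and routing will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, f, n_exp = 32, 128, 4                  # model dim, FFN dim, experts
W1, W2 = rng.normal(size=(f, d)), rng.normal(size=(d, f))

# Zero-shot split: partition the hidden neurons into contiguous expert slices
# (real upcycling methods cluster neurons; contiguous slices keep this short).
slices = np.split(np.arange(f), n_exp)

def moe_forward(x, top_k=2):
    h = np.maximum(W1 @ x, 0.0)           # dense ReLU activations
    # Route by per-expert activation mass, keep only the top_k experts.
    mass = np.array([np.abs(h[s]).sum() for s in slices])
    keep = np.argsort(mass)[-top_k:]
    out = np.zeros(d)
    for e in keep:
        s = slices[e]
        out += W2[:, s] @ h[s]            # only selected slices contribute
    return out

x = rng.normal(size=d)
print(moe_forward(x).shape)               # (32,)
```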
arXiv Detail & Related papers (2025-11-26T06:14:26Z)
- FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers [30.88764351013966]
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. We propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance.
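As a hedged illustration of the first step such methods need, the snippet below scores block redundancy by how close each block is to an identity map (high input/output cosine similarity). This is a common block-importance heuristic, not necessarily FuseGPT's criterion, and the fusion/recovery stage is not shown.

```python
import torch
import torch.nn as nn

# Illustrative redundancy score: blocks whose output stays close to their
# input (high cosine similarity) are candidates for pruning/recycling.
def block_importance(blocks, x):
    scores = []
    for blk in blocks:
        y = blk(x)
        cos = torch.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(1.0 - cos.item())   # low score = nearly an identity map
        x = y
    return scores

blocks = nn.ModuleList(nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4))
x = torch.randn(8, 16)
print(block_importance(blocks, x))
```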
arXiv Detail & Related papers (2024-11-21T09:49:28Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
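For readers wanting to reproduce the data distribution, a minimal generator for n-gram Markov chain sequences (a fresh random chain per sequence, so next-token prediction must be solved in context) might look like the following; all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ngram_markov(vocab=4, n=2, length=64):
    """Sample a sequence from a random n-gram Markov chain: the next token's
    distribution depends on the previous n tokens (a fresh chain per sequence,
    which is what makes next-token prediction an in-context task)."""
    trans = rng.dirichlet(np.ones(vocab), size=(vocab,) * n)  # P(x_t | last n)
    seq = list(rng.integers(vocab, size=n))
    for _ in range(length - n):
        ctx = tuple(seq[-n:])
        seq.append(rng.choice(vocab, p=trans[ctx]))
    return np.array(seq)

print(sample_ngram_markov()[:16])
```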
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
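A stripped-down sketch of the hidden-state-alignment flavor of such distillation: freeze a teacher block and train a student block to match its outputs on shared inputs. Linear stand-ins replace the real Transformer/Mamba-2 mixers, and MOHAWK's other stages (matrix orientation, end-to-end distillation) are omitted.

```python
import torch
import torch.nn as nn

# Illustrative alignment stage: make each student block match the hidden
# states of the corresponding teacher block (teacher frozen).
teacher_blk = nn.Linear(32, 32).eval().requires_grad_(False)  # stand-in block
student_blk = nn.Linear(32, 32)                               # e.g. an SSM mixer
opt = torch.optim.Adam(student_blk.parameters(), lr=1e-3)

for _ in range(200):
    h = torch.randn(16, 32)               # shared block input
    loss = nn.functional.mse_loss(student_blk(h), teacher_blk(h))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"alignment loss: {loss.item():.4f}")
```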
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- PIDformer: Transformer Meets Control Theory [28.10913642120948]
We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions.
We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity.
Motivated by this control framework, we derive a novel class of transformers, the PID-controlled Transformer (PIDformer).
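For reference, the discrete PID rule the paper builds on, applied to a toy state tracking a reference signal. The gains and the loop are illustrative; PIDformer embeds this feedback inside attention layers, which the toy below does not reproduce.

```python
import numpy as np

# Textbook discrete PID update: u = Kp*e + Ki*sum(e) + Kd*de/dt.
def pid_step(err, integ, prev_err, kp=0.8, ki=0.1, kd=0.2, dt=1.0):
    integ = integ + err * dt
    deriv = (err - prev_err) / dt
    return kp * err + ki * integ + kd * deriv, integ

state, ref = np.zeros(4), np.ones(4)      # steer state toward the reference
integ, prev = np.zeros(4), np.zeros(4)
for _ in range(30):
    err = ref - state
    u, integ = pid_step(err, integ, prev)
    state, prev = state + u, err
print(np.round(state, 3))                 # converges toward the reference
```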
arXiv Detail & Related papers (2024-02-25T05:04:51Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks. We show that the dense connections can be replaced with a sparse block-diagonal structure that supports larger expansion ratios. We also propose the use of a lightweight, parameter-free, channel covariance attention mechanism as a parallel branch during training.
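A sketch of the block-diagonal replacement: grouped 1x1 convolutions implement a block-diagonal linear map over channels, so the expansion ratio can grow at fixed parameter count. Dimensions and the group count are illustrative, and the covariance-attention branch is not shown.

```python
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """Sketch of the idea: replace the dense channel MLP with a block-diagonal
    (grouped) one, freeing parameters for a larger expansion ratio."""
    def __init__(self, dim=64, expansion=8, groups=4):
        super().__init__()
        # Grouped 1x1 convs == block-diagonal linear layers over channels.
        self.fc1 = nn.Conv1d(dim, dim * expansion, 1, groups=groups)
        self.fc2 = nn.Conv1d(dim * expansion, dim, 1, groups=groups)

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x.transpose(1, 2)              # channels first for Conv1d
        x = self.fc2(torch.relu(self.fc1(x)))
        return x.transpose(1, 2)

x = torch.randn(2, 10, 64)
print(BlockDiagonalMLP()(x).shape)         # torch.Size([2, 10, 64])
```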
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
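One plausible reading of "context sharing", sketched below under the assumption that keys and values are computed once from a reference frame and reused by every frame's queries; this illustrates the cost-saving idea, not necessarily CST's exact mechanism.

```python
import torch

def context_sharing_attention(frames, wq, wk, wv):
    """Illustrative reading of 'context sharing': keys/values are computed
    once from a single reference frame and reused by every frame's queries."""
    ref = frames[0]                        # (tokens, dim) reference frame
    k, v = ref @ wk, ref @ wv              # shared context, computed once
    outs = []
    for f in frames:                       # per-frame queries only
        q = f @ wq
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        outs.append(attn @ v)
    return torch.stack(outs)

d = 32
frames = torch.randn(4, 50, d)             # 4 frames, 50 tokens each
w = [torch.randn(d, d) for _ in range(3)]
print(context_sharing_attention(frames, *w).shape)  # torch.Size([4, 50, 32])
```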
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- A Closer Look at In-Context Learning under Distribution Shifts [24.59271215602147]
We aim to better understand the generality and limitations of in-context learning from the lens of the simple yet fundamental task of linear regression.
We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS).
Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter.
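The evaluation setup in miniature: generate an in-context linear regression problem, fit OLS on the context points as the reference predictor, and probe a shifted query distribution. All sizes and the shift are arbitrary choices for illustration.

```python
import numpy as np

# On in-context linear regression, compare any predictor against the
# ordinary-least-squares fit of the context points.
rng = np.random.default_rng(0)
d, n_ctx = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n_ctx, d))
y = X @ w_true + 0.1 * rng.normal(size=n_ctx)
x_query = rng.normal(size=d)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS prediction:", x_query @ w_ols)
print("ground truth:  ", x_query @ w_true)
# A distribution-shift probe swaps the query distribution, e.g.:
x_shifted = 3.0 * rng.normal(size=d) + 1.0
print("shifted query: ", x_shifted @ w_ols, "vs", x_shifted @ w_true)
```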
arXiv Detail & Related papers (2023-05-26T07:47:21Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
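A toy version of the token-dimension half of this design: each token is routed top-1 to one of several expert MLPs. The feature-dimension MoE is analogous and omitted; the router, gating, and capacity handling are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMoEMLP(nn.Module):
    """Sketch of token-dimension MoE: each token is routed (top-1) to one of
    several expert MLPs."""
    def __init__(self, dim=32, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x):                  # x: (tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)
        top = gates.argmax(-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():                 # gate-weighted expert output
                out[mask] = gates[mask, e:e + 1] * expert(x[mask])
        return out

print(TokenMoEMLP()(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```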
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
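The sampling/recovery split in miniature: each image block is measured by a sampling matrix, and a crude initial reconstruction is formed with its transpose, which the CNN-transformer network would then refine. The random Gaussian matrix below stands in for CSformer's learned adaptive sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block-based compressive sampling: each image block is measured by a
# sampling matrix Phi (random here, learned/adaptive in CSformer), and an
# initial reconstruction is obtained with its transpose.
block, ratio = 16, 0.25
m = int(ratio * block * block)             # measurements per block
phi = rng.normal(size=(m, block * block)) / np.sqrt(m)

x = rng.normal(size=(block, block))        # one image block
y = phi @ x.ravel()                        # sampling: 256 -> 64 values
x0 = (phi.T @ y).reshape(block, block)     # crude initial recovery
print(y.shape, x0.shape)                   # (64,) (16, 16)
```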
arXiv Detail & Related papers (2021-12-31T04:37:11Z)