Equivalence of Context and Parameter Updates in Modern Transformer Blocks
- URL: http://arxiv.org/abs/2511.17864v1
- Date: Sat, 22 Nov 2025 01:17:15 GMT
- Title: Equivalence of Context and Parameter Updates in Modern Transformer Blocks
- Authors: Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin
- Abstract summary: Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its weights. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches. We then generalize this result, providing a constructive proof and algorithm for multi-layer models.
- Score: 8.364690240329411
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.
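To make the rank-1 claim concrete, below is a minimal numpy sketch of the identity this line of work relies on: if attention produces activation a_ctx for the query token when the context is present and a_noctx when it is absent, a single rank-1 update to the first MLP weight matrix makes the context-free forward pass reproduce the context-full one. This is a self-contained illustration of the algebra, not the paper's algorithm; the variable names and attention stand-ins are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

W = rng.normal(size=(d_ff, d_model))       # first MLP weight matrix

# Stand-ins for the attention output on the query token:
a_ctx = rng.normal(size=d_model)           # with the context prepended
a_noctx = rng.normal(size=d_model)         # without the context

# Rank-1 patch: (W @ (a_ctx - a_noctx)) outer (a_noctx / ||a_noctx||^2)
delta = np.outer(W @ (a_ctx - a_noctx), a_noctx) / np.dot(a_noctx, a_noctx)

# The patched weights acting on the context-free activation reproduce
# the original weights acting on the context-full activation.
lhs = (W + delta) @ a_noctx
rhs = W @ a_ctx
assert np.allclose(lhs, rhs)
print("max abs diff:", np.abs(lhs - rhs).max())
```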
Related papers
- Improving Recursive Transformers with Mixture of LoRAs [2.672804414228544]
Mixture of LoRAs (MoL) inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters. ModernALBERT achieves state-of-the-art performance among compact models.
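A minimal sketch of the MoL idea as summarized above, assuming a softmax router and per-token mixing of LoRA experts over a shared up-projection; module names, shapes, and the gating rule are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLFFN(nn.Module):
    """Hypothetical sketch: a shared FFN whose up-projection is modulated
    by token-routed LoRA experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4, rank=8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)       # shared (tied) backbone
        self.down = nn.Linear(d_ff, d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_ff))

    def forward(self, x):                        # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)            # (T, E)
        # Per-expert LoRA path x @ A_e @ B_e, mixed by the token's gate.
        lora = torch.einsum('td,edr,erf->tef', x, self.A, self.B)
        h = self.up(x) + torch.einsum('te,tef->tf', gates, lora)
        return self.down(F.gelu(h))

x = torch.randn(5, 64)
print(MoLFFN()(x).shape)    # torch.Size([5, 64])
```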
arXiv Detail & Related papers (2025-12-14T23:39:30Z)
- MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts [0.0]
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks. This paper introduces MLPMoE (MLP-Experts), a training-free transformation that restructures the dense MLPs in transformer blocks into a static, high-cardinality mixture of experts.
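A toy illustration of the training-free restructuring described above: the dense FFN's hidden neurons are partitioned into static expert slices and only a few slices are evaluated per token. The contiguous split and the activation-mass router below are simplifying assumptions for the sketch; the actual method's partitioning and routing will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, f, n_exp = 32, 128, 4                  # model dim, FFN dim, experts
W1, W2 = rng.normal(size=(f, d)), rng.normal(size=(d, f))

# Zero-shot split: partition the hidden neurons into contiguous expert slices
# (real upcycling methods cluster neurons; contiguous slices keep this short).
slices = np.split(np.arange(f), n_exp)

def moe_forward(x, top_k=2):
    h = np.maximum(W1 @ x, 0.0)           # dense ReLU activations
    # Route by per-expert activation mass, keep only the top_k experts.
    mass = np.array([np.abs(h[s]).sum() for s in slices])
    keep = np.argsort(mass)[-top_k:]
    out = np.zeros(d)
    for e in keep:
        s = slices[e]
        out += W2[:, s] @ h[s]            # only selected slices contribute
    return out

x = rng.normal(size=d)
print(moe_forward(x).shape)               # (32,)
```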
arXiv Detail & Related papers (2025-11-26T06:14:26Z)
- FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers [30.88764351013966]
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. We propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance.
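As a hedged illustration of the first step such methods need, the snippet below scores block redundancy by how close each block is to an identity map (high input/output cosine similarity). This is a common block-importance heuristic, not necessarily FuseGPT's criterion, and the fusion/recovery stage is not shown.

```python
import torch
import torch.nn as nn

# Illustrative redundancy score: blocks whose output stays close to their
# input (high cosine similarity) are candidates for pruning/recycling.
def block_importance(blocks, x):
    scores = []
    for blk in blocks:
        y = blk(x)
        cos = torch.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(1.0 - cos.item())   # low score = nearly an identity map
        x = y
    return scores

blocks = nn.ModuleList(nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4))
x = torch.randn(8, 16)
print(block_importance(blocks, x))
```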
arXiv Detail & Related papers (2024-11-21T09:49:28Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
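For readers wanting to reproduce the data distribution, a minimal generator for n-gram Markov chain sequences (a fresh random chain per sequence, so next-token prediction must be solved in context) might look like the following; all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ngram_markov(vocab=4, n=2, length=64):
    """Sample a sequence from a random n-gram Markov chain: the next token's
    distribution depends on the previous n tokens (a fresh chain per sequence,
    which is what makes next-token prediction an in-context task)."""
    trans = rng.dirichlet(np.ones(vocab), size=(vocab,) * n)  # P(x_t | last n)
    seq = list(rng.integers(vocab, size=n))
    for _ in range(length - n):
        ctx = tuple(seq[-n:])
        seq.append(rng.choice(vocab, p=trans[ctx]))
    return np.array(seq)

print(sample_ngram_markov()[:16])
```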
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
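A stripped-down sketch of the hidden-state-alignment flavor of such distillation: freeze a teacher block and train a student block to match its outputs on shared inputs. Linear stand-ins replace the real Transformer/Mamba-2 mixers, and MOHAWK's other stages (matrix orientation, end-to-end distillation) are omitted.

```python
import torch
import torch.nn as nn

# Illustrative alignment stage: make each student block match the hidden
# states of the corresponding teacher block (teacher frozen).
teacher_blk = nn.Linear(32, 32).eval().requires_grad_(False)  # stand-in block
student_blk = nn.Linear(32, 32)                               # e.g. an SSM mixer
opt = torch.optim.Adam(student_blk.parameters(), lr=1e-3)

for _ in range(200):
    h = torch.randn(16, 32)               # shared block input
    loss = nn.functional.mse_loss(student_blk(h), teacher_blk(h))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"alignment loss: {loss.item():.4f}")
```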
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- PIDformer: Transformer Meets Control Theory [28.10913642120948]
We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions.
We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity.
Motivated by this control framework, we derive a novel class of transformers, the PID-controlled Transformer (PIDformer).
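For reference, the discrete PID rule the paper builds on, applied to a toy state tracking a reference signal. The gains and the loop are illustrative; PIDformer embeds this feedback inside attention layers, which the toy below does not reproduce.

```python
import numpy as np

# Textbook discrete PID update: u = Kp*e + Ki*sum(e) + Kd*de/dt.
def pid_step(err, integ, prev_err, kp=0.8, ki=0.1, kd=0.2, dt=1.0):
    integ = integ + err * dt
    deriv = (err - prev_err) / dt
    return kp * err + ki * integ + kd * deriv, integ

state, ref = np.zeros(4), np.ones(4)      # steer state toward the reference
integ, prev = np.zeros(4), np.zeros(4)
for _ in range(30):
    err = ref - state
    u, integ = pid_step(err, integ, prev)
    state, prev = state + u, err
print(np.round(state, 3))                 # converges toward the reference
```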
arXiv Detail & Related papers (2024-02-25T05:04:51Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks. We show that the dense connections can be replaced with a sparse block-diagonal structure that supports larger expansion ratios. We also propose the use of a lightweight, parameter-free, channel covariance attention mechanism as a parallel branch during training.
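A sketch of the block-diagonal replacement: grouped 1x1 convolutions implement a block-diagonal linear map over channels, so the expansion ratio can grow at fixed parameter count. Dimensions and the group count are illustrative, and the covariance-attention branch is not shown.

```python
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """Sketch of the idea: replace the dense channel MLP with a block-diagonal
    (grouped) one, freeing parameters for a larger expansion ratio."""
    def __init__(self, dim=64, expansion=8, groups=4):
        super().__init__()
        # Grouped 1x1 convs == block-diagonal linear layers over channels.
        self.fc1 = nn.Conv1d(dim, dim * expansion, 1, groups=groups)
        self.fc2 = nn.Conv1d(dim * expansion, dim, 1, groups=groups)

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x.transpose(1, 2)              # channels first for Conv1d
        x = self.fc2(torch.relu(self.fc1(x)))
        return x.transpose(1, 2)

x = torch.randn(2, 10, 64)
print(BlockDiagonalMLP()(x).shape)         # torch.Size([2, 10, 64])
```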
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
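One plausible reading of "context sharing", sketched below under the assumption that keys and values are computed once from a reference frame and reused by every frame's queries; this illustrates the cost-saving idea, not necessarily CST's exact mechanism.

```python
import torch

def context_sharing_attention(frames, wq, wk, wv):
    """Illustrative reading of 'context sharing': keys/values are computed
    once from a single reference frame and reused by every frame's queries."""
    ref = frames[0]                        # (tokens, dim) reference frame
    k, v = ref @ wk, ref @ wv              # shared context, computed once
    outs = []
    for f in frames:                       # per-frame queries only
        q = f @ wq
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        outs.append(attn @ v)
    return torch.stack(outs)

d = 32
frames = torch.randn(4, 50, d)             # 4 frames, 50 tokens each
w = [torch.randn(d, d) for _ in range(3)]
print(context_sharing_attention(frames, *w).shape)  # torch.Size([4, 50, 32])
```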
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- A Closer Look at In-Context Learning under Distribution Shifts [24.59271215602147]
We aim to better understand the generality and limitations of in-context learning from the lens of the simple yet fundamental task of linear regression.
We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS).
Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter.
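The evaluation setup in miniature: generate an in-context linear regression problem, fit OLS on the context points as the reference predictor, and probe a shifted query distribution. All sizes and the shift are arbitrary choices for illustration.

```python
import numpy as np

# On in-context linear regression, compare any predictor against the
# ordinary-least-squares fit of the context points.
rng = np.random.default_rng(0)
d, n_ctx = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n_ctx, d))
y = X @ w_true + 0.1 * rng.normal(size=n_ctx)
x_query = rng.normal(size=d)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS prediction:", x_query @ w_ols)
print("ground truth:  ", x_query @ w_true)
# A distribution-shift probe swaps the query distribution, e.g.:
x_shifted = 3.0 * rng.normal(size=d) + 1.0
print("shifted query: ", x_shifted @ w_ols, "vs", x_shifted @ w_true)
```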
arXiv Detail & Related papers (2023-05-26T07:47:21Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
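A toy version of the token-dimension half of this design: each token is routed top-1 to one of several expert MLPs. The feature-dimension MoE is analogous and omitted; the router, gating, and capacity handling are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMoEMLP(nn.Module):
    """Sketch of token-dimension MoE: each token is routed (top-1) to one of
    several expert MLPs."""
    def __init__(self, dim=32, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x):                  # x: (tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)
        top = gates.argmax(-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():                 # gate-weighted expert output
                out[mask] = gates[mask, e:e + 1] * expert(x[mask])
        return out

print(TokenMoEMLP()(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```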
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
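The sampling/recovery split in miniature: each image block is measured by a sampling matrix, and a crude initial reconstruction is formed with its transpose, which the CNN-transformer network would then refine. The random Gaussian matrix below stands in for CSformer's learned adaptive sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block-based compressive sampling: each image block is measured by a
# sampling matrix Phi (random here, learned/adaptive in CSformer), and an
# initial reconstruction is obtained with its transpose.
block, ratio = 16, 0.25
m = int(ratio * block * block)             # measurements per block
phi = rng.normal(size=(m, block * block)) / np.sqrt(m)

x = rng.normal(size=(block, block))        # one image block
y = phi @ x.ravel()                        # sampling: 256 -> 64 values
x0 = (phi.T @ y).reshape(block, block)     # crude initial recovery
print(y.shape, x0.shape)                   # (64,) (16, 16)
```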
arXiv Detail & Related papers (2021-12-31T04:37:11Z)