Untangling Component Imbalance in Hybrid Linear Attention Conversion Methods
- URL: http://arxiv.org/abs/2510.05901v2
- Date: Fri, 10 Oct 2025 17:42:09 GMT
- Authors: Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
- Abstract summary: Post-training linearisation methods convert pre-trained Transformers to linear models efficiently. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component. We propose three solutions to ensure balanced component usage.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on sliding-window attention (SWA). Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
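The third remedy, Scheduled Sliding-window Dropout, lends itself to a short sketch. The snippet below is a minimal illustration, not the paper's implementation: the linear ramp schedule, the additive branch combination, and all names are assumptions, and a real hybrid layer would mix the branches inside attention rather than on their outputs.

```python
import random
import numpy as np

def ssd_forward(linear_out, swa_out, step, training=True,
                schedule=lambda t: min(0.9, t / 1000)):
    """Sketch of Scheduled Sliding-window Dropout (SSD): during training,
    suppress the sliding-window softmax branch with a probability given
    by a schedule, so the linear branch must carry the representation.
    The linear ramp used for `schedule` is a hypothetical choice."""
    if training and random.random() < schedule(step):
        return linear_out                  # softmax branch suppressed
    return linear_out + swa_out            # both branches contribute

lin = np.ones((2, 4))
swa = np.full((2, 4), 0.5)
# At step 0 the drop probability is 0, so both branches are summed.
out = ssd_forward(lin, swa, step=0)
```

Because the drop probability starts at zero and grows over training, early steps behave like a standard hybrid while later steps increasingly force the linear path to stand on its own.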
Related papers
- POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation [57.57816409869894]
We introduce POET-X, a scalable and memory-efficient variant for training large language models. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
arXiv Detail & Related papers (2026-03-05T18:59:23Z)
- GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal $k$-sparse GLMs [7.079949618914198]
Branch-and-Bound (BnB) frameworks can certify optimality using perspective relaxations. Existing methods for solving these relaxations are computationally intensive, limiting their scalability. We develop a unified proximal framework that is both linearly convergent and computationally efficient.
arXiv Detail & Related papers (2026-03-01T22:26:09Z)
- Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction [3.9660062354591754]
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. We introduce a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
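The greedy replacement loop described above can be sketched in a few lines. This is a schematic reading of the abstract, not the paper's code: the configuration encoding, the tolerance rule, and the `val_score` callable are all hypothetical.

```python
def greedy_linearise(layers, val_score, tolerance=0.01):
    """Sketch of distill-then-replace: greedily swap full-attention
    blocks for linear ones, keeping a swap only if the validation score
    stays within `tolerance` of the current best. `layers` is a list of
    flags (True = full attention); `val_score` is a hypothetical
    callable scoring a configuration on the target task."""
    best = val_score(layers)
    for i in range(len(layers)):
        if not layers[i]:
            continue
        trial = layers.copy()
        trial[i] = False                   # replace block i with linear attention
        score = val_score(trial)
        if score >= best - tolerance:      # accept only if performance is preserved
            layers, best = trial, score
    return layers

# Toy scorer: pretend layer 0 is critical, so linearising it costs accuracy.
config = greedy_linearise([True] * 4,
                          lambda c: 1.0 if c[0] else 0.5)
```

A single pass over the layers suffices here, which is what makes the construction cheaper than re-training or architecture search.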
arXiv Detail & Related papers (2026-01-16T02:01:40Z)
- Distilling to Hybrid Attention Models via KL-Guided Layer Selection [66.06591032073744]
This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. We find that this approach is more effective than existing approaches for layer selection, including approaches that uniformly interleave linear attentions based on a fixed ratio.
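A minimal sketch of KL-guided selection, under stated assumptions: the importance score of a layer is taken here to be the KL divergence its linearisation induces on the output distribution, and the interface is a hypothetical simplification of the paper's recipe.

```python
import math

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists summing to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_softmax_layers(importance, k):
    """Keep the k layers whose linearisation would most perturb the
    output (largest importance score) as softmax attention, linearising
    the rest. In the paper the scores come from short training runs on
    generic text; here they are supplied directly."""
    ranked = sorted(range(len(importance)), key=lambda i: -importance[i])
    return sorted(ranked[:k])

# Toy scores: KL between a reference output distribution and the output
# obtained with each of three layers linearised.
ref = [0.7, 0.3]
scores = [kl_div(ref, q) for q in ([0.6, 0.4], [0.2, 0.8], [0.65, 0.35])]
keep = select_softmax_layers(scores, k=1)
```

The second layer perturbs the output most, so it alone retains softmax attention in this toy run.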
arXiv Detail & Related papers (2025-12-23T18:12:22Z)
- Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers [27.14203097630326]
We introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. Experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT.
arXiv Detail & Related papers (2025-11-13T03:40:54Z)
- A Trainable Optimizer [18.195022468462753]
We present a Trainable Optimizer (TO) framework that jointly trains the full gradient estimator and the trainable weights of the model. Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional multiplications. Experiments demonstrate that TO methods converge faster than benchmark algorithms.
arXiv Detail & Related papers (2025-08-03T14:06:07Z)
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [60.414548453838506]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
arXiv Detail & Related papers (2025-07-16T02:24:21Z)
- Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency [37.02934235737917]
We propose a principled method to determine the feature dimension in linear attention using the concept of statistical degrees of freedom. We show that our method achieves smaller error under a fixed computational budget. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
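The notion of statistical degrees of freedom has a standard closed form for a positive semi-definite matrix, df = tr(K (K + λI)⁻¹). Applying it to an attention Gram matrix, as sketched below, is an illustrative assumption rather than the paper's exact procedure.

```python
import numpy as np

def degrees_of_freedom(K, lam):
    """Statistical degrees of freedom of a PSD matrix K at ridge level
    lam: df = trace(K (K + lam I)^{-1}), i.e. the sum of
    eigenvalue ratios lambda_i / (lambda_i + lam). The abstract's idea
    is to size the linear-attention feature dimension by this effective
    dimensionality."""
    n = K.shape[0]
    return np.trace(K @ np.linalg.inv(K + lam * np.eye(n)))

# A rank-1 Gram matrix has roughly one effective degree of freedom,
# however large the ambient dimension.
v = np.ones((4, 1))
df = degrees_of_freedom(v @ v.T, lam=1e-3)
```

Intuitively, directions with eigenvalues well above λ each contribute nearly one degree of freedom, so df counts how many features the attention matrix effectively uses.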
arXiv Detail & Related papers (2025-07-04T06:59:17Z)
- Fast and Stable Diffusion Planning through Variational Adaptive Weighting [3.745003761050674]
Diffusion models have recently shown promise in offline RL. These methods often suffer from high training costs and slow convergence. We introduce a closed-form approximation method for online estimation of adaptive weights under the flow-based generative modeling framework. Experimental results on Maze2D and Kitchen tasks show that our method achieves competitive performance with up to 10 times fewer training steps.
arXiv Detail & Related papers (2025-06-20T02:12:04Z)
- Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- Robust optimization for adversarial learning with finite sample complexity guarantees [1.8434042562191815]
In this paper we focus on linear and nonlinear classification problems and propose a novel adversarial training method for robust classifiers.
We view robustness under a data driven lens, and derive finite sample complexity bounds for both linear and non-linear classifiers in binary and multi-class scenarios.
Our algorithm minimizes a worst-case surrogate loss using Linear Programming (LP) and Second Order Cone Programming (SOCP) for linear and non-linear models.
arXiv Detail & Related papers (2024-03-22T13:49:53Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.