Distilling to Hybrid Attention Models via KL-Guided Layer Selection
- URL: http://arxiv.org/abs/2512.20569v1
- Date: Tue, 23 Dec 2025 18:12:22 GMT
- Title: Distilling to Hybrid Attention Models via KL-Guided Layer Selection
- Authors: Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim
- Abstract summary: This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. We find that this approach is more effective than existing approaches for layer selection, including approaches that uniformly interleave linear attentions based on a fixed ratio.
- Score: 66.06591032073744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself (RADLADS; Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
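The layer-selection idea in the abstract can be sketched in a few lines: score each layer by how much the model's output distribution degrades (in KL terms) when that layer is swapped for a linear-attention variant, then convert the layers with the lowest scores. This is a minimal illustrative sketch, not the paper's actual procedure; the function names, the toy scores, and the assumption that per-layer KL deltas have already been measured are all hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rank_layers_for_conversion(layer_kl_scores, num_to_convert):
    """Given per-layer KL-increase scores (lower = safer to linearize),
    return the sorted indices of the layers to convert to linear attention."""
    ranked = sorted(range(len(layer_kl_scores)), key=lambda i: layer_kl_scores[i])
    return sorted(ranked[:num_to_convert])

# Toy example: a 6-layer model with hypothetical per-layer KL scores,
# where we want to convert half the layers to linear attention.
scores = [0.02, 0.35, 0.05, 0.41, 0.03, 0.28]
print(rank_layers_for_conversion(scores, 3))  # -> [0, 2, 4]
```

In this toy, layers 0, 2, and 4 have the smallest KL deltas and would be converted, while layers 1, 3, and 5 keep softmax attention; the paper's recipe derives the scores from a small amount of training on generic text rather than from a diagnostic dataset.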
Related papers
- STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs [23.745366354566315]
Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms. We propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs.
arXiv Detail & Related papers (2026-02-02T14:49:18Z)
- Untangling Component Imbalance in Hybrid Linear Attention Conversion Methods [14.82822709954587]
Post-training linearisation methods convert pre-trained Transformers to linear models efficiently. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component. We propose three solutions to ensure balanced component usage.
arXiv Detail & Related papers (2025-10-07T13:11:13Z)
- Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency [37.02934235737917]
We propose a principled method to determine the feature dimension in linear attention using the concept of statistical degrees of freedom. We show that our method achieves smaller error under a fixed computational budget. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
arXiv Detail & Related papers (2025-07-04T06:59:17Z)
- GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts [58.95913531746308]
We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. We propose to generate multiple layer parameters on the fly during inference by a lightweight meta-learned transformer, which we call GeneralizeFormer.
arXiv Detail & Related papers (2025-02-15T10:10:49Z)
- Transformer-Driven Active Transfer Learning for Cross-Hyperspectral Image Classification [3.087068801861429]
Hyperspectral image (HSI) classification presents inherent challenges due to high spectral dimensionality, significant domain shifts, and limited availability of labeled data. We propose a novel Active Transfer Learning (ATL) framework built upon a Spatial-Spectral Transformer (SST) backbone. The framework integrates multistage transfer learning with an uncertainty-diversity-driven active learning mechanism.
arXiv Detail & Related papers (2024-11-27T07:53:39Z)
- Exploring Selective Layer Fine-Tuning in Federated Learning [48.470385357429215]
Federated learning (FL) has emerged as a promising paradigm for fine-tuning foundation models using distributed data.
We study selective layer fine-tuning in FL, emphasizing a flexible approach that allows the clients to adjust their selected layers according to their local data and resources.
arXiv Detail & Related papers (2024-08-28T07:48:39Z)
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
- Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
- PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time Adaptation [6.181548939188321]
Real-world vision models in dynamic environments face rapid shifts in domain distributions, leading to decreased recognition performance. We propose continual test-time adaptation (CTTA) to adjust a pre-trained source discriminative model to these changing domains. We conduct extensive image classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C, demonstrating the superior efficacy of our method compared to prior approaches.
arXiv Detail & Related papers (2024-03-15T19:35:10Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
- A block coordinate descent optimizer for classification problems exploiting convexity [0.0]
We introduce a coordinate descent method to deep linear networks for classification tasks that exploits convexity of the cross-entropy loss in the weights of the hidden layer.
By alternating between a second-order method that finds globally optimal parameters for the linear layer and gradient descent applied to the hidden layers, we ensure an optimal fit of the adaptive basis to the data throughout training.
arXiv Detail & Related papers (2020-06-17T19:49:06Z)
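The KL-based distribution matching step of the RADLADS pipeline mentioned in the main abstract can be illustrated with a simple distillation objective: the student (the hybrid model) is trained to match the teacher's next-token distribution under a KL divergence. This is a minimal sketch under stated assumptions: the function name, the temperature parameter, and the use of plain Python lists instead of tensors are all illustrative choices, not the paper's implementation.

```python
import math

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between temperature-scaled softmax
    distributions -- an illustrative distribution-matching objective."""
    def softmax(logits):
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp((x - m) / temperature) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The loss is zero when the student exactly matches the teacher,
# and positive otherwise.
print(kl_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kl_distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In practice this loss would be averaged over token positions and minimized with respect to the student's parameters; a higher temperature softens both distributions and is a common distillation design choice.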
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.