Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
- URL: http://arxiv.org/abs/2601.22156v1
- Date: Thu, 29 Jan 2026 18:59:53 GMT
- Title: Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
- Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu,
- Abstract summary: We present HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
- Score: 27.8245634187787
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
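The abstract's point about inference speedups at long context can be illustrated with a generic sketch. This is not HALO or HypeNet: it is a minimal, pure-Python comparison of one decoding step of softmax attention (which reads a key/value cache that grows with context length) against one step of unnormalized linear attention written as an RNN (which updates a fixed-size state). All function names are hypothetical, and the unnormalized linear-attention form is a simplifying assumption.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_attention_step(q, keys, values):
    """One decoding step of softmax attention.

    `keys` and `values` hold every past token, so the work per step
    grows linearly with the context length.
    """
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def linear_attention_step(q, k, v, state):
    """One decoding step of unnormalized linear attention as an RNN.

    `state` is a fixed (d_k x d_v) matrix updated in place with the
    outer product k v^T, so the work per step does not depend on
    how many tokens came before.
    """
    for i, k_i in enumerate(k):
        for j, v_j in enumerate(v):
            state[i][j] += k_i * v_j
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(len(v))]


def run_demo(seq_len=32, dim=4, seed=0):
    """Check the recurrent form against the explicit sum over past tokens.

    Returns (max absolute error, final KV-cache length, state size).
    """
    rng = random.Random(seed)
    state = [[0.0] * dim for _ in range(dim)]
    keys, values = [], []
    max_err = 0.0
    for _ in range(seq_len):
        q = [rng.uniform(-1, 1) for _ in range(dim)]
        k = [rng.uniform(-1, 1) for _ in range(dim)]
        v = [rng.uniform(-1, 1) for _ in range(dim)]
        keys.append(k)
        values.append(v)
        recurrent = linear_attention_step(q, k, v, state)
        # Reference: sum over all tokens so far of (q . k_i) * v_i.
        reference = [
            sum(dot(q, k_i) * v_i[j] for k_i, v_i in zip(keys, values))
            for j in range(dim)
        ]
        max_err = max(max_err, max(abs(a - b) for a, b in zip(recurrent, reference)))
    return max_err, len(keys), dim * dim
```

Calling `run_demo()` returns the largest gap between the recurrent and explicit forms together with the two memory footprints: the cache length equals the number of tokens seen, while the recurrent state stays at `dim * dim` entries regardless of sequence length.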
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
- Native Hybrid Attention for Efficient Sequence Modeling [12.306252523159197]
Native Hybrid Attention (NHA) is a novel hybrid architecture of linear and full attention. A single softmax attention operation is applied over all keys and values, enabling per-token and per-head context-dependent weighting. Experimental results show that NHA surpasses Transformers on recall-intensive and commonsense reasoning tasks.
arXiv Detail & Related papers (2025-10-08T13:44:57Z)
- Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization [25.87557024380553]
We propose a simple yet effective hybrid normalization strategy that integrates the advantages of Pre-Norm and Post-Norm. In experiments on large-scale transformer models, we show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models.
arXiv Detail & Related papers (2025-03-06T16:40:48Z)
- LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation [37.21518386315535]
Scaling language models to handle longer contexts introduces substantial memory challenges. We propose LightTransfer, a method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention.
arXiv Detail & Related papers (2024-10-17T17:58:14Z)
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks. We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z)
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)