A Systematic Analysis of Hybrid Linear Attention
- URL: http://arxiv.org/abs/2507.06457v1
- Date: Tue, 08 Jul 2025 23:54:11 GMT
- Title: A Systematic Analysis of Hybrid Linear Attention
- Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
- Abstract summary: Linear models often suffer from limited recall performance. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
- Score: 11.722015123070957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
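To make the hybridization ratio concrete, here is a minimal sketch of how a linear-to-full layer schedule can be laid out. The function and layer names are illustrative assumptions, not the authors' released code:

```python
def build_layer_schedule(num_layers: int, linear_to_full: int) -> list[str]:
    """Return a layer-type schedule with one full-attention layer per
    `linear_to_full` linear-attention layers (e.g. 3 -> L L L F L L L F ...)."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (linear_to_full + 1) == 0:
            schedule.append("full_attention")
        else:
            schedule.append("linear_attention")  # e.g. HGRN-2 or GatedDeltaNet
    return schedule

print(build_layer_schedule(24, 3))  # 18 linear + 6 full layers, a 3:1 ratio
```

Whether the full-attention layers are spaced evenly or concentrated at particular depths is itself a design choice; the sketch simply spaces them uniformly.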
Related papers
- Lizard: An Efficient Linearization Framework for Large Language Models [100.63879229649581]
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Lizard addresses the quadratic cost and memory limitations of softmax attention by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving output quality. We show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods.
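As background for how linearization methods replace softmax attention, here is a minimal sketch of generic kernelized linear attention with the classic elu+1 feature map. This is the general idea such methods build on, not Lizard's specific mechanism:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """q, k: (batch, seq, d); v: (batch, seq, e). Non-causal illustration.
    softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V) with a positive
    feature map, reducing cost from O(T^2) to O(T) in sequence length."""
    phi = lambda x: F.elu(x) + 1                 # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("btd,bte->bde", k, v)      # fixed-size (d, e) summary
    z = k.sum(dim=1)                             # normalizer terms, (b, d)
    out = torch.einsum("btd,bde->bte", q, kv)
    denom = torch.einsum("btd,bd->bt", q, z).unsqueeze(-1)
    return out / (denom + 1e-6)
```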
arXiv Detail & Related papers (2025-07-11T21:19:18Z)
- Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances the efficiency of linear attention with the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
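The log-linear cost comes from covering each token's prefix with O(log t) power-of-two buckets. A small sketch of that decomposition (a Fenwick-style partition; the paper's exact growth function and state layout may differ):

```python
def fenwick_buckets(t: int) -> list[tuple[int, int]]:
    """Cover the prefix [0, t) with power-of-two segments, newest first.
    The number of segments is the popcount of t, i.e. O(log t)."""
    buckets, end = [], t
    while end > 0:
        size = end & (-end)           # lowest set bit of `end`
        buckets.append((end - size, end))
        end -= size
    return buckets

print(fenwick_buckets(13))  # [(12, 13), (8, 12), (0, 8)] -> 3 buckets
```

If each bucket keeps one fixed-size linear-attention state, a token at position t reads O(log t) states, which is where the log-linear total compute comes from.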
arXiv Detail & Related papers (2025-06-05T08:44:51Z)
- HyMamba: Mamba with Hybrid Geometry-Feature Coupling for Efficient Point Cloud Classification [7.139631485661567]
HyMamba is a geometry- and feature-coupled Mamba framework featuring: (1) Geometry-Feature Coupled Pooling (GFCP), which dynamically aggregates adjacent geometric information into local features; and (2) Collaborative Feature Enhancer (CoFE), which enhances sparse signal capture through cross-path feature hybridization. The proposed model achieves superior classification performance, particularly on the ModelNet40 dataset, where it elevates accuracy to 95.99% with merely 0.03M additional parameters. Furthermore, it attains 98.9% accuracy on the ModelNetShot dataset, validating its robust generalization capabilities under sparse samples.
arXiv Detail & Related papers (2025-05-16T10:30:20Z)
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the scaled dot-product attention (SDPA) operation, consistently improves performance.
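The described modification is easy to sketch. Below, a head-specific sigmoid gate modulates the SDPA output; computing the gate from the query via a per-head projection is an assumption made for illustration, as gate parameterizations can vary:

```python
import torch
import torch.nn.functional as F

def gated_sdpa(q, k, v, gate_proj):
    """q, k, v: (batch, heads, seq, head_dim); gate_proj: (heads, head_dim).
    Applies a head-specific sigmoid gate after scaled dot-product attention."""
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    gate = torch.sigmoid(torch.einsum("bhtd,hd->bht", q, gate_proj))
    return out * gate.unsqueeze(-1)   # gate scales each head's output

b, h, t, d = 2, 4, 16, 32
q, k, v = (torch.randn(b, h, t, d) for _ in range(3))
out = gated_sdpa(q, k, v, torch.randn(h, d))
```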
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
- Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z)
- CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token. Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
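The core of such a convolution-like strategy is restricting each query to a local key window. A minimal 1-D causal illustration of the mask follows; CLEAR itself operates on 2-D image tokens with a window around each query, so this is an analogy rather than the paper's exact formulation:

```python
import torch

def local_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: query position i
    may attend to key positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)

print(local_window_mask(6, 3).int())
```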
arXiv Detail & Related papers (2024-12-20T17:57:09Z)
- Scaling Laws for Linear Complexity Language Models [18.787664489713332]
We present the scaling laws for linear complexity language models to establish a foundation for their scalability.
The study reveals that existing linear complexity language models exhibit scaling behavior similar to that of conventional Transformer-based models.
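For context, scaling-law studies of this kind typically fit a power-law loss curve in parameter count N and training tokens D; a common Chinchilla-style parameterization (shown here as background, not necessarily this paper's exact fit) is:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```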
arXiv Detail & Related papers (2024-06-24T14:51:31Z)
- Hybrid State Space-based Learning for Sequential Data Prediction with Joint Optimization [0.0]
We introduce a hybrid model that mitigates, via a joint mechanism, the domain-specific feature engineering requirements of conventional nonlinear prediction models.
We achieve this by introducing novel state space representations for the base models, which are then combined to provide a full state space representation of the hybrid model or ensemble.
Owing to this novel combination and joint optimization, we demonstrate significant improvements on widely publicized real-life competition datasets.
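The textbook way to combine base models into one joint state space is parallel composition: concatenate the states, block-diagonalize the transition, and sum the outputs. The paper's representations are learned and combined jointly; the sketch below shows only that standard construction:

```python
import numpy as np

def parallel_state_space(A1, B1, C1, A2, B2, C2):
    """Compose two linear state-space models x' = A x + B u, y = C x
    in parallel: the joint state is the concatenation of the component
    states, and the joint output is the sum of the component outputs."""
    n1, n2 = A1.shape[0], A2.shape[0]
    A = np.block([[A1, np.zeros((n1, n2))],
                  [np.zeros((n2, n1)), A2]])
    B = np.vstack([B1, B2])
    C = np.hstack([C1, C2])
    return A, B, C
```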
arXiv Detail & Related papers (2023-09-19T12:00:28Z)
- Learning Bijective Feature Maps for Linear ICA [73.85904548374575]
We show that existing probabilistic deep generative models (DGMs), which are tailor-made for image data, underperform on non-linear ICA tasks.
To address this, we propose a DGM which combines bijective feature maps with a linear ICA model to learn interpretable latent structures for high-dimensional data.
We create models that converge quickly, are easy to train, and achieve better unsupervised latent factor discovery than flow-based models, linear ICA, and Variational Autoencoders on images.
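A conceptual sketch of the combination, using a fixed bijective elementwise map and scikit-learn's FastICA as stand-ins; the paper learns the bijection end-to-end rather than fixing it:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Map data through a bijective (invertible) elementwise feature map,
# then apply linear ICA in the resulting feature space.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 8))   # stand-in for high-dimensional data
h = np.arcsinh(x)                    # fixed bijective nonlinearity
s = FastICA(n_components=8, random_state=0).fit_transform(h)
print(s.shape)                       # recovered latent factors, (1000, 8)
```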
arXiv Detail & Related papers (2020-02-18T17:58:07Z)