Native Hybrid Attention for Efficient Sequence Modeling
- URL: http://arxiv.org/abs/2510.07019v2
- Date: Sat, 11 Oct 2025 09:31:02 GMT
- Title: Native Hybrid Attention for Efficient Sequence Modeling
- Authors: Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng
- Abstract summary: Native Hybrid Attention (NHA) is a novel hybrid architecture of linear and full attention. A single softmax attention operation is applied over all keys and values, enabling per-token and per-head context-dependent weighting. Experimental results show that NHA surpasses Transformers on recall-intensive and commonsense reasoning tasks.
- Score: 12.306252523159197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
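The mechanism described above can be read as: long-term key/value slots maintained by a linear RNN, concatenated with a short sliding window of recent tokens, then one softmax over everything. Below is a minimal single-head sketch of that idea; the slot count, the EMA-style slot update rule, and all tensor shapes are assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def nha_style_readout(q, slot_k, slot_v, window_k, window_v):
    """One query's readout from a hybrid attention head (single head, no batch).

    q:        (d,)     query for the current token
    slot_k:   (M, d)   long-term key slots maintained by a linear RNN
    slot_v:   (M, d)   long-term value slots
    window_k: (W, d)   keys of the last W tokens (sliding window)
    window_v: (W, d)   values of the last W tokens

    A single softmax over the concatenated slot and window keys weights
    long- and short-term context per token without extra fusion parameters,
    which is the core idea stated in the abstract.
    """
    k = torch.cat([slot_k, window_k], dim=0)        # (M + W, d)
    v = torch.cat([slot_v, window_v], dim=0)        # (M + W, d)
    scores = (k @ q) / q.shape[-1] ** 0.5           # (M + W,)
    return F.softmax(scores, dim=-1) @ v            # (d,)

def update_slots(slot_k, slot_v, k_t, v_t, decay):
    """Toy slot update: a per-slot exponential moving average of incoming
    keys/values (an assumption for illustration, not the paper's rule).
    decay: (M, 1) per-slot retention factors in (0, 1)."""
    return decay * slot_k + (1 - decay) * k_t, decay * slot_v + (1 - decay) * v_t
```

In this reading, a window size of zero leaves only the RNN-maintained slots (purely linear behavior), while a window spanning the whole sequence recovers full attention, consistent with the single-hyperparameter control described in the abstract.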
Related papers
- Improved state mixing in higher-order and block diagonal linear recurrent networks [16.116191916700554]
Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive, but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency.
arXiv Detail & Related papers (2026-02-12T14:51:59Z) - MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z) - Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts [27.8245634187787]
We present HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
arXiv Detail & Related papers (2026-01-29T18:59:53Z) - SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention [50.99430451151184]
Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning. We propose SoLA-Vision, a flexible layer-wise hybrid attention backbone.
arXiv Detail & Related papers (2026-01-16T10:26:53Z) - Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction [3.9660062354591754]
Transformer architectures deliver state-of-the-art accuracy via dense full attention, but their quadratic time and memory complexity limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. We introduce a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
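As summarized, the construction is a greedy loop over layers. A schematic sketch of such a loop follows; `model.blocks`, `block.is_full_attention`, `make_linear_block`, and the tolerance `tol` are hypothetical names introduced for illustration, not the paper's actual interface.

```python
def greedy_hybridize(model, val_metric, make_linear_block, tol=0.005):
    """Greedily swap full-attention blocks for linear-attention ones,
    keeping a swap only if the validation metric stays within `tol`
    of the best value seen so far (higher is assumed better).
    All helpers here are hypothetical, used only for illustration."""
    best = val_metric(model)
    for i, block in enumerate(model.blocks):
        if not getattr(block, "is_full_attention", False):
            continue
        model.blocks[i] = make_linear_block(block)   # e.g. initialise by distilling from `block`
        score = val_metric(model)
        if score < best - tol:
            model.blocks[i] = block                  # revert: this layer needs full attention
        else:
            best = max(best, score)
    return model
```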
arXiv Detail & Related papers (2026-01-16T02:01:40Z) - Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling [0.0]
The Gated Associative Memory (GAM) network is a novel, fully parallel architecture for sequence modeling. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets.
arXiv Detail & Related papers (2025-08-30T20:59:46Z) - Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction [0.9282594860064428]
This study presents a new deep learning framework, Hybrid LSTM-Trans-Mixer-Att, for rate of penetration (ROP) prediction in drilling. The proposed framework combines sequential memory, static feature interactions, global context learning, and dynamic feature weighting. Experimental validation on real-world drilling datasets demonstrates superior performance, achieving an R-squared of 0.9991 and a MAPE of 1.447%.
arXiv Detail & Related papers (2025-08-07T09:45:56Z) - Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models (DLLMs) offer efficient parallel generation and capable global modeling. The dominant application of DLLMs is hindered by the need for a statically predefined generation length. We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z) - HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation [7.663855540620183]
We propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity. Our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks.
arXiv Detail & Related papers (2025-07-24T16:48:50Z) - A Systematic Analysis of Hybrid Linear Attention [11.722015123070957]
Linear models often suffer from limited recall performance. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. Our models are open-sourced at https://huggingface.co/collections/m-hugging-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
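Selective gating and controlled forgetting are commonly realized in linear attention as a data-dependent decay applied to the recurrent state before each write. Below is a minimal recurrent sketch of that generic form, not of any specific model from the cited study; shapes and the single-head, unbatched setting are assumptions.

```python
import torch

def gated_linear_attention(q, k, v, alpha):
    """Minimal recurrent form of gated linear attention.
    q, k:   (T, d_k)   queries and keys
    v:      (T, d_v)   values
    alpha:  (T, d_k)   per-token, per-channel forget gates in (0, 1)
    The state S is decayed before each outer-product write ("controlled
    forgetting"); the gate is input-dependent ("selective gating")."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)
    out = []
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        out.append(q[t] @ S)
    return torch.stack(out)   # (T, d_v)
```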
arXiv Detail & Related papers (2025-07-08T23:54:11Z) - A Scalable Hybrid Training Approach for Recurrent Spiking Neural Networks [13.220581846415957]
In this work, we introduce HYbrid PRopagation (HYPR), which combines the efficiency of parallelization with approximate online forward learning. HYPR enables parallelization of parameter updates over subsequences for RSNNs consisting of almost arbitrary non-linear spiking neuron models. We find that this type of neuron model is particularly well trainable by HYPR, resulting in an unprecedentedly low task performance gap between approximate forward gradient learning and BPTT.
arXiv Detail & Related papers (2025-06-17T12:27:25Z) - DLF: Enhancing Explicit-Implicit Interaction via Dynamic Low-Order-Aware Fusion for CTR Prediction [71.41414150295702]
We propose a novel framework, Dynamic Low-Order-Aware Fusion (DLF), for click-through rate (CTR) prediction. Its RLI module preserves low-order signals while mitigating redundancy from residual connections, and its NAF module dynamically integrates explicit and implicit representations at each layer, enhancing information sharing. Experiments on public datasets demonstrate that DLF achieves state-of-the-art performance in CTR prediction, addressing key limitations of existing models.
arXiv Detail & Related papers (2025-05-25T15:05:00Z) - Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token-manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
The Generalized Spatial Propagation Network (GSPN) is a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z) - CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token. Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
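The mechanism described — each query interacting only with keys inside a local window — can be sketched as banded attention. The 1D window below is a simplification for brevity (CLEAR operates on 2D image tokens), and the dense mask is used only for clarity; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window=8):
    """Attention restricted to a local neighbourhood: each query token only
    interacts with keys whose (1D) position is within `window` of its own.
    q, k, v: (T, d).  A dense banded mask is used purely for clarity;
    a linear-cost implementation would gather only the local keys."""
    T, d = q.shape
    pos = torch.arange(T)
    band = (pos[:, None] - pos[None, :]).abs() <= window   # (T, T) boolean band
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```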
arXiv Detail & Related papers (2024-12-20T17:57:09Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - TriMLP: Revenge of a MLP-like Architecture in Sequential Recommendation [23.32537260687907]
We present TriMLP, an MLP-like architecture for sequential recommendation with a novel Triangular Mixer for cross-token communication.
In designing the Triangular Mixer, we simplify the cross-token operation to a basic matrix multiplication, and drop the lower-triangle neurons of the weight matrix to block anti-chronological connections from future tokens.
arXiv Detail & Related papers (2023-05-24T03:32:31Z)
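The triangular mixer described above can be illustrated as a token-mixing matrix multiplication whose weights are masked so that no output position receives signal from future tokens. Which triangle is dropped depends on the multiplication convention, so the sketch below (using y = W x along the sequence axis, with the upper triangle zeroed) is illustrative only and not the paper's code.

```python
import torch
import torch.nn as nn

class TriangularMixer(nn.Module):
    """Token-mixing layer whose weight matrix is masked so that each output
    position mixes only current and past tokens (no connections from the
    future).  Illustrative sketch; layer sizes and init are assumptions."""

    def __init__(self, seq_len):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)

    def forward(self, x):              # x: (batch, seq_len, dim)
        w = torch.tril(self.weight)    # keep only non-future (j <= i) connections
        return torch.einsum("ij,bjd->bid", w, x)
```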