Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
- URL: http://arxiv.org/abs/2410.06577v1
- Date: Wed, 9 Oct 2024 06:22:36 GMT
- Title: Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
- Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin
- Abstract summary: Rodimus is a new type of attention system for Transformer-based large language models (LLMs).
Rodimus employs a data-dependent tempered selection mechanism within a linear attention-based, purely recurrent framework.
Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens.
- Score: 26.025283259518936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to a $O(T)$ complexity for per-token generation, where $T$ represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus$+$. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus$+$ combines Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach, effectively leveraging the complementary semantic, token, and head compression techniques. Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. Model code and pre-trained checkpoints will be available soon.
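The exact DDTS update is specified in the paper; the sketch below only illustrates the general recipe it builds on: a purely recurrent, linear-attention-style update with a data-dependent gate and a fixed-size state, so per-token generation cost does not grow with the context length $T$. The gate form and the weight names (`W_q`, `W_k`, `W_v`, `W_g`) are placeholders for illustration, not the paper's DDTS or SW-SKA.

```python
import numpy as np

def gated_linear_attention(x, W_q, W_k, W_v, W_g):
    """Illustrative recurrent linear-attention decoder pass.

    x: (T, d) input token embeddings.
    Keeps a fixed-size state S of shape (d_k, d_v) regardless of T, which is
    what gives constant memory and O(1) work per generated token.  The
    data-dependent gate g_t plays the role of a selection/forget mechanism
    (a stand-in for the paper's DDTS, not its exact form).
    """
    T, d = x.shape
    d_k, d_v = W_k.shape[1], W_v.shape[1]
    S = np.zeros((d_k, d_v))                       # fixed-size recurrent state
    outputs = []
    for t in range(T):
        q = x[t] @ W_q                             # query,  (d_k,)
        k = x[t] @ W_k                             # key,    (d_k,)
        v = x[t] @ W_v                             # value,  (d_v,)
        g = 1.0 / (1.0 + np.exp(-(x[t] @ W_g)))    # data-dependent gate in (0, 1), (d_k,)
        # decay the old state per key dimension, then write the new key-value outer product
        S = g[:, None] * S + np.outer(k, v)
        outputs.append(q @ S)                      # read out with the query, (d_v,)
    return np.stack(outputs)                       # (T, d_v)

rng = np.random.default_rng(0)
T, d, d_k, d_v = 8, 16, 16, 16
x = rng.standard_normal((T, d))
W_q, W_k, W_v, W_g = (rng.standard_normal((d, n)) * 0.1 for n in (d_k, d_k, d_v, d_k))
y = gated_linear_attention(x, W_q, W_k, W_v, W_g)
print(y.shape)  # (8, 16)
```

Because the state `S` has a fixed shape, decoding memory stays constant in the context length; this is the property the abstract describes as semantic compression into fixed-size hidden states.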
Related papers
- Saliency-driven Dynamic Token Pruning for Large Language Models [32.903622070917194]
This paper proposes Saliency-driven Dynamic Token Pruning (SDTP).
A lightweight saliency-driven prediction module is designed to estimate the importance score of each token from its hidden state.
A ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency scores and the predicted importance scores (a toy pruning sketch follows below).
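As a rough illustration of the idea only (not SDTP's actual module or its ranking loss), the sketch below scores each token's hidden state with a toy linear predictor and keeps the top-scoring fraction; `w_score` and `keep_ratio` are made-up names.

```python
import numpy as np

def prune_tokens(hidden_states, w_score, keep_ratio=0.5):
    """Illustrative saliency-style token pruning (not SDTP's exact module).

    hidden_states: (T, d) per-token hidden states at some layer.
    w_score:       (d,) weights of a toy linear importance predictor
                   (SDTP uses a lightweight learned module; this is a stand-in).
    Returns the retained hidden states and the indices that were kept.
    """
    scores = hidden_states @ w_score                 # (T,) predicted importance
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])     # top-scoring tokens, in original order
    return hidden_states[keep], keep

rng = np.random.default_rng(1)
h = rng.standard_normal((10, 32))
w = rng.standard_normal(32)
kept, idx = prune_tokens(h, w, keep_ratio=0.4)
print(idx)  # indices of the 4 retained tokens
```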
arXiv Detail & Related papers (2025-04-06T15:15:07Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- Training a neural network for data reduction and better generalization [7.545668088790516]
The motivation for sparse learners is to compress the inputs (features) by selecting only the ones needed for good generalization.
We show a remarkable phase transition from ignoring irrelevant features to retrieving them well, thanks to the choice of artificial features.
This approach can be seen as a form of compressed sensing, reducing high-dimensional data to a compact, interpretable subset of meaningful features.
arXiv Detail & Related papers (2024-11-26T07:41:15Z)
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning [4.051777802443125]
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations.
We introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function (the baseline TopK autoencoder is sketched below).
We find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts.
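For reference, here is a minimal sketch of the baseline TopK sparse autoencoder that g-SAEs start from; the gradient-based augmentation of the TopK activation described in the paper is not reproduced, and all shapes and names are illustrative.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=8):
    """Baseline TopK sparse autoencoder forward pass (the architecture g-SAEs
    build on; the paper's gradient-based modification is omitted here).

    x: (d_model,) activation vector to decompose into sparse latents.
    """
    pre = W_enc @ x + b_enc                    # (n_latents,) pre-activations
    acts = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]                 # indices of the k largest pre-activations
    acts[top] = np.maximum(pre[top], 0.0)      # keep only the top-k, rectified
    recon = W_dec @ acts + b_dec               # (d_model,) reconstruction
    return acts, recon

rng = np.random.default_rng(2)
d_model, n_latents, k = 32, 128, 8
x = rng.standard_normal(d_model)
W_enc = rng.standard_normal((n_latents, d_model)) * 0.1
W_dec = rng.standard_normal((d_model, n_latents)) * 0.1
acts, recon = topk_sae_forward(x, W_enc, np.zeros(n_latents), W_dec, np.zeros(d_model), k)
print(np.count_nonzero(acts))  # at most k active latents
```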
arXiv Detail & Related papers (2024-11-15T18:03:52Z)
- Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training [3.195234044113248]
We propose NeuroAL, a top-up algorithm for network pruning.
It modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version.
It consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.
arXiv Detail & Related papers (2024-11-11T15:30:16Z)
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
- A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [54.682106515794864]
Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets.
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers to use pre-trained Language Models (LMs) for offline RL.
Empirical results indicate LaMo achieves state-of-the-art performance in sparse-reward tasks.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)
- CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference [36.753384415107774]
Scaling language models to larger and deeper sizes has led to significant boosts in performance.
We propose CoTFormer, a novel architecture which closely mimics Chain-of-Thought (CoT) at the token level.
We show that it is possible to reduce the computation cost significantly without any reduction in accuracy.
arXiv Detail & Related papers (2023-10-16T21:37:34Z)
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
- Binarizing by Classification: Is soft function really necessary? [4.329951775163721]
We propose to tackle network binarization as a binary classification problem.
We also take binarization as a lightweighting approach for pose estimation models.
The proposed method enables binary networks to achieve an mAP of up to $60.6$ for the first time.
arXiv Detail & Related papers (2022-05-16T02:47:41Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross-attention achieves better accuracy with respect to source-target word alignment (a minimal sketch of the core idea follows below).
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
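As a small illustration of the core idea only (softmax replaced by a ReLU on the scaled scores), here is a minimal NumPy sketch; the row-wise renormalization is a stand-in for the normalization ReLA actually applies, not the paper's exact formulation.

```python
import numpy as np

def rectified_linear_attention(Q, K, V, eps=1e-6):
    """Illustrative ReLU attention in the spirit of ReLA: the softmax over
    attention scores is replaced by a ReLU, so many weights become exactly
    zero (the source of sparsity).  The simple renormalization below is an
    assumption, standing in for the paper's stabilizing normalization.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (T_q, T_k) scaled dot-product scores
    weights = np.maximum(scores, 0.0)               # ReLU instead of softmax
    weights = weights / (weights.sum(-1, keepdims=True) + eps)  # keep rows on a comparable scale
    return weights @ V                              # (T_q, d_v)

rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
out = rectified_linear_attention(Q, K, V)
print(out.shape)  # (5, 16)
```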