Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
- URL: http://arxiv.org/abs/2210.04243v1
- Date: Sun, 9 Oct 2022 12:27:25 GMT
- Title: Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
- Authors: Huanru Henry Mao
- Abstract summary: Autoregressive Transformers incur O(T) complexity during per-token generation due to the self-attention mechanism.
Recent work proposes kernel-based methods to approximate causal self-attention.
We propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2.
- Score: 1.1802674324027231
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive Transformers are strong language models but incur O(T)
complexity during per-token generation due to the self-attention mechanism.
Recent work proposes kernel-based methods to approximate causal self-attention
by replacing it with recurrent formulations with various update rules and
feature maps to achieve O(1) time and memory complexity. We explore these
approaches and find that they are unnecessarily complex, and propose a simple
alternative - decaying fast weights - that runs fast on GPU, outperforms prior
methods, and retains 99% of attention's performance for GPT-2. We also show
competitive performance on WikiText-103 against more complex attention
substitutes.
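The contrast drawn in the abstract, O(T) work per generated token for attention versus O(1) for a recurrent fast-weight state, can be illustrated with a minimal Python sketch. It assumes a generic update rule, S_t = decay * S_{t-1} + k_t v_t^T with a fixed scalar decay, which is not stated in the abstract and may differ from the paper's actual formulation. The function names and toy inputs are hypothetical.

```python
# Illustrative sketch only: a generic decaying fast-weight memory.
# ASSUMPTION: update rule S_t = decay * S_{t-1} + k_t v_t^T with a fixed scalar
# decay; the paper's exact rule is not given in the abstract.

def fast_weight_step(state, k, v, q, decay):
    """One generation step whose cost does not depend on sequence length.

    state: d_k x d_v matrix (list of lists) summarizing all past tokens.
    Returns the updated state and the output q^T S_t (length d_v).
    """
    d_k, d_v = len(k), len(v)
    new_state = [
        [decay * state[i][j] + k[i] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


def decayed_reference(keys, values, q, decay):
    """O(T) reference: revisit every stored (k, v) pair for a single query."""
    T, d_v = len(keys), len(values[0])
    out = [0.0] * d_v
    for s in range(T):
        # Older tokens are down-weighted by decay^(age); no softmax is applied.
        w = decay ** (T - 1 - s) * sum(qi * ki for qi, ki in zip(q, keys[s]))
        for j in range(d_v):
            out[j] += w * values[s][j]
    return out


# Toy inputs (hypothetical values chosen for easy hand-checking).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
query = [1.0, 1.0]

state = [[0.0, 0.0], [0.0, 0.0]]
for k, v in zip(keys, values):
    state, recurrent_out = fast_weight_step(state, k, v, query, decay=0.5)

reference_out = decayed_reference(keys, values, query, decay=0.5)
print(recurrent_out, reference_out)
```

In this sketch the recurrent path keeps only a fixed d_k x d_v state, while the reference path must store and rescan all T key-value pairs; both produce the same output for the final token under the assumed rule.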
Related papers
- TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers.
We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process.
TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z)
- InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation [56.694702609077495]
Long-sequence processing is a critical capability for modern large language models.
InfLLM-V2 is a trainable sparse attention framework that seamlessly adapts models from short to long sequences.
In experiments, InfLLM-V2 is 4× faster than dense attention while retaining 98.1% and 99.7% of the performance.
arXiv Detail & Related papers (2025-09-29T12:08:33Z)
- Fast attention mechanisms: a tale of parallelism [52.7657529272906]
We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity.
We prove that ANNA-transformers retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms.
arXiv Detail & Related papers (2025-09-10T20:59:44Z)
- From TLinFormer to TConstFormer: The Leap to Constant-Time Transformer Attention: Achieving O(1) Computation and O(1) KV Cache during Autoregressive Inference [0.0]
TConstFormer employs an innovative periodic state update mechanism to achieve a truly constant-size O(1) KV Cache.
TConstFormer exhibits an overwhelming advantage over baseline models in terms of speed, memory efficiency, and overall performance on long-text inference tasks.
arXiv Detail & Related papers (2025-08-29T19:23:35Z)
- AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction [36.239648954658534]
Time series forecasting requires architectures that simultaneously achieve three competing objectives.
We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges.
Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and 6.06X memory reduction compared to PatchTST on P08.
arXiv Detail & Related papers (2025-06-19T03:47:04Z)
- Decomposition-based multi-scale transformer framework for time series anomaly detection [0.9438207505148947]
We propose a transformer-based framework built on decomposition (TransDe) for time series anomaly detection.
A multi-scale patch-based transformer architecture is proposed to exploit the representative dependencies of each decomposed component of the time series.
A novel asynchronous loss function with a stop-gradient strategy is introduced to enhance the performance of TransDe effectively.
arXiv Detail & Related papers (2025-04-19T06:47:38Z)
- Towards Transformer-Based Aligned Generation with Self-Coherence Guidance [51.42269790543461]
We introduce a training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs).
Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges.
Our method addresses these challenges by directly optimizing cross-attention maps during the generation process.
arXiv Detail & Related papers (2025-03-22T07:03:57Z)
- Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting [47.58016750718323]
We propose a novel generative self-supervised method called TimeDART.
TimeDART captures both the global sequence dependence and local detail features within time series data.
Our code is publicly available at https://github.com/Melmaphother/TimeDART.
arXiv Detail & Related papers (2024-10-08T06:08:33Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- ReCycle: Fast and Efficient Long Time Series Forecasting with Residual Cyclic Transformers [0.06965384453064827]
Residual Cyclic Transformer, ReCycle, bridges the gap between high method complexity and realistic computational resources.
Our approach reduces the run time and energy consumption by more than an order of magnitude, making both training and inference feasible on low-performance, low-power and edge computing devices.
arXiv Detail & Related papers (2024-05-06T12:48:34Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism.
We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model [14.846377138993645]
Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers.
A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise.
We propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy.
arXiv Detail & Related papers (2023-05-18T09:58:19Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Easy and Efficient Transformer : Scalable Inference Solution For large NLP model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-the-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.