SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood
Filling
- URL: http://arxiv.org/abs/2309.12578v1
- Date: Fri, 22 Sep 2023 02:14:46 GMT
- Title: SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood
Filling
- Authors: Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon
- Abstract summary: We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method.
Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training.
SPION achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models.
- Score: 1.0128808054306186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsifying the Transformer has garnered considerable interest, as training
the Transformer is very computationally demanding. Prior efforts to sparsify
the Transformer have either used a fixed pattern or data-driven approach to
reduce the number of operations involving the computation of multi-head
attention, which is the main bottleneck of the Transformer. However, existing
methods suffer from inevitable problems, such as the potential loss of
essential sequence features due to the uniform fixed pattern applied across all
layers, and an increase in the model size resulting from the use of additional
parameters to learn sparsity patterns in attention operations. In this paper,
we propose a novel sparsification scheme for the Transformer that integrates
convolution filters and the flood filling method to efficiently capture the
layer-wise sparse pattern in attention operations. Our sparsification approach
reduces the computational complexity and memory footprint of the Transformer
during training. Efficient implementations of the layer-wise sparsified
attention algorithm on GPUs are developed, demonstrating a new SPION that
achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer
models, with better evaluation quality.
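The abstract names two ingredients, convolution filters and flood filling, without giving details. The following is only a minimal sketch of how the two could be combined to derive a sparsity mask from one layer's attention score matrix; it is not the authors' implementation. The function names, the diagonal filter shape, the diagonal seed cells, and the `ratio` threshold are all assumptions made for illustration.

```python
from collections import deque

import numpy as np


def diagonal_conv(scores, k=3):
    """Apply a k x k diagonal (identity-shaped) averaging filter with zero padding.

    Assumption: a diagonal filter is used to emphasise diagonal structure;
    k is expected to be odd.
    """
    n, m = scores.shape
    pad = k // 2
    padded = np.pad(scores, pad)
    out = np.zeros_like(scores, dtype=float)
    for d in range(k):
        # Shifted view that picks up the d-th tap along the filter diagonal.
        out += padded[d:d + n, d:d + m]
    return out / k


def flood_fill_mask(feature, seeds, threshold):
    """4-connected flood fill from the seed cells over cells >= threshold."""
    n, m = feature.shape
    mask = np.zeros((n, m), dtype=bool)
    queue = deque()
    for r, c in seeds:
        if feature[r, c] >= threshold and not mask[r, c]:
            mask[r, c] = True
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < n and 0 <= cc < m
            if inside and not mask[rr, cc] and feature[rr, cc] >= threshold:
                mask[rr, cc] = True
                queue.append((rr, cc))
    return mask


def sparse_pattern(scores, k=3, ratio=0.5):
    """Hypothetical per-layer pipeline: filter, threshold, flood fill from the diagonal."""
    feature = diagonal_conv(scores, k)
    threshold = ratio * feature.max()  # assumed thresholding rule
    seeds = [(i, i) for i in range(min(scores.shape))]
    return flood_fill_mask(feature, seeds, threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy attention scores: a strong diagonal on top of small noise.
    scores = 0.05 * rng.random((8, 8)) + np.eye(8)
    print(sparse_pattern(scores).astype(int))
```

In a sketch like this, the mask would be computed separately for each layer, so that each layer keeps only the attention entries connected to its own dominant pattern rather than sharing one fixed pattern across all layers.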
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based models affect generalization while improving adaption.
We conclude that proper magnitude-based pruning has a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems.
We build upon this connection and propose a modification of the internal architecture of a transformer layer.
Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z)
- Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Efficient transformers, which leverage techniques such as sparse attention, linear attention, and hashing tricks, have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy.
We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps.
We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities.
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs.
Recent work attempts to apply Transformers incrementally via restart-incrementality.
This approach is computationally costly and does not scale efficiently for long sequences.
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.