Related papers: SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

URL: http://arxiv.org/abs/2309.12578v1
Date: Fri, 22 Sep 2023 02:14:46 GMT
Title: SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling
Authors: Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon
Abstract summary: We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. New SPION achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models.
Score: 1.0128808054306186
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. Efficient implementations of the layer-wise sparsified attention algorithm on GPUs are developed, demonstrating a new SPION that achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.

Related papers

Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection [14.363929799618283]
We propose CHain Of thOught Symbol dEtection (CHOOSE), a CoT-enhanced shallow Transformer framework for wireless symbol detection.<n>By introducing autoregressive latent reasoning steps within the hidden space, CHOOSE significantly improves the reasoning capacity of shallow models.<n> Experimental results demonstrate that our approach outperforms conventional shallow Transformers and achieves performance comparable to that of deep Transformers.
arXiv Detail & Related papers (2025-06-26T08:41:45Z)
A temporal scale transformer framework for precise remaining useful life prediction in fuel cells [10.899223392837936]
Temporal Scale Transformer (TSTransformer) is an enhanced version of the inverted Transformer (iTransformer) Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling. It improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs.
arXiv Detail & Related papers (2025-04-08T23:42:54Z)
RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals [2.287772422489548]
We propose RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification.
arXiv Detail & Related papers (2025-02-18T09:34:31Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so. We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed. Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications. The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate. There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. We build upon this connection and propose a modification of the internal architecture of a transformer layer. Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z)
Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the emphmomentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities.
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs. Recent work attempts to apply Transformers incrementally via restart-incrementality. This approach is computationally costly and does not scale efficiently for long sequences.
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. A linear-complexity recurrent variant has proven well suited for autoregressive generation. This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks. We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers. When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.