When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
- URL: http://arxiv.org/abs/2306.08667v1
- Date: Wed, 14 Jun 2023 17:59:02 GMT
- Title: When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
- Authors: Anuj Diwan, Eunsol Choi, David Harwath
- Abstract summary: We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
- Score: 39.00433193973159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the first unified study of the efficiency of self-attention-based
Transformer variants spanning text, speech and vision. We identify input length
thresholds (tipping points) at which efficient Transformer variants become more
efficient than vanilla models, using a variety of efficiency metrics (latency,
throughput, and memory). To conduct this analysis for speech, we introduce
L-HuBERT, a novel local-attention variant of a self-supervised speech model. We
observe that these thresholds are (a) much higher than typical dataset sequence
lengths and (b) dependent on the metric and modality, showing that choosing the
right model depends on modality, task type (long-form vs. typical context) and
resource constraints (time vs. memory). By visualising the breakdown of the
computational costs for transformer components, we also show that
non-self-attention components exhibit significant computational costs. We
release our profiling toolkit at
https://github.com/ajd12342/profiling-transformers .
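The tipping-point analysis comes down to sweeping the input length and recording latency and peak memory for a vanilla self-attention block and an efficient (here, local/windowed) variant, then reading off where the curves cross. The sketch below is a minimal PyTorch illustration of that procedure, not the released toolkit; the module, model sizes, window length and sequence lengths are assumptions for illustration, and measured tipping points depend on hardware, batch size and the rest of the model (the paper notes that non-self-attention components also carry significant cost).
```python
import math
import time

import torch
import torch.nn as nn


class ToySelfAttention(nn.Module):
    """Multi-head self-attention. If `window` is set, attention is computed
    independently inside non-overlapping chunks of that size (block-local
    attention), cutting the cost from O(N^2) to roughly O(N * window)."""

    def __init__(self, d_model=768, n_heads=12, window=None):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.n_heads, self.window = n_heads, window

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        if self.window is not None:
            assert n % self.window == 0, "sketch assumes seq_len divisible by window"
            x = x.reshape(b * n // self.window, self.window, d)  # chunks become sequences
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t):  # (B', N', D) -> (B', H, N', D/H)
            return t.view(t.shape[0], t.shape[1], self.n_heads, -1).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])  # O(N'^2) score matrix
        o = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(x.shape[0], x.shape[1], d)
        if self.window is not None:
            o = o.reshape(b, n, d)  # undo the chunking
        return self.out(o)


def profile(module, seq_len, d_model=768, repeats=5, device="cpu"):
    """Return (mean latency in ms, peak CUDA memory in MiB; NaN on CPU)."""
    module = module.to(device).eval()
    x = torch.randn(1, seq_len, d_model, device=device)
    with torch.no_grad():
        module(x)  # warm-up
        if device == "cuda":
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            module(x)
        if device == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / repeats * 1e3
    peak_mib = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else float("nan")
    return latency_ms, peak_mib


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for n in (512, 1024, 2048, 4096):  # longer lengths need several GiB for full attention
        t_full, m_full = profile(ToySelfAttention(), n, device=device)
        t_loc, m_loc = profile(ToySelfAttention(window=128), n, device=device)
        print(f"len={n:5d} | full {t_full:8.1f} ms {m_full:8.1f} MiB"
              f" | local {t_loc:8.1f} ms {m_loc:8.1f} MiB")
```
Reading the printed table for the length at which the local column starts to beat the full column gives the kind of tipping point the paper reports, separately for the latency and memory metrics.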
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise; a hedged sketch of this subtraction idea appears after this list.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- A Transformer-based Framework For Multi-variate Time Series: A Remaining Useful Life Prediction Use Case [4.0466311968093365]
This work proposed an encoder-Transformer framework for time series prediction.
We validated the effectiveness of the proposed framework on all four subsets of the C-MAPSS benchmark dataset.
To make the model aware of the initial stages of machine life and of its degradation path, a novel expanding-window method was proposed.
arXiv Detail & Related papers (2023-08-19T02:30:35Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- A Time Series is Worth 64 Words: Long-term Forecasting with Transformers [4.635547236305835]
We propose an efficient design of Transformer-based models for time series forecasting and self-supervised representation learning.
It is based on two key components: (i) segmentation of the time series into subseries-level patches, which serve as input tokens to the Transformer, and (ii) channel-independence, where each channel contains a single univariate series and all channels share the same embedding and Transformer weights.
PatchTST improves long-term forecasting accuracy significantly compared with SOTA Transformer-based models.
arXiv Detail & Related papers (2022-11-27T05:15:42Z)
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We show that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
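As referenced in the Differential Transformer entry above, the core mechanism is to compute two softmax attention maps and subtract the second, scaled by a learnable lambda, so that attention noise common to both maps cancels while attention to relevant context is amplified. The single-head sketch below is a hedged reading of that idea, not the authors' implementation; the class name is invented, lambda is simplified to a plain scalar parameter, and the paper's multi-head grouping and normalization details are omitted.
```python
import math
import torch
import torch.nn as nn


class DifferentialAttentionHead(nn.Module):
    """Hypothetical single-head sketch of differential attention:
    out = (softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))) V."""

    def __init__(self, d_model, d_head, lambda_init=0.8):
        super().__init__()
        # Two query/key projections (one per attention map), one value projection.
        self.q = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # simplified scalar lambda
        self.d_head = d_head

    def forward(self, x):  # x: (batch, seq_len, d_model)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Subtracting the second map cancels attention mass shared by both maps
        # (treated as noise) while keeping what only the first map emphasises.
        return (a1 - self.lam * a2) @ v


if __name__ == "__main__":
    head = DifferentialAttentionHead(d_model=64, d_head=16)
    print(head(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 16])
```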