Bird-Eye Transformers for Text Generation Models
- URL: http://arxiv.org/abs/2210.03985v1
- Date: Sat, 8 Oct 2022 09:51:15 GMT
- Title: Bird-Eye Transformers for Text Generation Models
- Authors: Lei Sha, Yuhang Song, Yordan Yordanov, Tommaso Salvatori, Thomas
Lukasiewicz
- Abstract summary: We propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers.
Our proposed model achieves better performance than the baseline transformer architectures on all datasets.
- Score: 49.47825106383972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become an indispensable module for text generation models
since their great success in machine translation. Previous works attribute
the success of transformers to the query-key-value dot-product attention, which
provides a robust inductive bias through fully connected token graphs. However,
we found that self-attention has a severe limitation. When predicting the
(i+1)-th token, self-attention only takes the i-th token as an information
collector, and it tends to give a high attention weight to those tokens similar
to itself. Therefore, most of the historical information that occurred before
the i-th token is not taken into consideration. Based on this observation, in
this paper, we propose a new architecture, called bird-eye transformer (BET),
which goes one step further to improve the performance of transformers by
reweighting self-attention to encourage it to focus more on important
historical information. We have conducted experiments on multiple text
generation tasks, including machine translation (2 datasets) and language
models (3 datasets). These experimental results show that our proposed model
achieves better performance than the baseline transformer architectures
on all datasets. The code is released at:
https://sites.google.com/view/bet-transformer/home.
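As an illustration of the reweighting idea described in the abstract, the following is a minimal sketch in PyTorch: standard causal dot-product attention, followed by a simple, assumed reweighting rule that boosts earlier (historical) positions before renormalizing. The function names and the distance-based boost are illustrative assumptions; the exact BET reweighting is defined in the paper and its released code.

    # Minimal sketch (not the paper's exact BET formulation): standard causal
    # dot-product attention, followed by an assumed reweighting step that shifts
    # attention mass toward earlier (historical) tokens before renormalizing.
    import torch
    import torch.nn.functional as F

    def causal_attention_weights(q, k):
        """Standard causal softmax attention weights, shape (seq_len, seq_len)."""
        seq_len = q.size(0)
        scores = q @ k.t() / (q.size(-1) ** 0.5)
        future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
        return F.softmax(scores, dim=-1)

    def reweight_toward_history(attn, strength=1.0):
        """Illustrative reweighting (an assumption, not the published BET rule):
        boost earlier positions so a query does not concentrate only on tokens
        similar to itself, then renormalize each row."""
        seq_len = attn.size(-1)
        positions = torch.arange(seq_len, dtype=attn.dtype)
        # Older positions (small index) get a larger boost.
        boost = 1.0 + strength * positions.flip(0) / max(seq_len - 1, 1)
        reweighted = attn * boost  # broadcasts over key positions
        return reweighted / reweighted.sum(dim=-1, keepdim=True)

    # Toy usage: 5 tokens with 8-dimensional queries/keys.
    torch.manual_seed(0)
    q, k = torch.randn(5, 8), torch.randn(5, 8)
    attn = causal_attention_weights(q, k)
    print(attn[-1])                           # how the last token attends to history
    print(reweight_toward_history(attn)[-1])  # after the illustrative reweighting

In the toy example, the last row of the attention matrix shows how the final token attends to its history before and after the illustrative reweighting.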
Related papers
- Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained with a next-token prediction loss can converge to either global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z)
- How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants [39.00433193973159]
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
arXiv Detail & Related papers (2023-06-14T17:59:02Z)
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
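As a concrete illustration of the segment-aware mechanism summarized in the Segatron entry above, here is a minimal sketch, assuming the common formulation of summing paragraph-, sentence-, and token-level position embeddings with the token embedding; the class and parameter names are illustrative, and Segatron's actual design (including its Transformer-XL integration) is described in the paper.

    # Minimal sketch of a segment-aware input encoding (an illustration of the
    # general idea, not Segatron's exact implementation): the input representation
    # sums the token embedding with paragraph-, sentence-, and token-level
    # position embeddings.
    import torch
    import torch.nn as nn

    class SegmentAwareEmbedding(nn.Module):
        def __init__(self, vocab_size, d_model, max_para=64, max_sent=128, max_tok=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.para_pos = nn.Embedding(max_para, d_model)  # paragraph index
            self.sent_pos = nn.Embedding(max_sent, d_model)  # sentence index within paragraph
            self.tok_pos = nn.Embedding(max_tok, d_model)    # token index within sentence

        def forward(self, token_ids, para_ids, sent_ids, tok_ids):
            # All inputs are (batch, seq_len) integer tensors.
            return (self.tok_emb(token_ids) + self.para_pos(para_ids)
                    + self.sent_pos(sent_ids) + self.tok_pos(tok_ids))

    # Toy usage: one sequence of 6 tokens spanning two sentences in one paragraph.
    emb = SegmentAwareEmbedding(vocab_size=1000, d_model=32)
    token_ids = torch.randint(0, 1000, (1, 6))
    para_ids = torch.zeros(1, 6, dtype=torch.long)
    sent_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])
    tok_ids = torch.tensor([[0, 1, 2, 0, 1, 2]])
    print(emb(token_ids, para_ids, sent_ids, tok_ids).shape)  # torch.Size([1, 6, 32])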
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.