Rethinking the Value of Transformer Components
- URL: http://arxiv.org/abs/2011.03803v1
- Date: Sat, 7 Nov 2020 16:31:45 GMT
- Title: Rethinking the Value of Transformer Components
- Authors: Wenxuan Wang and Zhaopeng Tu
- Abstract summary: We evaluate the impact of individual components (sub-layers) in trained Transformer models from different perspectives.
We propose a new training strategy that improves translation performance by distinguishing unimportant components during training.
- Score: 45.841272820008264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer has become the state-of-the-art translation model, yet how each intermediate component contributes to model performance is not well studied, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual components (sub-layers) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than others. We also report a number of interesting findings that might help humans better analyze, understand, and improve Transformer models. Based on these observations, we further propose a new training strategy that can improve translation performance by distinguishing unimportant components during training.
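The abstract does not spell out how the impact of a sub-layer is measured. One common proxy, sketched below under that assumption, is to ablate one trained sub-layer at a time by zeroing its residual-branch output and recording the drop in a validation metric such as BLEU. The PyTorch code is a minimal illustration, not the authors' implementation; the `eval_fn` metric routine and the choice of hooked modules are assumptions.

```python
# Illustrative sketch only (not the paper's implementation): estimate the
# importance of a trained Transformer sub-layer by zeroing its output inside
# the residual branch and measuring the drop in a validation metric.
import torch


def ablate_sublayer(module):
    """Attach a forward hook that zeroes the sub-layer's output, so the
    surrounding residual connection simply passes its input through."""
    def hook(mod, inputs, output):
        if isinstance(output, tuple):  # e.g. nn.MultiheadAttention returns (out, weights)
            return (torch.zeros_like(output[0]),) + tuple(output[1:])
        return torch.zeros_like(output)
    return module.register_forward_hook(hook)


@torch.no_grad()
def sublayer_importance(model, sublayers, eval_fn):
    """Score each named sub-layer by the metric drop caused by ablating it.

    `sublayers` maps names to sub-modules (e.g. each layer's self-attention
    or feed-forward block); `eval_fn(model) -> float` is a hypothetical
    evaluation routine such as BLEU on a held-out set.
    """
    model.eval()
    baseline = eval_fn(model)
    scores = {}
    for name, module in sublayers.items():
        handle = ablate_sublayer(module)
        scores[name] = baseline - eval_fn(model)  # larger drop => more important
        handle.remove()
    return scores
```

For instance, `sublayers` could be built from a `torch.nn.TransformerEncoder` as `{f"layer{i}.self_attn": l.self_attn for i, l in enumerate(encoder.layers)}`. Note that recent PyTorch releases may route `nn.TransformerEncoderLayer` through a fused fast path in inference mode that bypasses sub-module hooks, in which case a custom layer implementation is needed.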
Related papers
- Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer [0.0]
This paper studies an improved Transformer for text classification, aiming to raise both the accuracy and the efficiency of the model.
The improved Transformer model outperforms the comparative models such as BiLSTM, CNN, standard Transformer, and BERT in terms of classification accuracy, F1 score, and recall rate.
arXiv Detail & Related papers (2025-01-23T08:32:27Z) - How Truncating Weights Improves Reasoning in Language Models [49.80959223722325]
We study how certain global associations tend to be stored in specific weight components or Transformer blocks.
We analyze how this arises during training, both empirically and theoretically.
arXiv Detail & Related papers (2024-06-05T08:51:08Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics by which Transformers achieve ICL relate to the technical challenges of training Transformers.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - Affine transformation estimation improves visual self-supervised
learning [4.40560654491339]
We show that adding a module to constrain the representations to be predictive of an affine transformation improves the performance and efficiency of the learning process.
We perform experiments in various modern self-supervised models and see a performance improvement in all cases.
arXiv Detail & Related papers (2024-02-14T10:32:58Z) - A Meta-Learning Perspective on Transformers for Causal Language Modeling [17.293733942245154]
The Transformer architecture has become prominent in developing large causal language models.
We establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task.
Within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models.
arXiv Detail & Related papers (2023-10-09T17:27:36Z) - Demystify Transformers & Convolutions in Modern Image Deep Networks [82.32018252867277]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
arXiv Detail & Related papers (2022-11-10T18:59:43Z) - Structural Biases for Improving Transformers on Translation into
Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, short for 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.