Vanilla Transformers are Transfer Capability Teachers
- URL: http://arxiv.org/abs/2403.01994v1
- Date: Mon, 4 Mar 2024 12:40:28 GMT
- Title: Vanilla Transformers are Transfer Capability Teachers
- Authors: Xin Lu, Yanyan Zhao, Bing Qin
- Abstract summary: We propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance.
MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks.
The MoE models guided by vanilla models can achieve both strong pre-training performance and transfer capability, ultimately enhancing their performance in downstream tasks.
- Score: 34.24324719229975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Mixture of Experts (MoE) Transformers have garnered increasing
attention due to their advantages in model capacity and computational
efficiency. However, studies have indicated that MoE Transformers underperform
vanilla Transformers in many downstream tasks, significantly diminishing the
practical value of MoE models. To explain this issue, we propose that the
pre-training performance and transfer capability of a model are joint
determinants of its downstream task performance. MoE models, in comparison to
vanilla models, have poorer transfer capability, leading to their subpar
performance in downstream tasks. To address this issue, we introduce the
concept of transfer capability distillation, positing that although vanilla
models have weaker performance, they are effective teachers of transfer
capability. The MoE models guided by vanilla models can achieve both strong
pre-training performance and transfer capability, ultimately enhancing their
performance in downstream tasks. We design a specific distillation method and
conduct experiments on the BERT architecture. Experimental results show a
significant improvement in the downstream performance of MoE models, and much
further evidence also strongly supports the concept of transfer capability
distillation. Finally, we attempt to interpret transfer capability distillation
and provide some insights from the perspective of model features.
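The abstract does not spell out the distillation objective itself. As a rough illustration of the general idea, below is a minimal PyTorch sketch in which an MoE student keeps its own masked-language-modelling (MLM) loss while an auxiliary term aligns its hidden states with those of a frozen vanilla (dense) teacher; the toy encoders, the MSE alignment term, and all hyperparameters are assumptions for illustration only, not the authors' published method.

```python
# Hypothetical sketch of "transfer capability distillation": the MoE student keeps its
# own MLM objective, and an auxiliary term pulls its hidden states toward those of a
# vanilla (dense) teacher. All modules and loss choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 1000, 64


class TinyDenseEncoder(nn.Module):
    """Stand-in for a vanilla (dense) Transformer teacher."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.ffn = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN))

    def forward(self, ids):
        h = self.emb(ids)
        return h + self.ffn(h)                                  # (batch, seq, HIDDEN)


class TinyMoEEncoder(nn.Module):
    """Stand-in for an MoE student: a router mixes several expert FFNs per token."""
    def __init__(self, n_experts=4):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.router = nn.Linear(HIDDEN, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN))
            for _ in range(n_experts)
        )
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, ids):
        h = self.emb(ids)
        gates = F.softmax(self.router(h), dim=-1)               # (batch, seq, n_experts)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)
        h = h + (expert_out * gates.unsqueeze(-2)).sum(-1)      # soft mixture for simplicity
        return h, self.lm_head(h)


def distillation_step(student, teacher, ids, mlm_labels, alpha=1.0):
    """One training step: MLM loss plus hidden-state alignment to the dense teacher."""
    with torch.no_grad():
        t_hidden = teacher(ids)                                 # teacher is frozen
    s_hidden, logits = student(ids)
    mlm_loss = F.cross_entropy(logits.view(-1, VOCAB), mlm_labels.view(-1), ignore_index=-100)
    align_loss = F.mse_loss(s_hidden, t_hidden)                 # the assumed "transfer capability" signal
    return mlm_loss + alpha * align_loss


teacher, student = TinyDenseEncoder(), TinyMoEEncoder()
ids = torch.randint(0, VOCAB, (2, 16))
labels = ids.clone()
labels[:, ::2] = -100                                           # pretend only odd positions are MLM targets (toy data)
loss = distillation_step(student, teacher, ids, labels)
loss.backward()
print(float(loss))
```

In a sketch like this the teacher does not need stronger pre-training performance than the student; it only supplies the representation signal that, in the paper's framing, carries transfer capability.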
Related papers
- Towards a Deeper Understanding of Transformer for Residential Non-intrusive Load Monitoring [0.0]
This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance.
It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models.
arXiv Detail & Related papers (2024-10-02T09:14:50Z)
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z)
- TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning [6.329214318116305]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling.
Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features.
We also design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos.
arXiv Detail & Related papers (2024-08-20T09:40:08Z)
- Exploring Model Transferability through the Lens of Potential Energy [78.60851825944212]
Transfer learning has become crucial in computer vision tasks due to the vast availability of pre-trained deep learning models.
Existing methods for measuring the transferability of pre-trained models rely on statistical correlations between encoded static features and task labels.
We present an insightful physics-inspired approach named PED to address these challenges.
arXiv Detail & Related papers (2023-08-29T07:15:57Z)
- An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning [18.69409646532038]
We investigate the capability of different transformer modules in transferring knowledge from a pre-trained model to a downstream task.
LayerNorms exhibit the best capacity for knowledge transfer with limited trainable weights.
arXiv Detail & Related papers (2023-02-01T11:20:18Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output when computing the loss are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
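The MiniLM entry above describes distilling the self-attention module of the teacher's last Transformer layer. Below is a minimal sketch of one common way to implement that idea, matching attention distributions with a KL term on toy tensors; single-head attention, random weights, and the plain KL objective are simplifying assumptions, not necessarily MiniLM's exact published losses.

```python
# Sketch of last-layer self-attention distillation: the student is trained so that its
# attention distributions match the teacher's. Dimensions and weights are toy placeholders.
import torch
import torch.nn.functional as F

def attention_probs(hidden, w_q, w_k):
    """Single-head attention distributions over the sequence."""
    q, k = hidden @ w_q, hidden @ w_k
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)                     # (batch, seq, seq)

batch, seq, d_t, d_s = 2, 8, 64, 32                      # the teacher can be wider than the student
teacher_hidden = torch.randn(batch, seq, d_t)            # last-layer teacher states (frozen)
student_hidden = torch.randn(batch, seq, d_s, requires_grad=True)

wq_t, wk_t = torch.randn(d_t, d_t), torch.randn(d_t, d_t)
wq_s = torch.randn(d_s, d_s, requires_grad=True)
wk_s = torch.randn(d_s, d_s, requires_grad=True)

with torch.no_grad():
    p_teacher = attention_probs(teacher_hidden, wq_t, wk_t)
p_student = attention_probs(student_hidden, wq_s, wk_s)

# KL(teacher || student), averaged over the batch of attention matrices.
loss = F.kl_div(p_student.clamp_min(1e-9).log(), p_teacher, reduction="batchmean")
loss.backward()
print(float(loss))
```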
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.