Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
- URL: http://arxiv.org/abs/2305.05948v1
- Date: Wed, 10 May 2023 07:39:57 GMT
- Title: Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
- Authors: Ye Lin, Shuhan Zhou, Yanyang Li, Anxiang Ma, Tong Xiao, Jingbo Zhu
- Abstract summary: We study how model width affects the Transformer model through a parameter-efficient multi-path structure.
Experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model.
- Score: 35.67070351304121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For years, model performance in machine learning has obeyed a power-law
relationship with model size. For parameter efficiency, recent studies have focused on
increasing model depth rather than width to achieve better performance. In this paper,
we study how model width affects the Transformer model through a parameter-efficient
multi-path structure. To better fuse features extracted from different paths, we add
three additional operations to each sublayer: a normalization at the end of each path,
a cheap operation to produce more features, and a learnable weighting mechanism to fuse
all features flexibly. Extensive experiments on 12 WMT machine translation tasks show
that, with the same number of parameters, the shallower multi-path model can achieve
similar or even better performance than the deeper model. This reveals that the
multi-path structure deserves more attention, and that model depth and width should be
balanced to train a better large-scale Transformer.
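The three operations named in the abstract lend themselves to a short sketch. Below is a minimal PyTorch illustration of how such a multi-path sublayer could be wired, assuming a generic `make_path` constructor for each path, a single linear layer standing in for the "cheap operation", and softmax-normalized fusion weights; none of these specifics are confirmed by the abstract, so treat it as a sketch under assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class MultiPathSublayer(nn.Module):
    def __init__(self, d_model, num_paths, make_path):
        super().__init__()
        # Each path is a smaller sublayer (e.g. attention or FFN) built by `make_path`.
        self.paths = nn.ModuleList([make_path() for _ in range(num_paths)])
        # (1) A normalization at the end of each path.
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_paths)])
        # (2) A cheap operation that produces extra features from each path's output
        #     (a plain linear map here, as an assumption).
        self.cheap = nn.Linear(d_model, d_model, bias=False)
        # (3) Learnable weights that fuse all features flexibly.
        self.fuse_weights = nn.Parameter(torch.ones(2 * num_paths))

    def forward(self, x):
        features = []
        for path, norm in zip(self.paths, self.norms):
            y = norm(path(x))               # path output + end-of-path normalization
            features.append(y)
            features.append(self.cheap(y))  # extra features from the cheap operation
        w = torch.softmax(self.fuse_weights, dim=0)
        fused = sum(w[i] * f for i, f in enumerate(features))
        return x + fused                    # residual connection around the sublayer
```

In practice each path would be a narrower attention or feed-forward sublayer so that the total parameter count matches a single wide path, e.g. `MultiPathSublayer(512, 2, make_path=lambda: nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512)))`.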
Related papers
- Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers [34.611309081801345]
This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly.
We propose DiffScaler, an efficient scaling strategy for diffusion models in which we train a minimal number of parameters to adapt to different tasks.
We find that transformer-based diffusion models significantly outperform CNN-based diffusion models when fine-tuning on smaller datasets.
arXiv Detail & Related papers (2024-04-15T17:55:43Z)
- DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging [34.643717080240584]
We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size.
Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations.
Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models.
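As a rough illustration of the averaging step described above, the sketch below mixes the current block output with all earlier representations using learnable scalar weights (one module per depth position). The identity-style initialization and the exact form of the average are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn


class DepthWeightedAverage(nn.Module):
    """Mix the current block output with all earlier representations."""

    def __init__(self, depth_so_far):
        super().__init__()
        # One weight per earlier representation (embedding + past blocks)
        # plus one for the current block output.
        self.weights = nn.Parameter(torch.zeros(depth_so_far + 1))
        with torch.no_grad():
            self.weights[-1] = 1.0  # start as identity: keep only the current output

    def forward(self, history, current):
        reps = torch.stack(list(history) + [current], dim=0)  # (depth+1, ...)
        w = self.weights.view(-1, *([1] * (reps.dim() - 1)))  # broadcast over the rest
        return (w * reps).sum(dim=0)                          # weighted mix over depth
```

A model would call such a module after each block, e.g. `x = dwa[i](history, x)` followed by `history.append(x)`.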
arXiv Detail & Related papers (2024-02-04T21:44:09Z)
- Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters across layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
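The cross-layer sharing idea can be made concrete with a few lines of PyTorch: a single encoder layer reused at every depth position, so the model gets deeper without adding parameters. The hyperparameters are placeholders, and this shows only the generic sharing scheme, not the specific variants the paper analyzes.

```python
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, depth=12):
        super().__init__()
        # A single layer's parameters are reused at every depth position.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):    # effective depth of `depth` layers,
            x = self.shared_layer(x)   # parameter count of a single layer
        return x
```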
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition reorganizes and factorizes the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
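The sketch below is a deliberately simplified cartoon of the central-tensor idea rather than an actual MPO construction (which factorizes each matrix into a chain of local tensors): a large shared weight carries most of the parameters, and each layer owns only small auxiliary factors, reduced here to diagonal scalings for illustration.

```python
import torch
import torch.nn as nn


class CentralSharedLinear(nn.Module):
    def __init__(self, central):
        super().__init__()
        d_out, d_in = central.shape
        self.central = central                       # shared across all layers
        self.left = nn.Parameter(torch.ones(d_out))  # layer-specific, only d_out params
        self.right = nn.Parameter(torch.ones(d_in))  # layer-specific, only d_in params

    def forward(self, x):
        # Reconstruct this layer's weight from the shared core and its own scalings.
        weight = self.left.unsqueeze(1) * self.central * self.right.unsqueeze(0)
        return x @ weight.t()


d_model = 512
central = nn.Parameter(torch.randn(d_model, d_model) * 0.02)  # one copy, model-wide
layers = nn.ModuleList(CentralSharedLinear(central) for _ in range(12))
```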
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a separate copy of the model weights for every task, but also causes instability during few-shot task adaptation.
We introduce a new mechanism, built on two key techniques, that improves adapter capacity without increasing parameters or computational cost.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
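A minimal sketch of that two-component layout follows: a shared transformer stack encodes the input and several lightweight ranking heads score it. The pooling choice, head shape, and dimensions are illustrative assumptions, and the distillation losses that pair each head with an ensemble teacher are omitted.

```python
import torch
import torch.nn as nn


class MultiHeadRanker(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # shared encoder stack
        self.heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(num_heads))

    def forward(self, x):
        h = self.encoder(x)[:, 0]   # pool the first-token representation (assumption)
        scores = torch.cat([head(h) for head in self.heads], dim=-1)
        return scores               # one ranking score per head
```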
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
- Recurrent multiple shared layers in Depth for Neural Machine Translation [11.660776324473645]
We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction.
Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our model achieves similar performance and inference speed with only 54.72% of the parameters.
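A minimal sketch of looping the encoder and decoder stacks in the depth direction: a stack of blocks applied `loops` times behaves like a correspondingly deeper model while keeping the parameters of a single stack. Block counts, the loop count, and the use of stock torch.nn modules are assumptions for illustration.

```python
import torch.nn as nn


class RecurrentDepthTransformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, enc_blocks=6, dec_blocks=6, loops=3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), enc_blocks)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), dec_blocks)
        self.loops = loops

    def forward(self, src, tgt):
        # `src` and `tgt` are already-embedded (batch, seq, d_model) tensors.
        memory = src
        for _ in range(self.loops):              # reuse the same encoder blocks in depth
            memory = self.encoder(memory)
        out = tgt
        for _ in range(self.loops):              # reuse the same decoder blocks in depth
            out = self.decoder(out, memory)
        return out
```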
arXiv Detail & Related papers (2021-08-23T21:21:45Z)
- Go Wider Instead of Deeper [11.4541055228727]
We propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper.
Our best model outperforms Vision Transformer (ViT) by 1.46% with 0.72× the trainable parameters.
Our framework can still surpass ViT and ViT-MoE by 0.83% and 2.08%, respectively.
arXiv Detail & Related papers (2021-07-25T14:44:24Z)