Go Wider Instead of Deeper
- URL: http://arxiv.org/abs/2107.11817v2
- Date: Thu, 29 Jul 2021 10:17:23 GMT
- Title: Go Wider Instead of Deeper
- Authors: Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You
- Abstract summary: We propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper.
Our best model outperforms Vision Transformer (ViT) by $1.46\%$ with $0.72\times$ trainable parameters.
With $0.46\times$ and $0.13\times$ parameters, our framework can still surpass ViT and ViT-MoE by $0.83\%$ and $2.08\%$, respectively.
- Score: 11.4541055228727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer has recently achieved impressive results on various tasks. To
further improve the effectiveness and efficiency of the transformer, there are
two trains of thought among existing works: (1) going wider by scaling to more
trainable parameters; (2) going shallower by parameter sharing or model
compression along the depth. However, larger models usually do not scale
well when fewer tokens are available to train, and advanced parallelisms are
required when the model is extremely large. Smaller models usually achieve
inferior performance compared to the original transformer model due to the loss
of representation power. In this paper, to achieve better performance with
fewer trainable parameters, we propose a framework to deploy trainable
parameters efficiently by going wider instead of deeper. Specifically, we scale
along the model width by replacing the feed-forward network (FFN) with a
mixture-of-experts (MoE) layer. We then share the MoE layers across transformer
blocks using individual layer normalization. Such deployment plays the role of
transforming various semantic representations, which makes the model more
parameter-efficient and effective. To evaluate our framework, we design WideNet
and evaluate it on ImageNet-1K. Our best model outperforms Vision Transformer
(ViT) by $1.46\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$
and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by
$0.83\%$ and $2.08\%$, respectively.
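The architecture described in the abstract is simple enough to sketch. Below is a minimal PyTorch illustration, not the authors' implementation: it uses a naive top-1 router evaluated densely for readability, shares the attention layer along with the MoE layer (an assumption consistent with sharing whole blocks), and all names (`SimpleMoE`, `WideNetSketch`) and hyperparameters are illustrative.
```python
# Minimal sketch of the WideNet idea: one MoE layer (replacing the FFN) and one
# attention layer are reused across all blocks; only the LayerNorms differ per block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    """Mixture-of-experts FFN with a naive top-1 router (illustration only)."""

    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)  # routing probabilities
        top1 = probs.argmax(dim=-1)                # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1).float()
            out = out + mask * probs[..., i:i + 1] * expert(x)  # dense for clarity
        return out


class WideNetSketch(nn.Module):
    """One attention layer and one MoE layer shared across `depth` blocks;
    only the LayerNorm pairs are block-specific."""

    def __init__(self, dim=192, depth=12, heads=3, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared
        self.moe = SimpleMoE(dim, num_experts)                           # shared
        self.norms1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x):                          # x: (batch, tokens, dim)
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm residual
            x = x + self.moe(ln2(x))
        return x


# Usage: out = WideNetSketch()(torch.randn(2, 16, 192))
```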
Related papers
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance.
Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop.
We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
arXiv Detail & Related papers (2024-10-28T02:15:45Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Multi-Path Transformer is Better: A Case Study on Neural Machine
Translation [35.67070351304121]
We study how model width affects the Transformer model through a parameter-efficient multi-path structure.
Experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model.
arXiv Detail & Related papers (2023-05-10T07:39:57Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z) - Sliced Recursive Transformer [23.899076070924153]
Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
arXiv Detail & Related papers (2021-11-09T17:59:14Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Recurrent multiple shared layers in Depth for Neural Machine Translation [11.660776324473645]
We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction.
Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our model achieves similar performance and inference speed with only 54.72% of the parameters.
arXiv Detail & Related papers (2021-08-23T21:21:45Z) - Recurrent Parameter Generators [42.159272098922685]
We present a generic method for recurrently using the same parameters for many different convolution layers to build a deep network.
We demonstrate how to build a one-layer neural network that achieves performance comparable to traditional CNN models.
arXiv Detail & Related papers (2021-07-15T04:23:59Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing (sketched after this list).
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
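A hedged sketch of the expert-prototyping routing pattern summarized in the last entry: the expert pool is split into $k$ groups (prototypes) and each token is routed top-1 within every group, so $k$ experts fire per token at roughly constant cost. Class names, group sizes, and the gating details below are illustrative assumptions, not the paper's code.
```python
# Illustrative "expert prototyping" router: top-1 routing inside each expert group.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypedMoE(nn.Module):
    """Experts split into groups ("prototypes"); top-1 routing within each group."""

    def __init__(self, dim, num_experts=8, num_prototypes=2, hidden=256):
        super().__init__()
        assert num_experts % num_prototypes == 0
        self.group_size = num_experts // num_prototypes
        self.routers = nn.ModuleList(
            [nn.Linear(dim, self.group_size) for _ in range(num_prototypes)]
        )
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, dim)
        out = torch.zeros_like(x)
        for g, router in enumerate(self.routers):
            probs = F.softmax(router(x), dim=-1)   # scores over this group only
            top1 = probs.argmax(dim=-1)            # top-1 expert within the group
            for i in range(self.group_size):
                expert = self.experts[g * self.group_size + i]
                mask = (top1 == i).unsqueeze(-1).float()
                out = out + mask * probs[..., i:i + 1] * expert(x)  # dense for clarity
        return out


# Usage: y = PrototypedMoE(dim=192)(torch.randn(2, 16, 192))
```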