Go Wider Instead of Deeper
- URL: http://arxiv.org/abs/2107.11817v2
- Date: Thu, 29 Jul 2021 10:17:23 GMT
- Title: Go Wider Instead of Deeper
- Authors: Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You
- Abstract summary: We propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper.
Our best model outperforms Vision Transformer (ViT) by $1.46\%$ with $0.72\times$ trainable parameters.
With $0.46\times$ and $0.13\times$ parameters, our framework can still surpass ViT and ViT-MoE by $0.83\%$ and $2.08\%$, respectively.
- Score: 11.4541055228727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer has recently achieved impressive results on various tasks. To
further improve the effectiveness and efficiency of the transformer, there are
two trains of thought among existing works: (1) going wider by scaling to more
trainable parameters; (2) going shallower by parameter sharing or model
compression along the depth. However, larger models usually do not scale
well when fewer tokens are available to train, and advanced parallelisms are
required when the model is extremely large. Smaller models usually achieve
inferior performance compared to the original transformer model due to the loss
of representation power. In this paper, to achieve better performance with
fewer trainable parameters, we propose a framework to deploy trainable
parameters efficiently by going wider instead of deeper. Specifically, we scale
along the model width by replacing the feed-forward network (FFN) with a
mixture-of-experts (MoE) layer. We then share the MoE layers across transformer
blocks using individual layer normalization. Such deployment plays the role of
transforming various semantic representations, which makes the model more
parameter-efficient and effective. To evaluate our framework, we design WideNet
and evaluate it on ImageNet-1K. Our best model outperforms Vision Transformer
(ViT) by $1.46\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$
and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by
$0.83\%$ and $2.08\%$, respectively.
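The architecture described in the abstract is simple enough to sketch. Below is a minimal PyTorch illustration, not the authors' implementation: it uses a naive top-1 router evaluated densely for readability, shares the attention layer along with the MoE layer (an assumption consistent with sharing whole blocks), and all names (`SimpleMoE`, `WideNetSketch`) and hyperparameters are illustrative.
```python
# Minimal sketch of the WideNet idea: one MoE layer (replacing the FFN) and one
# attention layer are reused across all blocks; only the LayerNorms differ per block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    """Mixture-of-experts FFN with a naive top-1 router (illustration only)."""

    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)  # routing probabilities
        top1 = probs.argmax(dim=-1)                # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1).float()
            out = out + mask * probs[..., i:i + 1] * expert(x)  # dense for clarity
        return out


class WideNetSketch(nn.Module):
    """One attention layer and one MoE layer shared across `depth` blocks;
    only the LayerNorm pairs are block-specific."""

    def __init__(self, dim=192, depth=12, heads=3, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared
        self.moe = SimpleMoE(dim, num_experts)                           # shared
        self.norms1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x):                          # x: (batch, tokens, dim)
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm residual
            x = x + self.moe(ln2(x))
        return x


# Usage: out = WideNetSketch()(torch.randn(2, 16, 192))
```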
Related papers
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance.
Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop.
We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
arXiv Detail & Related papers (2024-10-28T02:15:45Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Multi-Path Transformer is Better: A Case Study on Neural Machine
Translation [35.67070351304121]
We study how model width affects the Transformer model through a parameter-efficient multi-path structure.
Experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model.
arXiv Detail & Related papers (2023-05-10T07:39:57Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z) - Sliced Recursive Transformer [23.899076070924153]
Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
arXiv Detail & Related papers (2021-11-09T17:59:14Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Recurrent multiple shared layers in Depth for Neural Machine Translation [11.660776324473645]
We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction.
Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our model achieves similar performance and inference speed with only 54.72% of the parameters.
arXiv Detail & Related papers (2021-08-23T21:21:45Z) - Recurrent Parameter Generators [42.159272098922685]
We present a generic method for recurrently using the same parameters for many different convolution layers to build a deep network.
We demonstrate how to build a one-layer neural network that achieves performance comparable to traditional CNN models.
arXiv Detail & Related papers (2021-07-15T04:23:59Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing (sketched after this list).
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
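A hedged sketch of the expert-prototyping routing pattern summarized in the last entry: the expert pool is split into $k$ groups (prototypes) and each token is routed top-1 within every group, so $k$ experts fire per token at roughly constant cost. Class names, group sizes, and the gating details below are illustrative assumptions, not the paper's code.
```python
# Illustrative "expert prototyping" router: top-1 routing inside each expert group.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypedMoE(nn.Module):
    """Experts split into groups ("prototypes"); top-1 routing within each group."""

    def __init__(self, dim, num_experts=8, num_prototypes=2, hidden=256):
        super().__init__()
        assert num_experts % num_prototypes == 0
        self.group_size = num_experts // num_prototypes
        self.routers = nn.ModuleList(
            [nn.Linear(dim, self.group_size) for _ in range(num_prototypes)]
        )
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, dim)
        out = torch.zeros_like(x)
        for g, router in enumerate(self.routers):
            probs = F.softmax(router(x), dim=-1)   # scores over this group only
            top1 = probs.argmax(dim=-1)            # top-1 expert within the group
            for i in range(self.group_size):
                expert = self.experts[g * self.group_size + i]
                mask = (top1 == i).unsqueeze(-1).float()
                out = out + mask * probs[..., i:i + 1] * expert(x)  # dense for clarity
        return out


# Usage: y = PrototypedMoE(dim=192)(torch.randn(2, 16, 192))
```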