Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts
- URL: http://arxiv.org/abs/2309.04354v1
- Date: Fri, 8 Sep 2023 14:24:10 GMT
- Title: Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts
- Authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang,
Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi
Du
- Abstract summary: We explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
- Score: 55.282613372420805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due
to their ability to decouple model size from inference efficiency by only
activating a small subset of the model parameters for any given input token. As
such, sparse MoEs have enabled unprecedented scalability, resulting in
tremendous successes across domains such as natural language processing and
computer vision. In this work, we instead explore the use of sparse MoEs to
scale-down Vision Transformers (ViTs) to make them more attractive for
resource-constrained vision applications. To this end, we propose a simplified
and mobile-friendly MoE design where entire images rather than individual
patches are routed to the experts. We also propose a stable MoE training
procedure that uses super-class information to guide the router. We empirically
show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off
between performance and efficiency than the corresponding dense ViTs. For
example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense
counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only
54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
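The two design choices above (routing whole images rather than individual patches, and guiding the router with super-class labels) can be illustrated with a rough PyTorch sketch. This is not the authors' implementation: the mean-pooled routing input, the expert widths, the top-k value, and the exact form of the auxiliary loss are assumptions.

```python
# Minimal sketch of a per-image-routed MoE FFN layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerImageMoEFFN(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=8, k=1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)            # one routing decision per image
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x, superclass_labels=None):
        # x: (B, N, dim) patch tokens; the whole image is routed, not individual patches.
        logits = self.router(x.mean(dim=1))                   # (B, num_experts)
        topk_w, topk_idx = logits.softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                            # per-image loop kept simple for clarity
            for w, e in zip(topk_w[b], topk_idx[b]):
                out[b] += w * self.experts[int(e)](x[b])
        # Super-class guidance (assuming num_experts == number of super-classes): supervise
        # the router with a cross-entropy loss so that each expert specializes.
        aux_loss = (F.cross_entropy(logits, superclass_labels)
                    if superclass_labels is not None else None)
        return out, aux_loss
```

Because only k experts are executed per image, inference cost stays close to that of a single dense FFN while the total parameter count grows with the number of experts, which is the trade-off the abstract describes.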
Related papers
- MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile
Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
- DeViT: Decomposing Vision Transformers for Collaborative Inference in
Edge Devices [42.89175608336226]
Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
arXiv Detail & Related papers (2023-09-10T12:26:17Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experimental results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
Models [40.40784209977589]
This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention.
Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation.
Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining.
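A rough sketch of the block ordering described in this entry, i.e. a mobile (inverted-residual) convolution followed by self-attention, with the Transformer MLP removed. The expansion ratio, normalization placement and use of plain global attention are assumptions; details such as squeeze-and-excitation and downsampling variants are omitted.

```python
# Illustrative MOAT-style block: inverted-residual (mobile) convolution, then self-attention.
import torch
import torch.nn as nn

class MOATStyleBlock(nn.Module):
    def __init__(self, dim, expansion=4, heads=8):
        super().__init__()
        hidden = dim * expansion
        self.mbconv = nn.Sequential(              # inverted residual: expand -> depthwise -> project
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        x = x + self.mbconv(x)                    # mobile convolution with residual
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # (B, H*W, C)
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)
        tokens = tokens + attn_out                # self-attention with residual
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```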
arXiv Detail & Related papers (2022-10-04T18:00:06Z)
- Separable Self-attention for Mobile Vision Transformers [34.32399598443582]
This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$.
The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.
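One way to realize attention whose cost is linear in the token count $k$, loosely following the separable-attention idea summarized above; the projection names and shapes are illustrative rather than the paper's implementation.

```python
# Sketch of linear-complexity ("separable") self-attention: a single learned context score
# per token replaces the k x k attention matrix.
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_score = nn.Linear(dim, 1)     # one scalar context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, k, dim); total cost is O(k * dim) rather than O(k^2 * dim).
        scores = self.to_score(x).softmax(dim=1)                      # (B, k, 1), softmax over tokens
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, dim) global context vector
        values = torch.relu(self.to_value(x))                         # (B, k, dim)
        return self.proj(context * values)                            # broadcast context to every token
```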
arXiv Detail & Related papers (2022-06-06T15:31:35Z)
- MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer blocks into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformer models on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z)
- MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
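A minimal sketch of the weight-sharing component of weight multiplexing: one transformer block's parameters are reused at every depth, which is where the parameter reduction comes from. MiniViT's per-layer weight transformations and its distillation objective are not reproduced here.

```python
# Cross-layer weight sharing: a single block is applied `depth` times, so the encoder's
# parameter count is roughly 1/depth of an unshared stack.
import torch.nn as nn

class WeightSharedEncoder(nn.Module):
    def __init__(self, dim=192, depth=12, heads=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.depth = depth

    def forward(self, x):
        # x: (B, N, dim) token sequence; the same weights process it at every layer.
        for _ in range(self.depth):
            x = self.shared_block(x)
        return x
```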
arXiv Detail & Related papers (2022-04-14T17:59:05Z)
- Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
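The patch embedding referred to above is commonly implemented as a strided convolution whose kernel size equals the patch size; the sketch below is a generic illustration with arbitrary patch size and width, not code from the paper.

```python
# Patch embedding: a convolution with kernel size == stride == patch size turns each
# small image region into a single input feature vector.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 192                       # illustrative values
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 224, 224)                  # (B, C, H, W)
tokens = patch_embed(images).flatten(2).transpose(1, 2)
print(tokens.shape)                                   # torch.Size([2, 196, 192]): 14x14 patches per image
```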
arXiv Detail & Related papers (2022-01-24T16:42:56Z)
- Scaling Vision with Sparse Mixture of Experts [15.434534747230716]
We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks.
When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time.
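For contrast with the per-image routing of the Mobile V-MoE above, token-level (per-patch) top-k routing can be sketched roughly as follows; the top-k value is arbitrary, and expert-capacity limits and load-balancing losses are omitted.

```python
# Rough sketch of token-level top-k expert routing: every patch token gets its own
# routing decision. Capacity limits and auxiliary balancing losses are left out.
import torch
import torch.nn as nn

class TokenRoutedMoEFFN(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (B, N, dim) patch tokens
        weights = self.router(x).softmax(dim=-1)              # (B, N, num_experts)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)       # (B, N, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):             # dense loop over experts for clarity
            gate = (topk_w * (topk_idx == e)).sum(dim=-1, keepdim=True)  # (B, N, 1); zero if not routed to e
            out = out + gate * expert(x)
        return out
```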
arXiv Detail & Related papers (2021-06-10T17:10:56Z)