Scaling Vision with Sparse Mixture of Experts
- URL: http://arxiv.org/abs/2106.05974v1
- Date: Thu, 10 Jun 2021 17:10:56 GMT
- Title: Scaling Vision with Sparse Mixture of Experts
- Authors: Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann,
Rodolphe Jenatton, Andr\'e Susano Pinto, Daniel Keysers, Neil Houlsby
- Abstract summary: We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks.
When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time.
- Score: 15.434534747230716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent
scalability in Natural Language Processing. In Computer Vision, however, almost
all performant networks are "dense", that is, every input is processed by every
parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision
Transformer, that is scalable and competitive with the largest dense networks.
When applied to image recognition, V-MoE matches the performance of
state-of-the-art networks, while requiring as little as half of the compute at
inference time. Further, we propose an extension to the routing algorithm that
can prioritize subsets of each input across the entire batch, leading to
adaptive per-image compute. This allows V-MoE to trade-off performance and
compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE
to scale vision models, and train a 15B parameter model that attains 90.35% on
ImageNet.
Related papers
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era [26.878662961209997]
Vision Transformers (ViTs) remain the default choice for the image encoder.
ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy.
ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy.
arXiv Detail & Related papers (2024-04-02T17:40:29Z) - Vision Mamba: Efficient Visual Representation Learning with
Bidirectional State Space Model [51.10876815815515]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim)
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object
Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv)
MSA-Conv incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z) - Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z) - Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z) - MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
arXiv Detail & Related papers (2023-04-06T04:39:21Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.