Vision Transformers for Mobile Applications: A Short Survey
- URL: http://arxiv.org/abs/2305.19365v1
- Date: Tue, 30 May 2023 19:12:08 GMT
- Title: Vision Transformers for Mobile Applications: A Short Survey
- Authors: Nahid Alam, Steven Kolawole, Simardeep Sethi, Nishant Bansali, Karina
Nguyen
- Abstract summary: Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many computer vision tasks.
Deploying these large-scale ViTs, however, is resource-intensive and often infeasible on mobile devices.
We look into a few ViTs specifically designed for mobile applications and observe that they either modify the transformer architecture or are built around a combination of CNN and transformer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have demonstrated state-of-the-art performance on
many computer vision tasks. Unfortunately, deploying these large-scale ViTs is
resource-intensive and often infeasible on mobile devices. While much of the
community is building ever-larger ViTs, we ask the opposite question: how small
can a ViT be, within the accuracy/inference-latency tradeoff, and still be
suitable for mobile deployment? We look into a few ViTs specifically designed
for mobile applications and observe that they either modify the transformer
architecture or are built around a combination of CNN and transformer. Recent
work has also attempted to create sparse ViT networks and has proposed
alternatives to the attention module. In this paper, we study these
architectures, identify the challenges, and analyze what really makes a vision
transformer suitable for mobile applications. We aim to serve as a baseline for
future research directions and to lay the foundation for choosing an
appropriate vision transformer architecture for applications running on
mobile devices.
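
For context on the hybrid designs the abstract refers to, the following is a minimal PyTorch sketch of the general CNN-plus-transformer pattern (a cheap local convolution branch combined with self-attention over the feature map). The module name HybridMobileBlock and all hyperparameters are illustrative assumptions, not the architecture of any specific paper surveyed here.

```python
# Minimal sketch (assumes PyTorch >= 1.9) of the hybrid CNN + transformer
# pattern that several mobile ViTs follow. HybridMobileBlock and its
# hyperparameters are hypothetical and do not come from any surveyed paper.
import torch
import torch.nn as nn


class HybridMobileBlock(nn.Module):
    """Local convolutions -> global self-attention over tokens -> fuse."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Cheap local representation (depthwise + pointwise convolution).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Lightweight transformer encoder applied to flattened tokens.
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=2 * channels, batch_first=True,
        )
        # Fuse the global branch back with the convolutional branch.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # Treat every spatial position of the conv feature map as a token.
        tokens = local.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens = self.attn(tokens)                       # global self-attention
        glob = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))


if __name__ == "__main__":
    block = HybridMobileBlock(channels=64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The design choice this sketch illustrates is the one the survey highlights: convolutions handle local spatial structure cheaply, while a small attention module adds global context, keeping parameter count and latency closer to mobile CNN budgets.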
Related papers
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision
Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several limitations of convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z) - PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party computation protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Compressing Vision Transformers for Low-Resource Visual Learning [7.662469543657508]
Vision transformer (ViT) and its variants offer state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation.
These models are large and computation-heavy, which limits their deployment in mobile and edge scenarios.
We aim to take a step toward bringing vision transformers to the edge by utilizing popular model compression techniques such as distillation, pruning, and quantization (a minimal sketch of two of these techniques appears after this list).
arXiv Detail & Related papers (2023-09-05T23:33:39Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple DG approach for ViTs, coined as self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z) - MoCoViT: Mobile Convolutional Vision Transformer [13.233314183471213]
We present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformer blocks into mobile convolutional networks.
MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications.
Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformers on various vision tasks.
arXiv Detail & Related papers (2022-05-25T10:21:57Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision
Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z) - TransCenter: Transformers with Dense Queries for Multiple-Object
Tracking [87.75122600164167]
We argue that the standard representation -- bounding boxes -- is not adapted to learning transformers for multiple-object tracking.
We propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets.
arXiv Detail & Related papers (2021-03-28T14:49:36Z) - A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
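
The PriViT entry above describes selectively replacing non-polynomial operations with polynomial ones so that secure multi-party computation only needs additions and multiplications. The snippet below is a rough, hypothetical sketch of that idea, swapping nn.GELU for a fixed degree-2 polynomial; the coefficients and the helper taylorize_gelu are illustrative assumptions, not the paper's algorithm (which chooses what to Taylorize while preserving accuracy).

```python
# Rough sketch of the "Taylorize the nonlinearity" idea referenced for PriViT:
# swap a non-polynomial activation (GELU) for a low-degree polynomial so that
# MPC-style private inference only evaluates additions and multiplications.
# The coefficients below are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class QuadraticGELU(nn.Module):
    """Degree-2 polynomial stand-in for GELU (illustrative coefficients)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.125 * x * x + 0.5 * x + 0.25


def taylorize_gelu(model: nn.Module) -> nn.Module:
    """Replace every nn.GELU in a model with the polynomial surrogate."""
    for name, child in model.named_children():
        if isinstance(child, nn.GELU):
            setattr(model, name, QuadraticGELU())
        else:
            taylorize_gelu(child)
    return model


if __name__ == "__main__":
    mlp = nn.Sequential(nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 192))
    taylorize_gelu(mlp)
    print(mlp)  # the GELU is now a QuadraticGELU
    _ = mlp(torch.randn(4, 192))
```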
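
The compression entry above names distillation, pruning, and quantization as its main tools. Below is a minimal, hypothetical sketch of two of them using standard PyTorch utilities (L1 magnitude pruning and post-training dynamic quantization); it is meant only to make the terms concrete, not to reproduce that paper's pipeline.

```python
# Minimal sketch of two compression techniques named in the entry above:
# L1 magnitude pruning and post-training dynamic quantization (CPU inference).
# Distillation is omitted for brevity; the toy MLP stands in for a ViT's
# feed-forward blocks and is purely illustrative.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def compress(model: nn.Module, amount: float = 0.3) -> nn.Module:
    # 1) Prune the smallest-magnitude 30% of weights in every Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    # 2) Quantize the remaining Linear weights to int8 for CPU inference.
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )


if __name__ == "__main__":
    toy = nn.Sequential(nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 192))
    small = compress(toy)
    _ = small(torch.randn(2, 192))
    print(small)
```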
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.