TFormer: A Transmission-Friendly ViT Model for IoT Devices
- URL: http://arxiv.org/abs/2302.07734v1
- Date: Wed, 15 Feb 2023 15:36:10 GMT
- Title: TFormer: A Transmission-Friendly ViT Model for IoT Devices
- Authors: Zhichao Lu, Chuntao Ding, Felix Juefei-Xu, Vishnu Naresh Boddeti,
Shangguang Wang, and Yun Yang
- Abstract summary: This paper proposes a transmission-friendly ViT model, TFormer, for deployment on resource-constrained IoT devices with the assistance of a cloud server.
Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image classification, object detection, and semantic segmentation tasks demonstrate that the proposed model outperforms other state-of-the-art models.
- Score: 23.67389080796814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying high-performance vision transformer (ViT) models on ubiquitous
Internet of Things (IoT) devices to provide high-quality vision services will
revolutionize the way we live, work, and interact with the world. Because of the
conflict between the limited resources of IoT devices and the resource demands of
ViT models, using cloud servers to assist ViT model training has become mainstream.
However, because existing ViT models have a large number of parameters and
floating-point operations (FLOPs), the model parameters transmitted by cloud servers
are large and the resulting models are difficult to run on resource-constrained IoT
devices. To this end, this paper proposes a
transmission-friendly ViT model, TFormer, for deployment on
resource-constrained IoT devices with the assistance of a cloud server. The
high performance and small number of model parameters and FLOPs of TFormer are
attributed to the proposed hybrid layer and the proposed partially connected
feed-forward network (PCS-FFN). The hybrid layer consists of nonlearnable
modules and a pointwise convolution, which extract multitype and multiscale
features with only a few parameters and FLOPs, improving TFormer's
performance. The PCS-FFN adopts group convolution to reduce the number of
parameters. The key idea of this paper is to propose TFormer with few model
parameters and FLOPs to facilitate applications running on resource-constrained
IoT devices to benefit from the high performance of the ViT models.
Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image
classification, object detection, and semantic segmentation tasks demonstrate
that the proposed model outperforms other state-of-the-art models.
Specifically, TFormer-S achieves 5% higher accuracy on ImageNet-1K than
ResNet18 with 1.4$\times$ fewer parameters and FLOPs.
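The abstract describes the two components in enough detail to sketch them. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the class names, the choice of average pooling as the nonlearnable modules, and all hyperparameters are assumptions. It shows how nonlearnable pooling branches followed by a pointwise convolution could produce multitype, multiscale features with few parameters, and how grouped 1x1 convolutions could realize the group convolution that PCS-FFN uses to cut parameters.

```python
# Hypothetical sketch (not the authors' released code) of the two components
# named in the abstract: a hybrid layer and a PCS-FFN-style block.
import torch
import torch.nn as nn


class HybridLayer(nn.Module):
    """Nonlearnable multiscale branches followed by a pointwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # Nonlearnable modules: average pooling at several scales adds no parameters.
        self.pools = nn.ModuleList([
            nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in (3, 5, 7)
        ])
        # A single pointwise (1x1) convolution fuses the multitype, multiscale
        # features; it is the only learnable part of this layer.
        self.pointwise = nn.Conv2d(channels * (len(self.pools) + 1), channels,
                                   kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x] + [pool(x) for pool in self.pools]
        return self.pointwise(torch.cat(feats, dim=1))


class PCSFFN(nn.Module):
    """Feed-forward block whose projections are grouped 1x1 convolutions,
    so each group connects to only a subset of channels."""

    def __init__(self, channels: int, expansion: int = 4, groups: int = 4):
        super().__init__()
        hidden = channels * expansion
        # Group convolution cuts the projection weights roughly by the group count.
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1, groups=groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)            # (batch, channels, height, width)
    y = PCSFFN(64)(HybridLayer(64)(x))
    print(y.shape)                            # torch.Size([1, 64, 56, 56])
```

With groups=4, each grouped projection stores roughly a quarter of the weights of a dense projection of the same width, which is the kind of parameter saving the abstract attributes to PCS-FFN.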
Related papers
- ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices [13.533267828812455]
We propose a novel Vision Transformer splitting framework, ED-ViT, to execute complex models across multiple edge devices efficiently.
Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes.
We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices.
arXiv Detail & Related papers (2024-10-15T14:38:14Z)
- OnDev-LCT: On-Device Lightweight Convolutional Transformers towards federated learning [29.798780069556074]
Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices.
We propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources.
arXiv Detail & Related papers (2024-01-22T02:17:36Z)
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices [42.89175608336226]
Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
However, ViT models suffer from a vast number of parameters and high computation cost, making them difficult to deploy on resource-constrained edge devices.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
arXiv Detail & Related papers (2023-09-10T12:26:17Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.