TFormer: A Transmission-Friendly ViT Model for IoT Devices
- URL: http://arxiv.org/abs/2302.07734v1
- Date: Wed, 15 Feb 2023 15:36:10 GMT
- Title: TFormer: A Transmission-Friendly ViT Model for IoT Devices
- Authors: Zhichao Lu, Chuntao Ding, Felix Juefei-Xu, Vishnu Naresh Boddeti,
Shangguang Wang, and Yun Yang
- Abstract summary: This paper proposes a transmission-friendly ViT model, TFormer, for deployment on resource-constrained IoT devices with the assistance of a cloud server.
Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image classification, object detection, and semantic segmentation tasks demonstrate that the proposed model outperforms other state-of-the-art models.
- Score: 23.67389080796814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying high-performance vision transformer (ViT) models on ubiquitous
Internet of Things (IoT) devices to provide high-quality vision services will
revolutionize the way we live, work, and interact with the world. Due to the
mismatch between the limited resources of IoT devices and
resource-intensive ViT models, the use of cloud servers to assist ViT model
training has become mainstream. However, because existing ViT models have
large numbers of parameters and floating-point operations (FLOPs), the model
parameters that the cloud server must transmit are large, and the resulting
models are difficult to run on resource-constrained IoT devices. To this end,
this paper proposes a
transmission-friendly ViT model, TFormer, for deployment on
resource-constrained IoT devices with the assistance of a cloud server. The
high performance and small number of model parameters and FLOPs of TFormer are
attributed to the proposed hybrid layer and the proposed partially connected
feed-forward network (PCS-FFN). The hybrid layer consists of nonlearnable
modules and a pointwise convolution, which can obtain multitype and multiscale
features with only a few parameters and FLOPs to improve the TFormer
performance. The PCS-FFN adopts group convolution to reduce the number of
parameters. The key idea of this paper is to design TFormer with few model
parameters and FLOPs so that applications running on resource-constrained IoT
devices can benefit from the high performance of ViT models.
Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image
classification, object detection, and semantic segmentation tasks demonstrate
that the proposed model outperforms other state-of-the-art models.
Specifically, TFormer-S achieves 5% higher accuracy on ImageNet-1K than
ResNet18 with 1.4$\times$ fewer parameters and FLOPs.
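The abstract describes the two components only at a high level. The following is a minimal PyTorch sketch, not the authors' implementation, of how a hybrid layer built from non-learnable operators plus a pointwise convolution and a group-convolution FFN (PCS-FFN) could be structured; the specific pooling operators, scales, expansion ratio, and group count are assumptions made for illustration.

```python
# Minimal sketch of the two components described in the abstract.
# The non-learnable operators, pooling scales, expansion ratio, and group
# count are NOT given in the abstract; the choices below are assumptions.
import torch
import torch.nn as nn


class HybridLayer(nn.Module):
    """Non-learnable multi-type/multi-scale branches fused by a pointwise conv."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical non-learnable branches: identity, 3x3 and 5x5 average
        # pooling, and 3x3 max pooling (multi-type, multi-scale, zero params).
        self.branches = nn.ModuleList([
            nn.Identity(),
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.AvgPool2d(5, stride=1, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # The only learnable part: a 1x1 (pointwise) convolution that mixes the
        # concatenated branch outputs back down to `channels` features.
        self.pointwise = nn.Conv2d(len(self.branches) * channels, channels,
                                   kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.pointwise(feats)


class PCSFFN(nn.Module):
    """Feed-forward network whose 1x1 layers use group convolution."""

    def __init__(self, channels: int, expansion: int = 4, groups: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, groups=groups),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1, groups=groups),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)   # (batch, channels, H, W)
    y = PCSFFN(64)(HybridLayer(64)(x))
    print(y.shape)                   # torch.Size([1, 64, 56, 56])
```

With `groups=4`, each 1x1 layer in the sketch stores one quarter of the weights of its dense counterpart, which illustrates the kind of parameter reduction the abstract attributes to the PCS-FFN.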
Related papers
- MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets [3.8601741392210434]
Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability in modelling long-range dependencies.
We present a small-size ViT architecture with multi-scale self-attention mechanism and convolution blocks to model different scales of attention.
Our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
arXiv Detail & Related papers (2025-01-10T15:18:05Z) - Slicing Vision Transformer for Flexible Inference [79.35046907288518]
We propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs.
Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
arXiv Detail & Related papers (2024-12-06T05:31:42Z) - OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone.
OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z) - ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices [13.533267828812455]
We propose a novel Vision Transformer splitting framework, ED-ViT, to execute complex models across multiple edge devices efficiently.
Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes.
We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices.
arXiv Detail & Related papers (2024-10-15T14:38:14Z) - OnDev-LCT: On-Device Lightweight Convolutional Transformers towards
federated learning [29.798780069556074]
Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices.
We propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources.
arXiv Detail & Related papers (2024-01-22T02:17:36Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT)
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.