MCUFormer: Deploying Vision Transformers on Microcontrollers with
Limited Memory
- URL: http://arxiv.org/abs/2310.16898v3
- Date: Thu, 21 Dec 2023 14:56:46 GMT
- Title: MCUFormer: Deploying Vision Transformers on Microcontrollers with
Limited Memory
- Authors: Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
- Abstract summary: We propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory.
Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory.
- Score: 76.02294791513552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high price and heavy energy consumption of GPUs, deploying
deep models on IoT devices such as microcontrollers makes a significant
contribution to ecological AI. Conventional methods successfully enable
convolutional neural network inference on high-resolution images on
microcontrollers, while a framework for vision transformers, which achieve
state-of-the-art performance in many vision applications, still remains
unexplored. In this paper, we propose a hardware-algorithm co-optimization
method called MCUFormer to deploy vision transformers on microcontrollers with
extremely limited memory, where we jointly design the transformer architecture
and construct the inference operator library to fit the memory resource
constraint. More specifically, we generalize one-shot network architecture
search (NAS) to discover the optimal architecture with the highest task
performance given the memory budget of the microcontroller, where we enlarge
the existing search space of vision transformers by considering the low-rank
decomposition dimensions and patch resolution for memory reduction. To
construct the inference operator library for vision transformers, we schedule
the memory buffer during inference through operator integration, patch
embedding decomposition, and token overwriting, allowing the memory buffer to
be fully utilized to adapt to the forward pass of the vision transformer.
Experimental results demonstrate that our MCUFormer achieves 73.62% top-1
accuracy on ImageNet for image classification with 320KB of memory on the
STM32F746 microcontroller. Code is available at
https://github.com/liangyn22/MCUFormer.
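As a rough, hedged illustration of the low-rank decomposition dimension in this
enlarged search space, the NumPy toy below factorizes a single transformer
projection. The layer sizes and the rank r are made up for illustration; this
is not the authors' implementation, where U and V would come from factorizing
or directly training the projection rather than random initialization.

    import numpy as np

    # Toy sizes (assumptions): token width d_model, feed-forward width d_ff,
    # and a searched low-rank dimension r with r << min(d_model, d_ff).
    d_model, d_ff, r = 192, 768, 64

    rng = np.random.default_rng(0)
    W = rng.standard_normal((d_model, d_ff), dtype=np.float32)  # dense weight
    U = rng.standard_normal((d_model, r), dtype=np.float32)     # W ~= U @ V
    V = rng.standard_normal((r, d_ff), dtype=np.float32)

    x = rng.standard_normal((16, d_model), dtype=np.float32)    # 16 tokens

    y_dense = x @ W            # needs all d_model*d_ff weights resident
    y_lowrank = (x @ U) @ V    # needs only r*(d_model + d_ff) weights

    print(W.size, U.size + V.size)   # 147456 vs. 61440 parameters

Storing U and V instead of W cuts this layer's weights from 147,456 to 61,440
values, the kind of memory/accuracy trade-off the NAS can weigh per layer.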
Related papers
- Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances AI-driven signal restoration technology and paves the way for efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z)
- Optimizing the Deployment of Tiny Transformers on Low-Power MCUs [12.905978154498499]
This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels that maximize data reuse and avoid data marshaling operations in the crucial attention block.
We show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%.
arXiv Detail & Related papers (2024-04-03T14:14:08Z)
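A hedged sketch of the fused-weight attention idea mentioned above: because
Q K^T = (x Wq)(x Wk)^T = x (Wq Wk^T) x^T, the product Wq Wk^T can be folded
into a single matrix offline, removing one of the four attention weight
matrices (consistent with the reported ~25% parameter cut). The single-head
sizes below are invented; this is not the paper's kernel library.

    import numpy as np

    d = 64                                   # toy single-head width
    rng = np.random.default_rng(1)
    x = rng.standard_normal((10, d))         # 10 tokens
    Wq = rng.standard_normal((d, d))
    Wk = rng.standard_normal((d, d))

    # Standard attention scores: two projections, then Q @ K^T.
    scores = (x @ Wq) @ (x @ Wk).T

    # Fused weights: Wq @ Wk.T is precomputed offline, so only one
    # projection matrix is stored and applied at runtime.
    W_qk = Wq @ Wk.T
    scores_fused = x @ W_qk @ x.T

    print(np.allclose(scores, scores_fused))   # True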
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
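The memory saving above comes from reversible blocks, which can be sketched in
a few lines: the inputs are reconstructed exactly from the outputs, so
intermediate activations need not be stored for the backward pass. The
functions F and G below are toy placeholders, not actual attention/MLP
sub-blocks.

    import numpy as np

    def F(x):                      # placeholder for the attention sub-block
        return np.tanh(x)

    def G(x):                      # placeholder for the MLP sub-block
        return np.tanh(2.0 * x)

    def forward(x1, x2):
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def inverse(y1, y2):           # recompute inputs instead of storing them
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    rng = np.random.default_rng(2)
    x1, x2 = rng.standard_normal((2, 8)), rng.standard_normal((2, 8))
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
    print(np.allclose(x1, r1) and np.allclose(x2, r2))   # True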
- Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z)
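In sketch form, the partial-execution idea above looks like the following; the
slice size and layer shapes are invented for illustration, and chunking over
rows keeps the result identical to whole-buffer execution.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.standard_normal((64, 32), dtype=np.float32)
    W = rng.standard_normal((32, 16), dtype=np.float32)

    def relu(a):
        return np.maximum(a, 0.0)

    # Whole-buffer execution: the full 64x32 ReLU output is materialized.
    y_full = relu(x) @ W

    # Partial execution: 8 rows at a time, so the intermediate buffer
    # between the two operators never exists in full.
    y_part = np.empty((64, 16), dtype=np.float32)
    for i in range(0, 64, 8):
        y_part[i:i + 8] = relu(x[i:i + 8]) @ W

    print(np.allclose(y_full, y_part))   # True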
- Row-wise Accelerator for Vision Transformer [4.802171139840781]
This paper proposes a hardware accelerator for vision transformers with row-wise scheduling.
The implementation with TSMC 40nm CMOS technology only requires 262K gate count and 149KB buffer for 403.2 GOPS throughput at 600MHz clock frequency.
arXiv Detail & Related papers (2022-05-09T01:47:44Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
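A caricature of the patch-by-patch scheduling above: only one spatial patch's
activations are alive at a time, cutting peak memory roughly by the number of
patches. A real convolutional stage needs overlapping patches (halos) to cover
its receptive field; the pointwise stand-in below is an assumption that keeps
this toy exact.

    import numpy as np

    rng = np.random.default_rng(4)
    img = rng.standard_normal((1, 3, 64, 64)).astype(np.float32)

    def stage(x):                  # stand-in for the memory-heavy early stage
        return np.maximum(x, 0.0) * 2.0

    # Whole-image execution keeps the full 1x3x64x64 activation alive.
    out_full = stage(img)

    # Patch-based execution over a 2x2 grid of 32x32 patches: only one
    # patch's activations exist at a time (~4x lower peak memory here).
    out_patch = np.empty_like(out_full)
    for i in (0, 32):
        for j in (0, 32):
            out_patch[:, :, i:i + 32, j:j + 32] = stage(
                img[:, :, i:i + 32, j:j + 32])

    print(np.allclose(out_full, out_patch))   # True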
- HRFormer: High-Resolution Transformer for Dense Prediction [99.6060997466614]
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks.
We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet).
We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks.
arXiv Detail & Related papers (2021-10-18T15:37:58Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)