Optimizing the Deployment of Tiny Transformers on Low-Power MCUs
- URL: http://arxiv.org/abs/2404.02945v1
- Date: Wed, 3 Apr 2024 14:14:08 GMT
- Title: Optimizing the Deployment of Tiny Transformers on Low-Power MCUs
- Authors: Victor J. B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini,
- Abstract summary: This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels to maximize data reuse and avoid data marshaling operations into the crucial attention block.
We show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%.
- Score: 12.905978154498499
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.
Related papers
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions can feature on average 21% latency reduction.
arXiv Detail & Related papers (2024-09-25T11:10:33Z) - SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z) - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z) - SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design [5.962184741057505]
This paper aims to address computational redundancy at all design levels in a memory-efficient manner.
We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance.
We introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff.
arXiv Detail & Related papers (2024-01-29T09:12:23Z) - MCUFormer: Deploying Vision Transformers on Microcontrollers with
Limited Memory [76.02294791513552]
We propose a hardware-algorithm co-optimizations method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory.
Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory.
arXiv Detail & Related papers (2023-10-25T18:00:26Z) - Reduced Precision Floating-Point Optimization for Deep Neural Network
On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach results more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
arXiv Detail & Related papers (2023-05-30T16:14:16Z) - Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU)
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z) - An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse
Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that compared to other methods, N:M sparse Transformers, generated using IDP, achieves an average of 6.7% improvement on accuracy with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z) - Bilaterally Slimmable Transformer for Elastic and Efficient Visual
Question Answering [75.86788916930377]
bilaterally slimmable Transformer (BST) integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
Smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.