Reduced Precision Floating-Point Optimization for Deep Neural Network
On-Device Learning on MicroControllers
- URL: http://arxiv.org/abs/2305.19167v1
- Date: Tue, 30 May 2023 16:14:16 GMT
- Title: Reduced Precision Floating-Point Optimization for Deep Neural Network
On-Device Learning on MicroControllers
- Authors: Davide Nadalini, Manuele Rusci, Luca Benini, Francesco Conti
- Abstract summary: This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
- Score: 15.37318446043671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units
(MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep
Neural Network (DNN) models in future TinyML applications. This paper tackles
this challenge by introducing a novel reduced precision optimization technique
for ODL primitives on MCU-class devices, leveraging state-of-the-art
advancements in RISC-V RV32 architectures with support for vectorized 16-bit
floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our
approach for the Forward and Backward steps of the Back-Propagation training
algorithm is composed of specialized shape transform operators and Matrix
Multiplication (MM) kernels, accelerated with parallelization and loop
unrolling. When evaluated on a single training step of a 2D Convolution layer,
the SIMD-optimized FP16 primitives are up to 1.72$\times$ faster than the
FP32 baseline on a RISC-V-based 8+1-core MCU. Average computing efficiencies
of 3.11 and 0.81 Multiply-and-Accumulate operations per clock cycle (MAC/clk)
are measured for the end-to-end training tasks of a ResNet8 and a DS-CNN
for Image Classification and Keyword Spotting, respectively -- requiring 17.1
ms and 6.4 ms on the target platform to compute a training step on a single
sample. Overall, our approach is more than two orders of magnitude faster than
existing ODL software frameworks for single-core MCUs and outperforms previous
FP32 parallel implementations by 1.6$\times$ in a Continual Learning setup.
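As a rough illustration of the kernel structure the abstract describes (a sketch under stated assumptions, not the authors' released code), the following C fragment shows a matrix-multiplication inner loop that packs two FP16 values per 32-bit word and unrolls by two over output columns. GCC vector extensions stand in for the target's vectorized FP16 SIMD instructions, and the pre-transposed B_T layout is an assumption standing in for the paper's shape-transform operators.

#include <stddef.h>

/* Two FP16 lanes per 32-bit word, mimicking RV32 packed-SIMD registers. */
typedef _Float16 v2f16 __attribute__((vector_size(4)));

/* C[M][N] += A[M][K] * B[K][N], with K = 2*K2 and B pre-transposed to
 * B_T[N][K] so both operands are read with unit stride. */
void mm_fp16_unrolled(const v2f16 *A, const v2f16 *B_T, _Float16 *C,
                      size_t M, size_t N, size_t K2)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j + 1 < N; j += 2) {  /* 2x unrolling over columns */
            v2f16 acc0 = {0, 0}, acc1 = {0, 0};
            for (size_t k = 0; k < K2; k++) {    /* one SIMD op = 2 FP16 MACs */
                v2f16 a = A[i * K2 + k];
                acc0 += a * B_T[j * K2 + k];
                acc1 += a * B_T[(j + 1) * K2 + k];
            }
            C[i * N + j]     += acc0[0] + acc0[1];  /* horizontal reduction */
            C[i * N + j + 1] += acc1[0] + acc1[1];
        }
        /* A trailing odd column, if any, is omitted for brevity. */
    }
}

On the 8+1-core target the paper additionally parallelizes such kernels across cores; the unrolling shown here only amortizes the per-iteration loop overhead.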
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions can achieve an average latency reduction of 21%.
arXiv Detail & Related papers (2024-09-25T11:10:33Z) - Optimizing the Deployment of Tiny Transformers on Low-Power MCUs [12.905978154498499]
This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels that maximize data reuse and avoid data marshaling operations in the crucial attention block.
We show that our MHSA depth-first tiling scheme reduces the peak memory by up to 6.19x, while fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%.
arXiv Detail & Related papers (2024-04-03T14:14:08Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves these challenges by introducing the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes).
arXiv Detail & Related papers (2022-11-28T11:24:51Z)
- Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers (see the sketch after this entry).
Experiments are conducted on multiple LSTM- and GRU-based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z)
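A minimal double-buffering skeleton in C illustrates the interleaving pattern from the entry above; dma_start_load(), dma_wait(), lstm_step(), and the tile size are hypothetical placeholders, not the paper's API. It shows how the transfer of tile t+1 can overlap the computation on tile t:

#include <stdint.h>

#define TILE_BYTES 2048  /* arbitrary tile size for the sketch */

/* Hypothetical asynchronous DMA and compute primitives (not the paper's). */
extern void dma_start_load(void *dst, const void *src, uint32_t bytes);
extern void dma_wait(void);
extern void lstm_step(const int8_t *weights, int8_t *state);

void run_layer(const int8_t *weights_ext, int8_t *state, uint32_t n_tiles)
{
    static int8_t buf[2][TILE_BYTES];      /* ping-pong buffers in local RAM */
    dma_start_load(buf[0], weights_ext, TILE_BYTES);
    for (uint32_t t = 0; t < n_tiles; t++) {
        dma_wait();                        /* tile t is now resident */
        if (t + 1 < n_tiles)               /* prefetch tile t+1 ... */
            dma_start_load(buf[(t + 1) & 1],
                           weights_ext + (t + 1) * TILE_BYTES, TILE_BYTES);
        lstm_step(buf[t & 1], state);      /* ... while computing on tile t */
    }
}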
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network-based regularization (a generic form of one such iteration is written out after this entry).
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
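In generic form (illustrative notation, not necessarily the one used by GLEAM), one unrolled stage alternates a data-consistency gradient step on the forward operator $A$ and measurements $y$ with a learned regularizer $\mathcal{D}_{\theta_k}$:

$$x^{(k+1)} = \mathcal{D}_{\theta_k}\!\left(x^{(k)} - \eta_k\, A^{H}\!\big(A x^{(k)} - y\big)\right)$$

Greedy learning then trains each stage's parameters $\theta_k$ locally instead of backpropagating through all stages at once, which is what keeps the memory footprint manageable in high-dimensional settings.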
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a simplified quantization sketch follows this entry).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
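A minimal C sketch of the block-wise idea from the entry above, assuming per-block absmax linear quantization (the paper itself uses block-wise dynamic quantization, which is more refined); the block size of 256 and all names here are illustrative only:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 256  /* illustrative block size */

/* Quantize an FP32 optimizer state to 8 bits, one scale per block.
 * scale[] must hold ceil(n / BLOCK) entries. */
void quantize_state(const float *state, int8_t *q, float *scale, size_t n)
{
    for (size_t b = 0; b * BLOCK < n; b++) {
        size_t lo = b * BLOCK;
        size_t hi = (lo + BLOCK < n) ? lo + BLOCK : n;
        float amax = 0.0f;
        for (size_t i = lo; i < hi; i++)      /* per-block absolute maximum */
            amax = fmaxf(amax, fabsf(state[i]));
        scale[b] = (amax > 0.0f) ? amax / 127.0f : 1.0f;
        for (size_t i = lo; i < hi; i++)      /* map into [-127, 127] */
            q[i] = (int8_t)lrintf(state[i] / scale[b]);
    }
}

/* Dequantize one element on the fly during the optimizer update. */
static inline float dequantize(const int8_t *q, const float *scale, size_t i)
{
    return (float)q[i] * scale[i / BLOCK];
}

Keeping one scale per small block, rather than one per tensor, confines the effect of an outlier to its own block, which is the core of the memory/accuracy trade-off the entry describes.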
- Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [13.83645579871775]
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks.
This work pushes the boundary of quantised training by employing a multilevel approach that utilises multiple precisions.
MuPPET achieves the same accuracy as standard full-precision training with a training-time speedup of up to 1.84$\times$ and an average speedup of 1.58$\times$ across the networks.
arXiv Detail & Related papers (2020-06-16T10:14:36Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M = 3, 5, 7, 9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
- ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning [1.2019888796331233]
Matrix-vector multiplications (MVM) and vector-vector outer products (VVOP) are the two most expensive operations associated with the training of deep neural networks (DNNs).
We introduce efficient stochastic computing (SC) techniques for weight updates in DNNs, supporting the activation functions required by many state-of-the-art networks.
Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations by bit-shift scaling (the basic SC-multiply principle is sketched after this entry).
Hardware design of ESSOP at the 14nm technology node shows that, compared to a highly pipelined FP16 multiplier, ESSOP is 82.2% and 93.7% better in energy and area efficiency, respectively.
arXiv Detail & Related papers (2020-03-25T07:54:42Z)
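As a toy illustration of the stochastic-computing principle behind such architectures (an assumption-laden sketch, not ESSOP's datapath): a value in [0, 1] becomes a Bernoulli bitstream, and multiplication reduces to ANDing two independent streams. STREAM_LEN and the use of rand() are arbitrary choices here.

#include <stdio.h>
#include <stdlib.h>

#define STREAM_LEN 4096  /* longer streams give lower-variance estimates */

/* One stochastic bit: 1 with probability p (unipolar SC encoding). */
static int sc_bit(float p) { return ((float)rand() / (float)RAND_MAX) < p; }

int main(void)
{
    float a = 0.75f, b = 0.40f;  /* example operands in [0, 1] */
    unsigned ones = 0;
    for (int i = 0; i < STREAM_LEN; i++)
        ones += (unsigned)(sc_bit(a) & sc_bit(b));  /* AND of two streams */
    printf("SC estimate of a*b: %f (exact: %f)\n",
           (float)ones / STREAM_LEN, a * b);
    return 0;
}

Because the two Bernoulli streams are independent, the expected fraction of ones in the ANDed stream is exactly a*b.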