RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible
and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration
- URL: http://arxiv.org/abs/2301.03904v2
- Date: Sat, 6 May 2023 18:02:44 GMT
- Title: RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible
and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration
- Authors: Yvan Tortorella, Luca Bertaccini, Luca Benini, Davide Rossi, Francesco
Conti
- Abstract summary: Current training algorithms rely on floating-point matrix operations to meet the precision and dynamic range requirements.
RedMulE is a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops).
RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements.
- Score: 15.869673535117032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing interest in TinyML, i.e., near-sensor machine learning on
power budgets of a few tens of mW, is currently pushing toward enabling
TinyML-class training as opposed to inference only. Current training
algorithms, based on various forms of error and gradient backpropagation, rely
on floating-point matrix operations to meet the precision and dynamic range
requirements. So far, the energy and power cost of these operations has been
considered too high for TinyML scenarios. This paper addresses the open
challenge of near-sensor training on a few mW power budget and presents RedMulE
- Reduced-Precision Matrix Multiplication Engine, a low-power specialized
accelerator conceived for multi-precision floating-point General Matrix-Matrix
Operations (GEMM-Ops) acceleration, supporting FP16, as well as hybrid FP8
formats, with {sign, exponent, mantissa}=({1,4,3}, {1,5,2}). We integrate
RedMulE into a Parallel Ultra-Low-Power (PULP) cluster containing eight
energy-efficient RISC-V cores sharing a tightly-coupled data memory and
implement the resulting system in a 22 nm technology. At its best efficiency
point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755
GFLOPS/W and 920 GFLOPS/W during regular General Matrix-Matrix Multiplication
(GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops,
respectively, for FP16 and FP8 input/output tensors. In its best performance
point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for
FP16 and FP8, respectively, with 99.4% utilization of the array of Computing
Elements and consuming less than 60 mW on average, thus enabling on-device
training of deep learning models in TinyML application scenarios while
retaining the flexibility to tackle other classes of common linear algebra
problems efficiently.
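
As a rough illustration of the two hybrid FP8 encodings named in the abstract, the sketch below decodes {1,4,3} and {1,5,2} bit patterns into real values. The exponent biases (7 and 15) and the subnormal handling follow the common E4M3/E5M2 conventions and are assumptions, not details taken from the paper; NaN/Inf special encodings are ignored.

```python
def decode_fp8(bits, exp_bits, man_bits, bias):
    """Decode an 8-bit pattern with a {1, exp_bits, man_bits} field split."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# {1,4,3} with an assumed bias of 7, {1,5,2} with an assumed bias of 15
print(decode_fp8(0b0_0111_000, exp_bits=4, man_bits=3, bias=7))   # -> 1.0
print(decode_fp8(0b0_1000_010, exp_bits=4, man_bits=3, bias=7))   # -> 2.5
print(decode_fp8(0b0_01111_00, exp_bits=5, man_bits=2, bias=15))  # -> 1.0
```

The {1,5,2} format trades mantissa bits for a wider dynamic range, which is why mixed use of the two formats is attractive for training workloads.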
Related papers
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
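
As a loose illustration of what a lookup-table-quantized matrix multiplication involves, the NumPy sketch below dequantizes group-wise LUT-coded weights and then runs a plain GEMM. The 4-bit code width and the tensor shapes are made up for the example; a fused kernel such as FLUTE performs the lookup inside the matmul rather than materializing the full-precision weight matrix.

```python
import numpy as np

GROUP = 128                          # quantization group size from the summary
K, N = 512, 256                      # hypothetical input/output dimensions
codes = np.random.randint(0, 16, size=(K, N))                  # 4-bit weight codes
tables = np.random.randn(K // GROUP, 16).astype(np.float32)    # one LUT per group

def lut_dequant_matmul(x, codes, tables, group=GROUP):
    """Dequantize LUT-coded weights group by group, then run a plain GEMM."""
    w = np.empty(codes.shape, dtype=np.float32)
    for g in range(codes.shape[0] // group):
        rows = slice(g * group, (g + 1) * group)
        w[rows] = tables[g][codes[rows]]     # a table lookup replaces arithmetic dequant
    return x @ w

x = np.random.randn(32, K).astype(np.float32)   # batch size 32, as in the summary
print(lut_dequant_matmul(x, codes, tables).shape)   # (32, 256)
```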
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
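
For intuition only, the sketch below shows plain 1-bit weight binarization: a per-row scale times the sign pattern. This is the textbook baseline, not BiLLM's actual scheme, which additionally distinguishes salient from non-salient weights to reach its reported accuracy.

```python
import numpy as np

def binarize_rowwise(w):
    """Plain 1-bit weight binarization: per-row scale times the sign pattern."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)   # per-row scaling factor
    return alpha * np.sign(w)

w = np.random.randn(4, 8)
w_bin = binarize_rowwise(w)
print(np.abs(w - w_bin).mean())   # reconstruction error of the 1-bit approximation
```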
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, when training the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
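
The sketch below illustrates the per-tensor scaling that FP8 training recipes generally rely on: rescale a tensor so its absolute maximum fits the FP8 range, compute in low precision, then divide the scale back out. The E4M3 maximum of 448 is an assumption from the common FP8 definition, np.clip merely stands in for a real FP8 rounding step, and this is not FP8-LM's exact scaling policy.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite {1,4,3} magnitude, assuming the usual E4M3 definition

def to_scaled_fp8(x):
    """Per-tensor scaling before an FP8 cast; clipping stands in for real FP8 rounding."""
    scale = FP8_E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    return np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX), scale

def from_scaled_fp8(x_fp8, scale):
    """Undo the per-tensor scaling after the low-precision compute."""
    return x_fp8 / scale

grad = np.random.randn(1024) * 1e-3            # small-magnitude tensor, e.g., gradients
q, s = to_scaled_fp8(grad)
print(np.abs(from_scaled_fp8(q, s) - grad).max())   # ~0 here, since rounding is omitted
```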
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
arXiv Detail & Related papers (2023-05-30T16:14:16Z)
- LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning [2.0625936401496237]
Multi-agent reinforcement learning (MARL) is a powerful technology for building interactive artificial intelligence systems.
We present a real-time sparse training acceleration system named LearningGroup.
Our system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively.
arXiv Detail & Related papers (2022-10-29T15:09:34Z)
- Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
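
The interleaving of computation with manually managed memory transfers described above is essentially double buffering. The sketch below shows only the control flow; dma_fetch and compute are hypothetical stand-ins for an asynchronous DMA copy and a recurrent-block kernel.

```python
def run_pipeline(tiles, dma_fetch, compute):
    """Minimal double-buffering sketch: while one tile is being processed, the
    next one is (conceptually) transferred into the other local buffer."""
    buffers = [dma_fetch(tiles[0]), None]        # prefetch the first tile
    outputs = []
    for i in range(len(tiles)):
        if i + 1 < len(tiles):                   # kick off the next transfer...
            buffers[(i + 1) % 2] = dma_fetch(tiles[i + 1])
        outputs.append(compute(buffers[i % 2]))  # ...while computing on the current tile
        # on a real MCU, a DMA-completion wait would go here before the next iteration
    return outputs

# Toy usage: the "transfer" is a copy and the "compute" is a sum over the tile.
print(run_pipeline([[1, 2], [3, 4], [5, 6]], dma_fetch=list, compute=sum))
```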
arXiv Detail & Related papers (2022-10-14T10:32:05Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
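
A minimal sketch of vector-wise absmax Int8 matrix multiplication is shown below: both operands are quantized with per-row and per-column scales, multiplied in integer arithmetic, and rescaled back to floating point. The mixed-precision outlier decomposition that LLM.int8() adds on top is omitted here.

```python
import numpy as np

def int8_matmul(x, w):
    """Absmax Int8 quantization of both operands, integer matmul, then rescale."""
    sx = np.abs(x).max(axis=1, keepdims=True) / 127.0     # per-row scale of x
    sw = np.abs(w).max(axis=0, keepdims=True) / 127.0     # per-column scale of w
    xq = np.round(x / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)       # integer accumulation
    return acc * sx * sw                                   # dequantize the result

x = np.random.randn(4, 64)
w = np.random.randn(64, 32)
print(np.abs(int8_matmul(x, w) - x @ w).max())             # small quantization error
```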
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
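
The sketch below shows what an N:M sparsity pattern means in practice: in every group of M consecutive weights, only the N largest-magnitude ones are kept. This is generic magnitude-based N:M pruning, not the IDP procedure proposed in the paper.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the last dimension (an N:M structured sparsity mask)."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // m, m)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :m - n]   # smallest (m-n) per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w_sparse = nm_prune(np.random.randn(8, 16))
print((w_sparse.reshape(8, 4, 4) != 0).sum(axis=-1))   # every group of 4 keeps 2 non-zeros
```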
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
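
As a generic illustration of the latent-replay idea, the sketch below stores 8-bit quantized intermediate activations in a bounded buffer and samples them back for later mini-batches. The buffer policy, quantization, and sizes are illustrative assumptions, not the configuration used on the 10-core platform described in the paper.

```python
import random
import numpy as np

class LatentReplayBuffer:
    """Minimal sketch of quantized latent replay: activations taken at an
    intermediate ("latent") layer are stored 8-bit and later mixed into new
    mini-batches, so earlier raw data need not be kept or re-run."""
    def __init__(self, capacity):
        self.capacity, self.items, self.seen = capacity, [], 0

    def add(self, latent, label):
        scale = max(float(np.abs(latent).max()) / 127.0, 1e-12)
        entry = (np.round(latent / scale).astype(np.int8), scale, label)
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(entry)
        else:                                   # reservoir sampling keeps the buffer bounded
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = entry

    def sample(self, k):
        picked = random.sample(self.items, min(k, len(self.items)))
        return [(q.astype(np.float32) * s, y) for q, s, y in picked]

buf = LatentReplayBuffer(capacity=128)
for step in range(1000):
    buf.add(np.random.randn(64), label=step % 10)
replayed = buf.sample(32)                       # mixed into the next training batch
print(len(replayed), replayed[0][0].shape)
```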
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
- MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network with extremely low computational cost.
A family of MicroNets achieves a significant performance gain over the state of the art in the low-FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z)
- Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain-Machine Interfaces [16.381467082472515]
Motor-Imagery Brain-Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines.
Deep learning models have emerged for classifying EEG signals.
These models often exceed the limitations of edge devices due to their memory and computational requirements.
arXiv Detail & Related papers (2020-04-24T12:29:03Z)