MEMA Runtime Framework: Minimizing External Memory Accesses for TinyML
on Microcontrollers
- URL: http://arxiv.org/abs/2304.05544v1
- Date: Wed, 12 Apr 2023 00:27:11 GMT
- Title: MEMA Runtime Framework: Minimizing External Memory Accesses for TinyML
on Microcontrollers
- Authors: Andrew Sabot, Vikas Natesh, H.T. Kung, Wei-Te Ting
- Abstract summary: We present the MEMA framework for efficient inference runtimes that minimize external memory accesses for matrix multiplication on TinyML systems.
We compare the performance of runtimes derived from MEMA to existing state-of-the-art libraries on ARM-based TinyML systems.
- Score: 3.1823074562424756
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present the MEMA framework for the easy and quick derivation of efficient
inference runtimes that minimize external memory accesses for matrix
multiplication on TinyML systems. The framework accounts for hardware resource
constraints and problem sizes in analytically determining optimized schedules
and kernels that minimize memory accesses. MEMA addresses a well-known problem
in current practice: optimal schedules tend to be found only through a
time-consuming, heuristic search of a large scheduling space. We compare the
performance of runtimes derived from MEMA to
existing state-of-the-art libraries on ARM-based TinyML systems. For example,
for neural network benchmarks on the ARM Cortex-M4, we achieve up to a 1.8x
speedup and 44% energy reduction over CMSIS-NN.
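As a rough illustration of the kind of analysis MEMA automates, the sketch below uses hypothetical names and a simplified cost model (not the authors' actual formulation or API): it enumerates matrix-multiplication tile shapes that fit within a given on-chip budget and picks the one that minimizes external element transfers under an output-stationary schedule.

```python
# Hypothetical sketch of analytically choosing matmul tile sizes to minimize
# external memory traffic on a microcontroller. The cost model and function
# names are illustrative assumptions, not MEMA's implementation.

def external_accesses(m: int, n: int, k: int, tm: int, tn: int) -> int:
    """External element transfers for an output-stationary tiling: each
    (tm x tn) block of C stays on chip while the needed rows of A and
    columns of B are streamed in, and C is written back once."""
    blocks = ((m + tm - 1) // tm) * ((n + tn - 1) // tn)
    loads_per_block = tm * k + tn * k          # stream A rows and B columns
    return blocks * loads_per_block + m * n    # plus one store per C element

def best_tiling(m: int, n: int, k: int, on_chip_elems: int):
    """Enumerate tile shapes that fit on chip and return the cheapest one."""
    best = None
    for tm in range(1, m + 1):
        for tn in range(1, n + 1):
            # accumulators + one column of the A tile + one row of the B tile
            if tm * tn + tm + tn > on_chip_elems:
                continue
            cost = external_accesses(m, n, k, tm, tn)
            if best is None or cost < best[0]:
                best = (cost, tm, tn)
    return best

if __name__ == "__main__":
    # e.g. a 64x64x64 layer with room for roughly 96 on-chip elements
    print(best_tiling(64, 64, 64, on_chip_elems=96))
```

Running the example prints the cheapest tile shape for a 64x64x64 multiplication under a roughly 96-element on-chip budget; a real runtime would additionally account for register-file versus SRAM levels, data widths, and kernel overheads.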
Related papers
- MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes < 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- MLonMCU: TinyML Benchmarking with Fast Retargeting [1.4319942396517]
It is non-trivial to choose the optimal combination of frameworks and targets for a given application.
This paper proposes a tool called MLonMCU and demonstrates it by effortlessly benchmarking the state-of-the-art TinyML frameworks TFLite for Microcontrollers and TVM.
arXiv Detail & Related papers (2023-06-15T08:44:35Z)
- Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z)
- MinUn: Accurate ML Inference on Microcontrollers [2.2638536653874195]
Running machine learning inference on tiny devices, known as TinyML, is an emerging research area.
We describe MinUn, the first TinyML framework that holistically addresses the challenges of accurate, memory-efficient inference to generate efficient code for ARM microcontrollers.
arXiv Detail & Related papers (2022-10-29T10:16:12Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- TinyML Platforms Benchmarking [0.0]
Recent advances in ultra-low-power embedded devices for machine learning (ML) have enabled a new class of products.
TinyML provides a unique solution by aggregating and analyzing data at the edge on low-power embedded devices.
Many TinyML frameworks have been developed for different platforms to facilitate the deployment of ML models.
arXiv Detail & Related papers (2021-11-30T15:26:26Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory (a toy sketch of this idea appears after this list).
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
- MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers [18.662026553041937]
Machine learning on resource-constrained microcontrollers (MCUs) promises to drastically expand the application space of the Internet of Things (IoT).
TinyML presents severe technical challenges, as deep neural network inference demands a large compute and memory budget.
Neural architecture search (NAS) promises to help design accurate ML models that meet the tight MCU memory, latency, and energy constraints.
arXiv Detail & Related papers (2020-10-21T19:39:39Z)
- MCUNet: Tiny Deep Learning on IoT Devices [62.752899523628066]
We propose a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine).
TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture within the optimized search space.
TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 4.8x.
arXiv Detail & Related papers (2020-07-20T17:59:01Z)
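As a generic illustration of the patch-by-patch (partial-execution) idea described in the MCUNetV2 and Pex entries above, the toy sketch below (an assumed example, not either paper's implementation) evaluates a 3x3 convolution one output patch at a time, so only a small input patch plus a one-pixel halo needs to be resident at once.

```python
# Assumed toy example of patch-by-patch inference: a 3x3 "valid" convolution
# is computed per output patch, so each step only touches a (patch+2)^2 slice
# of the input instead of the whole activation map.
import numpy as np

def conv3x3_full(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Reference dense 3x3 'valid' convolution over an (H, W) input."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * w)
    return out

def conv3x3_by_patches(x: np.ndarray, w: np.ndarray, patch: int = 4) -> np.ndarray:
    """Same result, computed patch by patch; each call only needs a
    (patch + 2) x (patch + 2) input slice (the patch plus its halo)."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i0 in range(0, H - 2, patch):
        for j0 in range(0, W - 2, patch):
            i1 = min(i0 + patch, H - 2)
            j1 = min(j0 + patch, W - 2)
            x_slice = x[i0:i1 + 2, j0:j1 + 2]      # patch plus halo
            out[i0:i1, j0:j1] = conv3x3_full(x_slice, w)
    return out

if __name__ == "__main__":
    x = np.random.rand(16, 16)
    w = np.random.rand(3, 3)
    assert np.allclose(conv3x3_full(x, w), conv3x3_by_patches(x, w))
```

The patched variant produces identical results while bounding the live input footprint; the halo regions are re-read across neighboring patches, which is the overlap/recomputation overhead that patch-based inference schedulers must manage.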