Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference
- URL: http://arxiv.org/abs/2303.17878v1
- Date: Fri, 31 Mar 2023 08:26:17 GMT
- Title: Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference
- Authors: Rafael Stahl, Daniel Mueller-Gritschneder, Ulf Schlichtmann
- Abstract summary: Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML.
DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage.
We propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs.
- Score: 1.6094180182513644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory optimization for deep neural network (DNN) inference gains high
relevance with the emergence of TinyML, which refers to the deployment of DNN
inference tasks on tiny, low-power microcontrollers. Applications such as audio
keyword detection or radar-based gesture recognition are heavily constrained by
the limited memory on such tiny devices because DNN inference requires large
intermediate run-time buffers to store activations and other intermediate data,
which leads to high memory usage. In this paper, we propose a new Fused
Depthwise Tiling (FDT) method for the memory optimization of DNNs, which,
compared to existing tiling methods, reduces memory usage without inducing any
run time overhead. FDT applies to a larger variety of network layers than
existing tiling methods that focus on convolutions. It improves TinyML memory
optimization significantly by reducing memory of models where this was not
possible before and additionally providing alternative design points for models
that show high run time overhead with existing methods. In order to identify
the best tiling configuration, an end-to-end flow with a new path discovery
method is proposed, which applies FDT and existing tiling methods in a fully
automated way, including the scheduling of the operations and planning of the
layout of buffers in memory. Out of seven evaluated models, FDT achieved
significant memory reduction for two models by 76.2% and 18.1% where existing
tiling methods could not be applied. Two other models showed a significant run
time overhead with existing methods and FDT provided alternative design points
with no overhead but reduced memory savings.
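To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of depthwise-tiled execution of a fused block. The fused sub-network (a depthwise 3x3 convolution followed by ReLU), the tile count, and all shapes are illustrative assumptions; the point is only that tiling along the channel dimension lets each channel group run through the fused block independently, so the peak intermediate buffer shrinks roughly by the tile count with no recomputation and therefore no run-time overhead.

```python
# Minimal sketch (illustrative, not the paper's implementation) of depthwise
# (channel-wise) tiling of a fused block, so the full intermediate activation
# buffer is never materialized at once. The fused sub-network here is an
# assumed example: depthwise 3x3 convolution + ReLU.
import numpy as np

def depthwise_conv3x3(x, w):
    """Per-channel 3x3 convolution with 'same' zero padding.
    x: (H, W, C) activations, w: (3, 3, C) per-channel kernels."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]                # (3, 3, C)
            out[i, j, :] = np.einsum("ijc,ijc->c", patch, w)
    return out

def fused_block(x, w):
    """Fused sub-network applied to one channel tile: depthwise conv + ReLU."""
    return np.maximum(depthwise_conv3x3(x, w), 0.0)

def run_untiled(x, w):
    """Baseline: the whole (H, W, C) intermediate buffer is live at once."""
    return fused_block(x, w)

def run_depthwise_tiled(x, w, num_tiles):
    """FDT-style sketch: process num_tiles channel groups one at a time, so
    only a (H, W, C/num_tiles) intermediate buffer is needed. Per-channel
    layers partition exactly along channels, so nothing is recomputed."""
    H, W, C = x.shape
    step = C // num_tiles
    out = np.empty_like(x)
    for ch_lo in range(0, C, step):
        ch_hi = ch_lo + step
        out[:, :, ch_lo:ch_hi] = fused_block(x[:, :, ch_lo:ch_hi],
                                             w[:, :, ch_lo:ch_hi])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 16, 8)).astype(np.float32)
    w = rng.standard_normal((3, 3, 8)).astype(np.float32)
    assert np.allclose(run_untiled(x, w), run_depthwise_tiled(x, w, num_tiles=4))
    # Peak intermediate buffer: 16*16*8 values untiled vs. 16*16*2 per tile.
```

The sketch only covers layers that partition exactly along channels; deciding where such tiling applies in a full network, and how it interacts with scheduling and buffer layout, is what the paper's automated flow handles.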
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design.
FL fails to work on low-memory AIoT devices due to its heavy memory usage.
We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
- EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization [71.70414291057332]
Test-time adaptation (TTA) may primarily be conducted on edge devices with limited memory.
Long-term adaptation often leads to catastrophic forgetting and error accumulation.
We present lightweight meta networks that can adapt the frozen original networks to the target domain.
arXiv Detail & Related papers (2023-03-03T13:05:30Z)
- Improving Task-free Continual Learning by Distributionally Robust Memory Evolution [9.345559196495746]
Task-free continual learning aims to learn from a non-stationary data stream without explicit task definitions and without forgetting previous knowledge.
Existing methods overlook the high uncertainty in the memory data distribution.
We propose a principled memory evolution framework to dynamically evolve the memory data distribution.
arXiv Detail & Related papers (2022-07-15T02:16:09Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
Parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts.
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model within a limited memory size that adapts to new classes without forgetting old ones.
We show that when the model size is counted into the total budget and methods are compared at an aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Generative Optimization Networks for Memory Efficient Data Generation [11.452816167207937]
We propose a novel framework called generative optimization networks (GON) that is similar to GANs, but does not use a generator.
GONs use a single discriminator network and run optimization in the input space to generate new data samples, achieving an effective compromise between training time and memory consumption.
We show that our framework gives up to 32% higher detection F1 scores and 58% lower memory consumption, with only 5% higher training overheads compared to the state-of-the-art.
arXiv Detail & Related papers (2021-10-06T16:54:33Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data is stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, and with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
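Both MAFAT's approach and the end-to-end flow described in the abstract above revolve around automatically picking a fusing/tiling configuration under a memory budget. The sketch below is a hypothetical, heavily simplified illustration of that kind of search, not either paper's algorithm: all names, the analytical peak-memory model, and the overhead model are made-up assumptions. It simply enumerates candidate tile counts per fused block, predicts peak buffer usage, and keeps the lowest-overhead configuration that fits the budget.

```python
# Hypothetical, simplified sketch of a fusing/tiling configuration search of
# the kind used by memory-aware deployment flows. The memory and overhead
# models below are illustrative toys, not published algorithms.
from dataclasses import dataclass
from itertools import product

@dataclass
class FusedBlock:
    name: str
    in_bytes: int          # size of the block's input activation buffer
    out_bytes: int         # size of the block's output activation buffer
    tile_options: tuple    # candidate tile counts for this block

def predict_peak_bytes(block: FusedBlock, tiles: int) -> int:
    # Toy model: one tile's input and output must be resident simultaneously.
    return block.in_bytes // tiles + block.out_bytes // tiles

def predict_overhead(block: FusedBlock, tiles: int) -> float:
    # Toy model: more tiles means more per-tile scheduling overhead.
    return 0.0 if tiles == 1 else 0.01 * tiles

def search(blocks, budget_bytes):
    """Enumerate per-block tile counts and keep the lowest-overhead
    configuration whose predicted peak memory fits the budget."""
    best = None
    for choice in product(*(b.tile_options for b in blocks)):
        # Blocks execute one after another, so the peak is the largest
        # per-block footprint under this toy model.
        peak = max(predict_peak_bytes(b, t) for b, t in zip(blocks, choice))
        if peak > budget_bytes:
            continue
        overhead = sum(predict_overhead(b, t) for b, t in zip(blocks, choice))
        if best is None or (overhead, peak) < best[:2]:
            best = (overhead, peak, dict(zip((b.name for b in blocks), choice)))
    return best

if __name__ == "__main__":
    blocks = [
        FusedBlock("stem", in_bytes=96_000, out_bytes=96_000, tile_options=(1, 2, 4)),
        FusedBlock("head", in_bytes=24_000, out_bytes=8_000, tile_options=(1, 2)),
    ]
    print(search(blocks, budget_bytes=64_000))
```

Real flows replace the toy predictor with accurate per-operator memory models and additionally plan operation scheduling and buffer placement in memory, which this sketch omits.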