Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural
Network Inference
- URL: http://arxiv.org/abs/2303.17878v1
- Date: Fri, 31 Mar 2023 08:26:17 GMT
- Title: Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural
Network Inference
- Authors: Rafael Stahl, Daniel Mueller-Gritschneder, Ulf Schlichtmann
- Abstract summary: Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML.
DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage.
We propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs.
- Score: 1.6094180182513644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory optimization for deep neural network (DNN) inference gains high
relevance with the emergence of TinyML, which refers to the deployment of DNN
inference tasks on tiny, low-power microcontrollers. Applications such as audio
keyword detection or radar-based gesture recognition are heavily constrained by
the limited memory on such tiny devices because DNN inference requires large
intermediate run-time buffers to store activations and other intermediate data,
which leads to high memory usage. In this paper, we propose a new Fused
Depthwise Tiling (FDT) method for the memory optimization of DNNs which,
compared to existing tiling methods, reduces memory usage without inducing any
run-time overhead. FDT applies to a larger variety of network layers than
existing tiling methods, which focus on convolutions. It improves TinyML memory
optimization significantly by reducing the memory usage of models that existing
methods could not optimize, and by providing alternative design points for
models that incur high run-time overhead with those methods. In order to identify
the best tiling configuration, an end-to-end flow with a new path discovery
method is proposed, which applies FDT and existing tiling methods in a fully
automated way, including the scheduling of the operations and planning of the
layout of buffers in memory. Out of seven evaluated models, FDT achieved
significant memory reductions of 76.2% and 18.1% for two models to which
existing tiling methods could not be applied. Two other models showed
significant run-time overhead with existing methods, and FDT provided
alternative design points with no overhead but smaller memory savings.
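The abstract does not include code, but the core idea behind fused depthwise tiling can be illustrated with a minimal NumPy sketch: produce an intermediate activation in channel (depth) slices and consume each slice immediately in the next layer, so the full intermediate buffer never has to be alive at once. The toy pipeline (1x1 pointwise conv, ReLU, depthwise 3x3 conv), the shapes, the TILE size, and the helper names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy shapes and a toy 1x1 conv -> ReLU -> depthwise 3x3 conv pipeline
# (illustrative assumptions, not the layer mix evaluated in the paper).
H, W, C_IN, C_MID = 32, 32, 8, 64
TILE = 8  # intermediate channels processed per fused tile

rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C_IN), dtype=np.float32)
w_pw = rng.standard_normal((C_IN, C_MID), dtype=np.float32)   # 1x1 (pointwise) conv
w_dw = rng.standard_normal((3, 3, C_MID), dtype=np.float32)   # depthwise 3x3 conv

def pointwise(a, w):
    # A 1x1 convolution is a per-pixel matrix multiply over channels.
    return a @ w

def depthwise3x3(a, w):
    # Naive depthwise 3x3 convolution with zero padding; each channel is
    # filtered independently, which is what makes channel tiling exact.
    p = np.pad(a, ((1, 1), (1, 1), (0, 0)))
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = np.einsum("klc,klc->c", p[i:i + 3, j:j + 3], w)
    return out

# Unfused baseline: the full (H, W, C_MID) intermediate buffer is alive at once.
ref = depthwise3x3(np.maximum(pointwise(x, w_pw), 0.0), w_dw)

# Fused depthwise tiling: produce the intermediate in slices of TILE channels
# and consume each slice immediately, so only an (H, W, TILE) buffer is ever
# alive. Channel slices do not overlap, so nothing is recomputed across tiles.
out = np.empty((H, W, C_MID), dtype=np.float32)
for c0 in range(0, C_MID, TILE):
    c1 = c0 + TILE
    mid = np.maximum(pointwise(x, w_pw[:, c0:c1]), 0.0)        # (H, W, TILE)
    out[:, :, c0:c1] = depthwise3x3(mid, w_dw[:, :, c0:c1])

assert np.allclose(out, ref, atol=1e-4)
print("peak intermediate elements:", H * W * C_MID, "->", H * W * TILE)
```

In this toy setting the tiled result is exact because the ReLU and the depthwise convolution act on each intermediate channel independently, and, unlike spatial tiling with overlapping halos, no work is recomputed across tiles, which is consistent with the abstract's claim of no run-time overhead.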
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose Sparse Gradient Compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
- Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory [0.8321953606016751]
We introduce memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks.
We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead.
arXiv Detail & Related papers (2024-12-16T14:23:31Z)
- COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [11.655821671462427]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance.
For LLaMA-1B, it reduces memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW.
With 8-bit quantization, COAP cuts memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using only a minimal number of late pre-trained layers alleviates the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design.
However, FL fails to work on low-memory AIoT devices due to its heavy memory usage.
We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
- EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization [71.70414291057332]
Test-time adaptation (TTA) may primarily be conducted on edge devices with limited memory.
Long-term adaptation often leads to catastrophic forgetting and error accumulation.
We present lightweight meta networks that can adapt the frozen original networks to the target domain.
arXiv Detail & Related papers (2023-03-03T13:05:30Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements more substantially than prior PETL techniques.
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model within a limited memory budget.
We show that when counting the model size toward the total budget and comparing methods at an aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory (a minimal sketch of this kind of spatial tiling follows after this list).
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
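For contrast with the channel-dimension tiling sketched above, the following is a minimal sketch of the fused spatial (patch- or strip-based) tiling used by existing methods such as MCUNetV2 and MAFAT. The shapes, strip size, and helper names are illustrative assumptions rather than either paper's code; the sketch makes explicit the recomputation of overlapping rows, which is the source of the run-time overhead mentioned in the main abstract.

```python
import numpy as np

# Two fused 3x3 convolutions, evaluated strip by strip so that only a small
# slice of the first layer's output is ever materialized.
# Shapes and helper names are illustrative assumptions.
H, W, C_IN, C_MID, C_OUT, STRIP = 24, 24, 4, 32, 8, 6

rng = np.random.default_rng(1)
x = rng.standard_normal((H, W, C_IN), dtype=np.float32)
w1 = rng.standard_normal((3, 3, C_IN, C_MID), dtype=np.float32)
w2 = rng.standard_normal((3, 3, C_MID, C_OUT), dtype=np.float32)

def conv3x3_valid(a, w):
    # Naive 3x3 convolution without padding ('valid').
    ho, wo = a.shape[0] - 2, a.shape[1] - 2
    out = np.empty((ho, wo, w.shape[-1]), dtype=a.dtype)
    for i in range(ho):
        for j in range(wo):
            out[i, j] = np.einsum("klc,klcd->d", a[i:i + 3, j:j + 3], w)
    return out

xp = np.pad(x, ((2, 2), (2, 2), (0, 0)))  # pad once; both layers run as 'valid'

# Unfused baseline: the full (H+2, W+2, C_MID) intermediate is alive at once.
ref = conv3x3_valid(conv3x3_valid(xp, w1), w2)

# Fused spatial tiling: for each strip of STRIP output rows, recompute just the
# (STRIP+2) intermediate rows it needs, so the peak intermediate buffer shrinks
# from (H+2) rows to (STRIP+2) rows at the cost of recomputing overlapping rows.
out = np.empty((H, W, C_OUT), dtype=np.float32)
for r0 in range(0, H, STRIP):
    r1 = min(r0 + STRIP, H)
    mid_strip = conv3x3_valid(xp[r0:r1 + 4], w1)   # (r1-r0+2, W+2, C_MID)
    out[r0:r1] = conv3x3_valid(mid_strip, w2)

assert np.allclose(out, ref, atol=1e-4)
```

Shrinking STRIP lowers the peak intermediate buffer further, but it increases the fraction of rows recomputed at strip borders, which is the memory-versus-run-time trade-off that FDT's channel-dimension tiling avoids.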