Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural
Network Inference
- URL: http://arxiv.org/abs/2303.17878v1
- Date: Fri, 31 Mar 2023 08:26:17 GMT
- Title: Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural
Network Inference
- Authors: Rafael Stahl, Daniel Mueller-Gritschneder, Ulf Schlichtmann
- Abstract summary: Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML.
DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage.
We propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs.
- Score: 1.6094180182513644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory optimization for deep neural network (DNN) inference gains high
relevance with the emergence of TinyML, which refers to the deployment of DNN
inference tasks on tiny, low-power microcontrollers. Applications such as audio
keyword detection or radar-based gesture recognition are heavily constrained by
the limited memory on such tiny devices because DNN inference requires large
intermediate run-time buffers to store activations and other intermediate data,
which leads to high memory usage. In this paper, we propose a new Fused
Depthwise Tiling (FDT) method for the memory optimization of DNNs which,
compared to existing tiling methods, reduces memory usage without inducing any
run-time overhead. FDT applies to a larger variety of network layers than
existing tiling methods, which focus on convolutions. It improves TinyML memory
optimization significantly by reducing the memory usage of models that existing
methods could not optimize, and by providing alternative design points for
models that incur high run-time overhead with those methods. In order to identify
the best tiling configuration, an end-to-end flow with a new path discovery
method is proposed, which applies FDT and existing tiling methods in a fully
automated way, including the scheduling of the operations and planning of the
layout of buffers in memory. Out of seven evaluated models, FDT achieved
significant memory reductions of 76.2% and 18.1% for two models to which
existing tiling methods could not be applied. Two other models showed
significant run-time overhead with existing methods, and FDT provided
alternative design points with no overhead but smaller memory savings.
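The abstract does not include code, but the core idea behind fused depthwise tiling can be illustrated with a minimal NumPy sketch: produce an intermediate activation in channel (depth) slices and consume each slice immediately in the next layer, so the full intermediate buffer never has to be alive at once. The toy pipeline (1x1 pointwise conv, ReLU, depthwise 3x3 conv), the shapes, the TILE size, and the helper names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy shapes and a toy 1x1 conv -> ReLU -> depthwise 3x3 conv pipeline
# (illustrative assumptions, not the layer mix evaluated in the paper).
H, W, C_IN, C_MID = 32, 32, 8, 64
TILE = 8  # intermediate channels processed per fused tile

rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C_IN), dtype=np.float32)
w_pw = rng.standard_normal((C_IN, C_MID), dtype=np.float32)   # 1x1 (pointwise) conv
w_dw = rng.standard_normal((3, 3, C_MID), dtype=np.float32)   # depthwise 3x3 conv

def pointwise(a, w):
    # A 1x1 convolution is a per-pixel matrix multiply over channels.
    return a @ w

def depthwise3x3(a, w):
    # Naive depthwise 3x3 convolution with zero padding; each channel is
    # filtered independently, which is what makes channel tiling exact.
    p = np.pad(a, ((1, 1), (1, 1), (0, 0)))
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = np.einsum("klc,klc->c", p[i:i + 3, j:j + 3], w)
    return out

# Unfused baseline: the full (H, W, C_MID) intermediate buffer is alive at once.
ref = depthwise3x3(np.maximum(pointwise(x, w_pw), 0.0), w_dw)

# Fused depthwise tiling: produce the intermediate in slices of TILE channels
# and consume each slice immediately, so only an (H, W, TILE) buffer is ever
# alive. Channel slices do not overlap, so nothing is recomputed across tiles.
out = np.empty((H, W, C_MID), dtype=np.float32)
for c0 in range(0, C_MID, TILE):
    c1 = c0 + TILE
    mid = np.maximum(pointwise(x, w_pw[:, c0:c1]), 0.0)        # (H, W, TILE)
    out[:, :, c0:c1] = depthwise3x3(mid, w_dw[:, :, c0:c1])

assert np.allclose(out, ref, atol=1e-4)
print("peak intermediate elements:", H * W * C_MID, "->", H * W * TILE)
```

In this toy setting the tiled result is exact because the ReLU and the depthwise convolution act on each intermediate channel independently, and, unlike spatial tiling with overlapping halos, no work is recomputed across tiles, which is consistent with the abstract's claim of no run-time overhead.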
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose Sparse Gradient Compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
- Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory [0.8321953606016751]
We introduce memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks.
We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead.
arXiv Detail & Related papers (2024-12-16T14:23:31Z)
- COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [11.655821671462427]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance.
For LLaMA-1B, it reduces memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW.
With 8-bit quantization, COAP cuts memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using only a minimal number of late pre-trained layers alleviates the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design.
However, FL fails to work on low-memory AIoT devices due to its heavy memory usage.
We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
- EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization [71.70414291057332]
Test-time adaptation (TTA) may primarily be conducted on edge devices with limited memory.
Long-term adaptation often leads to catastrophic forgetting and error accumulation.
We present lightweight meta networks that can adapt the frozen original networks to the target domain.
arXiv Detail & Related papers (2023-03-03T13:05:30Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements more substantially than prior PETL techniques.
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model within a limited memory budget.
We show that when counting the model size toward the total budget and comparing methods at an aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory (a minimal sketch of this kind of spatial tiling follows after this list).
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
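For contrast with the channel-dimension tiling sketched above, the following is a minimal sketch of the fused spatial (patch- or strip-based) tiling used by existing methods such as MCUNetV2 and MAFAT. The shapes, strip size, and helper names are illustrative assumptions rather than either paper's code; the sketch makes explicit the recomputation of overlapping rows, which is the source of the run-time overhead mentioned in the main abstract.

```python
import numpy as np

# Two fused 3x3 convolutions, evaluated strip by strip so that only a small
# slice of the first layer's output is ever materialized.
# Shapes and helper names are illustrative assumptions.
H, W, C_IN, C_MID, C_OUT, STRIP = 24, 24, 4, 32, 8, 6

rng = np.random.default_rng(1)
x = rng.standard_normal((H, W, C_IN), dtype=np.float32)
w1 = rng.standard_normal((3, 3, C_IN, C_MID), dtype=np.float32)
w2 = rng.standard_normal((3, 3, C_MID, C_OUT), dtype=np.float32)

def conv3x3_valid(a, w):
    # Naive 3x3 convolution without padding ('valid').
    ho, wo = a.shape[0] - 2, a.shape[1] - 2
    out = np.empty((ho, wo, w.shape[-1]), dtype=a.dtype)
    for i in range(ho):
        for j in range(wo):
            out[i, j] = np.einsum("klc,klcd->d", a[i:i + 3, j:j + 3], w)
    return out

xp = np.pad(x, ((2, 2), (2, 2), (0, 0)))  # pad once; both layers run as 'valid'

# Unfused baseline: the full (H+2, W+2, C_MID) intermediate is alive at once.
ref = conv3x3_valid(conv3x3_valid(xp, w1), w2)

# Fused spatial tiling: for each strip of STRIP output rows, recompute just the
# (STRIP+2) intermediate rows it needs, so the peak intermediate buffer shrinks
# from (H+2) rows to (STRIP+2) rows at the cost of recomputing overlapping rows.
out = np.empty((H, W, C_OUT), dtype=np.float32)
for r0 in range(0, H, STRIP):
    r1 = min(r0 + STRIP, H)
    mid_strip = conv3x3_valid(xp[r0:r1 + 4], w1)   # (r1-r0+2, W+2, C_MID)
    out[r0:r1] = conv3x3_valid(mid_strip, w2)

assert np.allclose(out, ref, atol=1e-4)
```

Shrinking STRIP lowers the peak intermediate buffer further, but it increases the fraction of rows recomputed at strip borders, which is the memory-versus-run-time trade-off that FDT's channel-dimension tiling avoids.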