Deep Compression for PyTorch Model Deployment on Microcontrollers
- URL: http://arxiv.org/abs/2103.15972v1
- Date: Mon, 29 Mar 2021 22:08:44 GMT
- Title: Deep Compression for PyTorch Model Deployment on Microcontrollers
- Authors: Eren Dogan, H. Fatih Ugurdag, Hasan Unlu
- Abstract summary: This paper adds model compression, specifically Deep Compression, to Unlu's earlier work on arXiv.
In the case of the LeNet-5 model, the memory footprint was reduced by 12.45x, and the inference speed was boosted by 2.57x.
- Score: 0.2578242050187029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network deployment on low-cost embedded systems, hence on
microcontrollers (MCUs), has recently been attracting more attention than ever.
Since MCUs have limited memory capacity as well as limited compute-speed, it is
critical that we employ model compression, which reduces both memory and
compute-speed requirements. In this paper, we add model compression,
specifically Deep Compression, and further optimize Unlu's earlier work on
arXiv, which efficiently deploys PyTorch models on MCUs. First, we prune the
weights in convolutional and fully connected layers. Secondly, the remaining
weights and activations are quantized to 8-bit integers from 32-bit
floating-point. Finally, forward pass functions are compressed using special
data structures for sparse matrices, which store only nonzero weights (without
impacting performance and accuracy). In the case of the LeNet-5 model, the
memory footprint was reduced by 12.45x, and the inference speed was boosted by
2.57x.
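The three steps named in the abstract (pruning, 8-bit quantization, sparse storage of nonzero weights) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the magnitude threshold, the symmetric per-tensor scale, and the CSR layout are all assumptions, and the input vector is treated as already quantized to integers.

```python
# Hypothetical sketch: magnitude pruning, symmetric int8 quantization,
# and CSR storage. None of these names or choices come from the paper.

def prune(weights, threshold):
    """Zero out weights whose magnitude is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def quantize_int8(weights):
    """Symmetric linear quantization; returns (integer matrix, scale)."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale

def to_csr(matrix):
    """Keep only nonzero entries in compressed sparse row form."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x, scale):
    """Compute scale * (W_q @ x), visiting only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0  # integer accumulator
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc * scale)  # dequantize once per output
    return y

weights = [[0.5, -0.01, 0.0, 1.27],
           [0.02, 0.0, -0.64, 0.0]]
pruned = prune(weights, threshold=0.05)
q, scale = quantize_int8(pruned)
values, col_idx, row_ptr = to_csr(q)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3, 4], scale)
```

In this toy case only three of the eight weights are stored, which is where the memory saving comes from. A real deployment would also need to account for the index arrays when measuring the footprint.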
Related papers
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images.
As these models grow larger, they require significantly more memory and suffer from higher latency.
In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices.
We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
- NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks [30.224822087562163]
NeuZip is a new weight compression scheme based on the entropy of floating-point numbers in neural networks.
We significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB.
In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance.
arXiv Detail & Related papers (2024-10-28T01:12:20Z)
- Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z)
- "Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach [49.744093838327615]
We provide a novel compression approach to wide and fully-connected deep neural nets.
Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme.
arXiv Detail & Related papers (2024-03-01T03:46:28Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2 MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
- Efficient Neural Network Deployment for Microcontroller [0.0]
This paper explores and generalizes convolutional neural network deployment for microcontrollers.
The memory savings and performance are compared with the CMSIS-NN framework developed for ARM Cortex-M CPUs.
The final goal is to develop a tool that consumes a PyTorch model with trained network weights and turns it into an optimized C/C++ inference engine for microcontrollers with low memory (kilobyte level) and limited compute capability.
arXiv Detail & Related papers (2020-07-02T19:21:05Z)