Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks
- URL: http://arxiv.org/abs/2408.03663v2
- Date: Mon, 14 Oct 2024 04:47:45 GMT
- Title: Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks
- Authors: Jaewook Lee, Yoel Park, Seulki Lee,
- Abstract summary: We introduce a memory-efficient CNN (convolutional neural network) for on-device vision tasks.
The proposed network classifies ImageNet with extremely low memory (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58%)
To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks.
- Score: 2.9835839258066015
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we introduce a memory-efficient CNN (convolutional neural network), which enables resource-constrained low-end embedded and IoT devices to perform on-device vision tasks, such as image classification and object detection, using extremely low memory, i.e., only 63 KB on ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit the limited KB memory of the low-end device. First, 'input segmentation' divides an input image into a set of patches, including the central patch overlapped with the others, reducing the size (and memory requirement) of a large input image. Second, 'patch tunneling' builds independent tunnel-like paths consisting of multiple bottleneck blocks per patch, penetrating through the entire model from an input patch to the last layer of the network, maintaining lightweight memory usage throughout the whole network. Lastly, 'bottleneck reordering' rearranges the execution order of convolution operations inside the bottleneck block such that the memory usage remains constant regardless of the size of the convolution output channels. The experiment result shows that the proposed network classifies ImageNet with extremely low memory (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58\%). To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks, i.e., up to 89x and 3.1x smaller than MobileNet (i.e., 5.6 MB) and MCUNet (i.e., 196 KB), respectively.
Related papers
- Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network, or REMN, for studying semi-supervised video object segmentation (VOS)
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $mathcalJ&F$ score of 86.3% and on YouTube-VOS 2018, with a $mathcalG$ over mean of 85.5%.
arXiv Detail & Related papers (2023-04-24T06:19:21Z) - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z) - Condensation-Net: Memory-Efficient Network Architecture with
Cross-Channel Pooling Layers and Virtual Feature Maps [28.992851280809205]
We propose an algorithm to process a specific network architecture (Condensation-Net) without increasing the maximum memory storage for feature maps.
Cross-channel pooling can improve the accuracy of object detection tasks, such as face detection.
The overhead to support the cross-channel pooling with the proposed hardware architecture is negligible small.
arXiv Detail & Related papers (2021-04-29T05:44:02Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we presentGNN that optimize the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - MoViNets: Mobile Video Networks for Efficient Video Recognition [52.49314494202433]
3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets.
We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs.
arXiv Detail & Related papers (2021-03-21T23:06:38Z) - Bottleneck Transformers for Visual Recognition [97.16013761605254]
We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
arXiv Detail & Related papers (2021-01-27T18:55:27Z) - Out-of-core Training for Extremely Large-Scale Neural Networks With
Adaptive Window-Based Scheduling [4.903820815918411]
We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than allotted GPU memory.
We apply virtual addressing technique, commonly performed in OS, to training of neural networks with out-of-core execution.
We successfully train ResNet-50 with 1440 batch-size with keeping training speed at 55%, which is 7.5x larger than the upper bound of physical memory.
arXiv Detail & Related papers (2020-10-27T07:40:04Z) - Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound to 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine result as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z) - Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model"
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which can generalize well both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z) - Quantitative Analysis of Image Classification Techniques for
Memory-Constrained Devices [0.7373617024876725]
Convolutional Neural Networks, or CNNs, are the state of the art for image classification, but typically come at the cost of a large memory footprint.
In this paper, we compare CNNs with ProtoNN, Bonsai and FastGRNN when applied to 3-channel image classification using CIFAR-10.
We show that Direct Convolution CNNs perform best for all chosen budgets, with a top performance of 65.7% accuracy at a memory footprint of 58.23KB.
arXiv Detail & Related papers (2020-05-11T09:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.