MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference
- URL: http://arxiv.org/abs/2107.06960v2
- Date: Tue, 18 Jul 2023 20:41:53 GMT
- Title: MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference
- Authors: Jackson Farley, Andreas Gerstlauer
- Abstract summary: Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints.
- Score: 1.7894377200944507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A rising research challenge is running costly machine learning (ML) networks
locally on resource-constrained edge devices. ML networks with large
convolutional layers can easily exceed available memory, increasing latency due
to excessive OS swapping. Previous memory reduction techniques such as pruning
and quantization reduce model accuracy and often require retraining.
Alternatively, distributed methods partition the convolutions into equivalent
smaller sub-computations, but the implementations introduce communication costs
and require a network of devices. Distributed partitioning approaches can,
however, also be used to run in a reduced memory footprint on a single device
by subdividing the network into smaller operations. In this paper, we extend
prior work on distributed partitioning into a memory-aware execution on a
single device. Our approach extends prior fusing strategies to allow for
multiple groups of convolutional layers that are fused and tiled independently.
This enables trading off overhead versus data reuse in order to specifically
reduce memory footprint. We propose a memory usage predictor coupled with a
search algorithm to provide optimized fusing and tiling configurations for an
arbitrary set of convolutional layers. When applied to the YOLOv2 object
detection network, our approach runs in less than half the memory, with a
speedup of up to 2.78x under severe memory constraints.
Additionally, our algorithm will return a configuration with a latency that is
within 6% of the best latency measured in a manual search.
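To make the fused-tiling idea concrete, here is a minimal sketch, assuming PyTorch: two stacked 3x3 convolutions executed one horizontal stripe at a time, with a toy footprint predictor and a toy search that picks the largest tile fitting a memory budget. The function names, shapes, predictor, and search below are illustrative stand-ins, not the paper's implementation.

```python
# A minimal sketch of memory-aware fused tiling, assuming PyTorch. The layer
# shapes, the footprint model, and the exhaustive search below are illustrative
# stand-ins for the paper's predictor and search algorithm.
import torch
import torch.nn.functional as F

def fused_tiled_conv2d(x, w1, w2, tile_rows):
    """Run two stacked 3x3 convolutions one horizontal stripe at a time.

    Fusing means each stripe flows through both layers before the next stripe
    starts, so only one stripe-sized intermediate buffer is live at a time.
    """
    xp = F.pad(x, (2, 2, 2, 2))              # pad once; both convs use padding=0
    H = x.shape[2]
    stripes = []
    for r0 in range(0, H, tile_rows):
        r1 = min(r0 + tile_rows, H)
        blk = xp[:, :, r0:r1 + 4, :]         # output rows r0..r1 need a 4-row halo
        stripes.append(F.conv2d(F.conv2d(blk, w1), w2))
    return torch.cat(stripes, dim=2)

def predicted_peak_elems(W, c_in, c_mid, tile_rows):
    """Toy memory predictor: input halo stripe plus fused intermediate stripe."""
    return (tile_rows + 4) * (W + 4) * c_in + (tile_rows + 2) * (W + 2) * c_mid

# Toy search: the largest tile (least halo re-computation) that fits the budget.
H, W, c_in, c_mid, c_out = 64, 64, 16, 32, 16
budget = 150_000                             # elements, illustrative
best = max(t for t in range(4, H + 1, 4)
           if predicted_peak_elems(W, c_in, c_mid, t) <= budget)

x = torch.randn(1, c_in, H, W)
w1 = torch.randn(c_mid, c_in, 3, 3)
w2 = torch.randn(c_out, c_mid, 3, 3)
y_tiled = fused_tiled_conv2d(x, w1, w2, best)
y_full = F.conv2d(F.conv2d(x, w1, padding=1), w2, padding=1)
assert torch.allclose(y_tiled, y_full, atol=1e-3)
```

Larger tiles recompute less halo but hold bigger buffers; the predictor-plus-search loop is what lets the approach pick that trade-off automatically instead of by manual tuning.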
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative memory-efficient transfer learning (METL) strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers reduces the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference [1.6094180182513644]
Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML.
DNN inference requires large run-time buffers to store intermediate activations and other data, which leads to high memory usage.
We propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs.
arXiv Detail & Related papers (2023-03-31T08:26:17Z)
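A rough illustration of tiling along the channel (depthwise) dimension rather than the spatial one, assuming PyTorch; FDT's actual fusing rules and buffer planning are more involved than this sketch.

```python
# A minimal sketch of depthwise (channel-wise) tiling, assuming PyTorch.
# Channel-independent operators let each channel group flow through several
# fused layers with group-sized buffers instead of full-tensor buffers.
import torch
import torch.nn.functional as F

def fused_channel_tiled(x, w1, w2, groups):
    """Two stacked depthwise 3x3 convolutions, one channel group at a time."""
    step = x.shape[1] // groups
    outs = []
    for c0 in range(0, x.shape[1], step):
        xs = x[:, c0:c0 + step]
        t = F.conv2d(xs, w1[c0:c0 + step], padding=1, groups=step)
        outs.append(F.conv2d(t, w2[c0:c0 + step], padding=1, groups=step))
    return torch.cat(outs, dim=1)

x = torch.randn(1, 32, 56, 56)
w1 = torch.randn(32, 1, 3, 3)                # depthwise: one filter per channel
w2 = torch.randn(32, 1, 3, 3)
y_tiled = fused_channel_tiled(x, w1, w2, groups=4)
y_full = F.conv2d(F.conv2d(x, w1, padding=1, groups=32), w2, padding=1, groups=32)
assert torch.allclose(y_tiled, y_full, atol=1e-3)
```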
- Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings [113.38884267189871]
Training and inference on edge devices often requires an efficient setup due to computational limitations.
Pre-computing data representations and caching them on a server can mitigate extensive edge device computation.
We propose a simple yet effective approach that uses random hyperplane projections.
We show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94%--99% of their floating-point performance.
arXiv Detail & Related papers (2023-03-13T10:53:00Z)
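The core trick is easy to sketch: project each embedding onto random hyperplanes and keep only the signs. A minimal NumPy sketch, with illustrative dimensions:

```python
# A minimal sketch of hashed random projections for embedding quantization,
# assuming NumPy. Dimensions and the bit-packing step are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, k = 768, 1024                             # embedding dim, number of hyperplanes
R = rng.standard_normal((d, k))

def quantize(emb):
    """One bit per hyperplane: which side of it the embedding falls on."""
    bits = (emb @ R) > 0
    return np.packbits(bits, axis=-1)        # 1024 bits -> 128 bytes per vector

emb = rng.standard_normal((4, d))            # stand-in sentence embeddings
codes = quantize(emb)                        # pre-compute and cache server-side
print(codes.shape)                           # (4, 128): 24x smaller than float32

# Downstream classifiers train on the unpacked bits as features.
features = np.unpackbits(codes, axis=-1).astype(np.float32)
```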
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
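A hedged sketch of the patch-by-patch idea, assuming PyTorch: the early, high-resolution stage runs one small spatial patch (plus halo) at a time, so its full-size intermediate map is never resident. The patch size and two-conv stage are illustrative; MCUNetV2's scheduler and redundancy reduction are not reproduced.

```python
# A minimal sketch of patch-by-patch inference for the memory-heavy initial
# stage of a CNN, assuming PyTorch.
import torch
import torch.nn.functional as F

def patch_inference(x, w1, w2, patch=16):
    """Evaluate two fused 3x3 convolutions one spatial patch at a time."""
    xp = F.pad(x, (2, 2, 2, 2))              # halo: two 3x3 convs shrink by 4
    H, W = x.shape[2], x.shape[3]
    out = torch.empty(x.shape[0], w2.shape[0], H, W)
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            blk = xp[:, :, r:min(r + patch, H) + 4, c:min(c + patch, W) + 4]
            out[:, :, r:r + patch, c:c + patch] = F.conv2d(F.conv2d(blk, w1), w2)
    return out

x = torch.randn(1, 3, 224, 224)              # high-res input: early layers dominate
w1, w2 = torch.randn(16, 3, 3, 3), torch.randn(16, 16, 3, 3)
y = patch_inference(x, w1, w2)
y_full = F.conv2d(F.conv2d(x, w1, padding=1), w2, padding=1)
assert torch.allclose(y, y_full, atol=1e-3)
# Peak live activation falls from a full 16x224x224 intermediate map to one
# patch-sized buffer, at the cost of recomputing the overlapping halos.
```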
- Generative Optimization Networks for Memory Efficient Data Generation [11.452816167207937]
We propose a novel framework called generative optimization networks (GON) that is similar to GANs, but does not use a generator.
GONs use a single discriminator network and run optimization in the input space to generate new data samples, achieving an effective compromise between training time and memory consumption.
We show that our framework gives up to 32% higher detection F1 scores and 58% lower memory consumption, with only 5% higher training overhead compared to the state-of-the-art.
arXiv Detail & Related papers (2021-10-06T16:54:33Z)
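A hedged sketch of the generator-free idea, assuming PyTorch: keep only a discriminator and gradient-ascend random inputs toward higher realness scores. The architecture and hyperparameters are illustrative.

```python
# A minimal sketch of a generative optimization network (GON) step, assuming
# PyTorch: gradient ascent in input space replaces a generator network.
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
for p in D.parameters():
    p.requires_grad_(False)                  # only the inputs are optimized here

def generate(n, steps=100, lr=0.1):
    """Generate samples by optimizing random inputs to maximize D's score."""
    z = torch.randn(n, 32, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-D(z).mean()).backward()            # ascend the discriminator's score
        opt.step()
    return z.detach()

samples = generate(8)
# Only D's weights and one batch of inputs are ever resident; dropping the
# generator network is where the memory saving comes from.
```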
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Group Fisher Pruning for Practical Network Compression [58.25776612812883]
We present a general channel pruning approach that can be applied to various complicated structures.
We derive a unified metric based on Fisher information to evaluate the importance of a single channel and coupled channels.
Our method can be used to prune any structures including those with coupled channels.
arXiv Detail & Related papers (2021-08-02T08:21:44Z)
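A hedged sketch of Fisher-based channel importance, assuming PyTorch. The squared activation-times-gradient score below is a common first-order empirical Fisher approximation; the paper's unified metric for coupled channels is not reproduced.

```python
# A minimal sketch of Fisher-information channel importance, assuming PyTorch.
# The model, data, and loss are illustrative stand-ins.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(16, 3, 32, 32)

a = conv(x)
a.retain_grad()                              # keep the gradient of the activation
loss = a.relu().sum()                        # stand-in for a task loss
loss.backward()

# Per-channel score: square the per-sample spatial sum of activation * gradient,
# then accumulate over the batch (an empirical Fisher estimate).
importance = (a * a.grad).sum(dim=(2, 3)).pow(2).sum(dim=0)
prune = importance.argsort()[:2]             # the two least important channels
print("channels to prune:", prune.tolist())
```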
- Improving Memory Utilization in Convolutional Neural Network Accelerators [16.340620299847384]
We propose a mapping method that allows activation layers to overlap and thus utilize the memory more efficiently.
Experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activation memory by up to 32.9%.
For higher resolution de-noising networks, we achieve activation memory savings of 48.8%.
arXiv Detail & Related papers (2020-07-20T09:34:36Z)
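The idea of overlapping activation buffers can be sketched with a greedy lifetime-aware allocator in plain Python; the toy sizes and first-fit-style offset bumping are illustrative, not the paper's mapping method for accelerator memories.

```python
# A minimal sketch of overlapping activation buffers by lifetime, in plain
# Python. Tensors whose lifetimes never intersect may share addresses.

# Each activation: (name, size_bytes, first_use_step, last_use_step).
tensors = [("in", 600, 0, 1), ("act1", 900, 1, 2),
           ("act2", 400, 2, 3), ("out", 300, 3, 4)]

def lifetimes_clash(a, b):
    """Two tensors need disjoint memory only if their lifetimes intersect."""
    return not (a[3] < b[2] or b[3] < a[2])

placed = []                                  # (tensor, byte offset)
for t in sorted(tensors, key=lambda t: -t[1]):
    offset = 0
    for p, off in sorted(placed, key=lambda q: q[1]):
        addr_clash = not (offset + t[1] <= off or off + p[1] <= offset)
        if lifetimes_clash(t, p) and addr_clash:
            offset = off + p[1]              # bump past the conflicting buffer
    placed.append((t, offset))

peak = max(off + t[1] for t, off in placed)
print(f"overlapped pool: {peak} B vs. naive sum: {sum(t[1] for t in tensors)} B")
```

On this toy lifetime pattern the overlapped pool needs 1500 B where naive per-tensor allocation needs 2200 B, since "in"/"act2" and "act1"/"out" never coexist.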
- Splitting Convolutional Neural Network Structures for Efficient Inference [11.031841470875571]
A new technique is proposed to split the network structure into small parts that consume less memory than the original network.
The split approach has been tested on two well-known network structures, VGG16 and ResNet18, for the classification of CIFAR10 images.
arXiv Detail & Related papers (2020-02-09T06:53:18Z)
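One simple way to split a layer is by output channels, computing each slice in turn so only a fraction of the weights and output is live at once. A hedged PyTorch sketch; the paper's actual split strategies for VGG16/ResNet18 are not reproduced.

```python
# A minimal sketch of splitting a convolution into output-channel parts,
# assuming PyTorch. In a real deployment each part's output would be spilled
# (e.g., to storage) before the next part runs.
import torch
import torch.nn.functional as F

def split_conv2d(x, w, b, parts):
    """Compute a convolution in `parts` output-channel slices, one at a time."""
    outs = []
    for ws, bs in zip(w.chunk(parts, dim=0), b.chunk(parts, dim=0)):
        outs.append(F.conv2d(x, ws, bs, padding=1))
    return torch.cat(outs, dim=1)

x = torch.randn(1, 64, 32, 32)
w, b = torch.randn(256, 64, 3, 3), torch.randn(256)
y_split = split_conv2d(x, w, b, parts=4)
y_full = F.conv2d(x, w, b, padding=1)
assert torch.allclose(y_split, y_full, atol=1e-3)
# Each part touches only a quarter of the weights and of the output at a time,
# trading repeated input reads for a smaller peak working set.
```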