Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic
Programming
- URL: http://arxiv.org/abs/2301.12187v2
- Date: Fri, 2 Jun 2023 15:46:38 GMT
- Title: Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic
Programming
- Authors: Jinuk Kim, Yeonwoo Jeong, Deokjae Lee, Hyun Oh Song
- Abstract summary: We propose a novel depth compression algorithm which targets general convolution operations.
We achieve $1.41\times$ speed-up with $0.11$\%p accuracy gain in MobileNetV2-1.0 on the ImageNet.
- Score: 15.458305667190256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works on neural network pruning advocate that reducing the depth of
the network is more effective in reducing run-time memory usage and
accelerating inference latency than reducing the width of the network through
channel pruning. In this regard, some recent works propose depth compression
algorithms that merge convolution layers. However, the existing algorithms have
a constricted search space and rely on human-engineered heuristics. In this
paper, we propose a novel depth compression algorithm which targets general
convolution operations. We propose a subset selection problem that replaces
inefficient activation layers with identity functions and optimally merges
consecutive convolution operations into shallow equivalent convolution
operations for efficient end-to-end inference latency. Since the proposed
subset selection problem is NP-hard, we formulate a surrogate optimization
problem that can be solved exactly via two-stage dynamic programming within a
few seconds. We evaluate our methods and baselines by TensorRT for a fair
inference latency comparison. Our method outperforms the baseline method with
higher accuracy and faster inference speed in MobileNetV2 on the ImageNet
dataset. Specifically, we achieve $1.41\times$ speed-up with $0.11$\%p accuracy
gain in MobileNetV2-1.0 on the ImageNet.
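The merging step the abstract refers to rests on a standard identity: once the activation between two convolutions is replaced by the identity function, the pair collapses into a single equivalent convolution whose kernel size is $k_1 + k_2 - 1$. Below is a minimal PyTorch sketch of that identity only (stride 1, no padding, no bias); the tensor names are illustrative and this is not the paper's code, which additionally solves the layer-selection problem with two-stage dynamic programming.

```python
# Minimal sketch: two consecutive convolutions with no nonlinearity in between
# are equivalent to a single convolution with a larger kernel (here 3 + 3 - 1 = 5).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 32, 32, dtype=torch.double)   # input: 8 channels
w1 = torch.randn(16, 8, 3, 3, dtype=torch.double)   # first conv:  8 -> 16, kernel 3
w2 = torch.randn(4, 16, 3, 3, dtype=torch.double)   # second conv: 16 -> 4, kernel 3

# Reference: run the two convolutions back to back (stride 1, no padding, no bias).
ref = F.conv2d(F.conv2d(x, w1), w2)

# Merge: convolve w2 with the channel-transposed, spatially flipped w1.
w_merged = F.conv2d(w2, w1.permute(1, 0, 2, 3).flip([-1, -2]), padding=2)
out = F.conv2d(x, w_merged)

print(w_merged.shape)             # torch.Size([4, 8, 5, 5])
print(torch.allclose(ref, out))   # True
```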
Related papers
- LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging [20.774060844559838]
Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer.
These methods suffer from a critical drawback: the kernel size of the merged layers becomes larger.
We show that this problem can be addressed by jointly pruning convolution layers and activation functions.
We propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove.
arXiv Detail & Related papers (2024-06-18T17:55:15Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
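Of the three paradigms above, dynamic layer skipping is the simplest to illustrate. The following is a generic sketch of a gated residual block, not LAUDNet's implementation; the gate design, the hard threshold, and all shapes are placeholder assumptions.

```python
# Generic sketch of dynamic layer skipping: a tiny gate decides per input
# whether a residual block is executed or bypassed.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Gate: global average pool -> linear -> one scalar decision per sample.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, x):
        # Hard decision; training would typically use a differentiable relaxation
        # (e.g., Gumbel-softmax). For brevity the skipped branch is still computed
        # here and multiplied by zero; a real implementation would not execute it.
        keep = (self.gate(x) > 0).to(x.dtype).view(-1, 1, 1, 1)
        return x + keep * self.body(x)

block = SkippableBlock(32).eval()
y = block(torch.randn(2, 32, 16, 16))
print(y.shape)  # torch.Size([2, 32, 16, 16])
```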
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\left(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\right)$ with a communication cost of $O(k \log(d))$ at each iteration.
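To give intuition for what sketching buys in this setting, here is a generic count-sketch compressor for a gradient vector; it only shows how a $d$-dimensional gradient can be communicated with far fewer numbers, and it is not the paper's SketchedAMSGrad update rule. Dimensions and seeds are arbitrary.

```python
# Generic count-sketch of a gradient: hash d coordinates into k buckets with
# random signs, send the k bucket sums, and read back an unbiased estimate.
import numpy as np

rng = np.random.default_rng(0)
d, k = 1024, 64                           # gradient dimension, sketch size (k << d)
bucket = rng.integers(0, k, size=d)       # which bucket each coordinate maps to
sign = rng.choice([-1.0, 1.0], size=d)    # random sign per coordinate

def compress(grad):
    sketch = np.zeros(k)
    np.add.at(sketch, bucket, sign * grad)    # worker sends only k numbers
    return sketch

def decompress(sketch):
    return sign * sketch[bucket]              # unbiased (noisy) gradient estimate

g = rng.standard_normal(d)
g_hat = decompress(compress(g))
print(np.corrcoef(g, g_hat)[0, 1])            # positive correlation despite 16x compression
```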
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Latency-aware Spatial-wise Dynamic Networks [33.88843632160247]
We propose a latency-aware spatial-wise dynamic network (LASNet) for deep networks.
LASNet performs coarse-grained spatially adaptive inference under the guidance of a novel latency prediction model.
Experiments on image classification, object detection and instance segmentation demonstrate that the proposed framework significantly improves the practical inference efficiency of deep networks.
arXiv Detail & Related papers (2022-10-12T14:09:27Z) - DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and
Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
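The core idea of weight slicing can be sketched with a plain convolution whose width is chosen by hand; DS-Net++ instead learns this decision per input with a gating network, which is not reproduced here, and also slices further dimensions in transformers.

```python
# Generic sketch of weight slicing: use only the first `width` output filters of
# one shared convolution, so several sub-networks reuse the same parameters.
import torch
import torch.nn.functional as F

weight = torch.randn(64, 32, 3, 3)     # full convolution: 32 -> 64 channels

def sliced_conv(x, width):
    # Slicing the weight tensor along the output-channel dimension gives a
    # cheaper convolution without storing separate weights per width.
    return F.conv2d(x, weight[:width], padding=1)

x = torch.randn(1, 32, 16, 16)
print(sliced_conv(x, 16).shape)        # torch.Size([1, 16, 16, 16]) -- "easy" input
print(sliced_conv(x, 64).shape)        # torch.Size([1, 64, 16, 16]) -- "hard" input
```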
arXiv Detail & Related papers (2021-09-21T09:57:21Z) - OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
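For context on what a bit-width assignment means, below is a minimal uniform quantizer applied to one weight tensor at several bit-widths; OMPQ's actual contribution, selecting per-layer bit-widths via the orthogonality proxy described above, is not reproduced here.

```python
# Minimal symmetric uniform quantizer: quantization error grows as bits shrink,
# which is why assigning bit-widths per layer is an optimization problem at all.
import torch

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                 # e.g., 127 for 8 bits
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

torch.manual_seed(0)
w = torch.randn(256)
for bits in (8, 4, 2):
    err = (quantize(w, bits) - w).abs().mean()
    print(bits, round(float(err), 4))
```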
arXiv Detail & Related papers (2021-09-16T10:59:33Z) - HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT)
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search (NAS)-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with 0.4% drop in the top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z) - Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z) - FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale
Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP).
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)