Structural Pruning via Latency-Saliency Knapsack
- URL: http://arxiv.org/abs/2210.06659v1
- Date: Thu, 13 Oct 2022 01:41:59 GMT
- Title: Structural Pruning via Latency-Saliency Knapsack
- Authors: Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M.
Alvarez
- Abstract summary: Hardware-Aware Latency Pruning (HALP)
HALP formulates structural pruning as a global resource allocation optimization problem.
It uses a latency lookup table to track latency reduction potential and a global saliency score to gauge accuracy drop.
- Score: 40.562285600570924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structural pruning can simplify network architecture and improve inference
speed. We propose Hardware-Aware Latency Pruning (HALP) that formulates
structural pruning as a global resource allocation optimization problem, aiming
at maximizing the accuracy while constraining latency under a predefined budget
on the target device. For filter importance ranking, HALP leverages a latency
lookup table to track latency reduction potential and a global saliency score to
gauge accuracy drop. Both metrics can be evaluated very efficiently during
pruning, allowing us to reformulate global structural pruning as a reward
maximization problem given the target constraint. This makes the problem solvable
via our augmented knapsack solver, enabling HALP to surpass prior work in
pruning efficacy and accuracy-efficiency trade-off. We examine HALP on both
classification and detection tasks, over varying networks, on ImageNet and VOC
datasets, on different platforms. In particular, for ResNet-50/-101 pruning on
ImageNet, HALP improves network throughput by $1.60\times$/$1.90\times$ with
$+0.3\%$/$-0.2\%$ top-1 accuracy changes, respectively. For SSD pruning on VOC,
HALP improves throughput by $1.94\times$ with only a $0.56$ mAP drop. HALP
consistently outperforms prior art, sometimes by large margins. Project page at
https://halp-neurips.github.io/.
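The knapsack view in the abstract can be made concrete with a small sketch: treat each filter as a knapsack item whose value is its saliency and whose weight is the latency reduction read from a lookup table, then keep the subset that maximizes saliency within the latency budget. This is only a minimal illustration of the formulation, not the paper's augmented solver (which additionally handles per-layer neuron ordering); the variable names and the integer quantization of latency are assumptions for the example.

```python
import numpy as np

def knapsack_prune(saliency, latency_cost, latency_budget):
    """Toy 0/1 knapsack over filters: value = saliency, weight = latency cost
    from a lookup table, capacity = latency budget (integer units assumed).
    Returns indices of filters to KEEP."""
    n = len(saliency)
    W = int(latency_budget)
    # classic dynamic program over latency units
    best = np.zeros(W + 1)
    keep = np.zeros((n, W + 1), dtype=bool)
    for i in range(n):
        w, v = int(latency_cost[i]), saliency[i]
        for b in range(W, w - 1, -1):
            if best[b - w] + v > best[b]:
                best[b] = best[b - w] + v
                keep[i, b] = True
    # backtrack to recover the kept filters
    kept, b = [], W
    for i in range(n - 1, -1, -1):
        if keep[i, b]:
            kept.append(i)
            b -= int(latency_cost[i])
    return sorted(kept)

# toy usage: 6 filters, latency budget of 10 units
saliency = [0.9, 0.2, 0.5, 0.7, 0.1, 0.4]
latency = [4, 3, 2, 5, 1, 3]
print(knapsack_prune(saliency, latency, latency_budget=10))
```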
Related papers
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH)
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
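As a rough, hedged illustration of how locality-sensitive hashing can compress a latent feature map (not the exact HASTE module), the sketch below hashes channels with random hyperplanes and averages channels whose codes collide; similar channels collide with high probability, so the channel dimension shrinks.

```python
import numpy as np

def lsh_merge_channels(fmap, n_bits=8, seed=0):
    """Illustrative sketch: random-hyperplane LSH over the channels of a
    (C, H, W) feature map; channels sharing a hash code are merged by
    averaging, reducing the channel count without touching dissimilar ones."""
    C, H, W = fmap.shape
    flat = fmap.reshape(C, H * W)
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, H * W))
    # sign pattern against random hyperplanes -> an n_bits binary code per channel
    codes = (flat @ hyperplanes.T > 0).astype(np.uint8)
    buckets = {}
    for c in range(C):
        buckets.setdefault(codes[c].tobytes(), []).append(c)
    # one averaged representative per bucket
    merged = np.stack([flat[idx].mean(axis=0) for idx in buckets.values()])
    return merged.reshape(-1, H, W)

# toy usage: 16 channels, several of them near-duplicates
x = np.random.rand(4, 8, 8)
fmap = np.concatenate([x, x + 0.01 * np.random.rand(4, 8, 8), np.random.rand(8, 8, 8)])
print(lsh_merge_channels(fmap).shape)  # typically fewer than 16 channels remain
```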
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Accelerating Deep Neural Networks via Semi-Structured Activation Sparsity [0.0]
Exploiting sparsity in the network's feature maps is one of the ways to reduce its inference latency.
We propose a solution to induce semi-structured activation sparsity exploitable through minor runtime modifications.
Our approach yields a speed improvement of $1.25\times$ with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset.
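The summary does not specify the sparsity pattern; as one hedged illustration of what "semi-structured" activation sparsity can look like (not necessarily the paper's scheme), the sketch below applies an N:M mask, keeping the n largest-magnitude values in every group of m consecutive activations, which is the kind of fixed structure a runtime kernel can exploit.

```python
import numpy as np

def nm_activation_sparsity(x, n=1, m=2):
    """Illustrative N:M sparsity on an activation tensor: in every group of m
    consecutive values, only the n largest-magnitude entries survive.
    Assumes x.size is divisible by m."""
    groups = x.reshape(-1, m)
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]  # smallest entries per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(x.shape)

# toy usage: 1:2 sparsity on a small activation map
act = np.random.randn(2, 4, 4)
sparse_act = nm_activation_sparsity(act, n=1, m=2)
print((sparse_act == 0).mean())  # about half of the activations are zeroed
```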
arXiv Detail & Related papers (2023-09-12T22:28:53Z)
- Layer-adaptive Structured Pruning Guided by Latency [7.193554978191659]
Structured pruning can simplify network architecture and improve inference speed.
We propose SP-LAMP, a global importance score derived by extending the LAMP score from unstructured pruning to structured pruning.
Experimental results with ResNet56 on CIFAR10 demonstrate that our algorithm achieves lower latency than alternative approaches.
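For context, the unstructured LAMP score this work builds on normalizes each weight's squared magnitude by the total squared magnitude of all weights in the layer that are at least as large. The sketch below computes that score and shows one plausible way to lift it to filter granularity; the filter-level extension is an assumption for illustration, not necessarily the paper's exact SP-LAMP derivation.

```python
import numpy as np

def lamp_scores(weights):
    """LAMP score (unstructured): sort weights by squared magnitude ascending;
    each weight's score is its squared magnitude divided by the sum of squared
    magnitudes of all weights at least as large. Returned in original order."""
    w2 = weights.flatten() ** 2
    order = np.argsort(w2)                       # ascending
    sorted_w2 = w2[order]
    suffix = np.cumsum(sorted_w2[::-1])[::-1]    # mass of weights not yet pruned
    scores = np.empty_like(w2)
    scores[order] = sorted_w2 / suffix
    return scores.reshape(weights.shape)

def structured_lamp_filter_scores(conv_weight):
    """Hedged structured variant: collapse each filter (output channel) to its
    squared L2 norm, then apply the same suffix normalization at filter level."""
    filter_mass = (conv_weight.reshape(conv_weight.shape[0], -1) ** 2).sum(axis=1)
    return lamp_scores(filter_mass)

# toy usage: a conv layer with 8 filters of shape (3, 3, 3)
w = np.random.randn(8, 3, 3, 3)
print(structured_lamp_filter_scores(w))  # prune filters with the smallest scores first
```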
arXiv Detail & Related papers (2023-05-23T11:18:37Z)
- DeepReShape: Redesigning Neural Networks for Efficient Private Inference [3.7802450241986945]
Recent work has shown that FLOPs for private inference (PI) can no longer be ignored and incur high latency penalties.
We develop DeepReShape, a technique that optimizes neural network architectures under PI's constraints.
arXiv Detail & Related papers (2023-04-20T18:27:02Z)
- Neural Network Pruning by Cooperative Coevolution [16.0753044050118]
We propose CCEP, a new filter pruning algorithm based on cooperative coevolution.
CCEP reduces the pruning space by a divide-and-conquer strategy.
Experiments show that CCEP can achieve a competitive performance with the state-of-the-art pruning methods.
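A toy sketch of the divide-and-conquer, cooperative-coevolution idea follows; it is heavily simplified relative to the paper (one individual per layer and a plain (1+1) mutation step), and `fitness` is a hypothetical user-supplied accuracy proxy, not part of the original method.

```python
import random

def ccep_sketch(filters_per_layer, fitness, generations=20, seed=0):
    """Toy cooperative coevolution: each layer keeps its own binary keep-mask
    (the divide step); masks are mutated layer by layer and a candidate is
    accepted only if the jointly evaluated network does not get worse
    (the cooperate step)."""
    random.seed(seed)
    masks = [[1] * n for n in filters_per_layer]    # start from the full network
    best = fitness(masks)
    for _ in range(generations):
        for layer in range(len(masks)):
            cand = masks[layer][:]
            cand[random.randrange(len(cand))] ^= 1  # flip one filter on/off
            trial = masks[:layer] + [cand] + masks[layer + 1:]
            score = fitness(trial)                  # joint (cooperative) evaluation
            if score >= best:
                masks[layer], best = cand, score
    return masks

# toy usage: reward keeping the first filter of each layer, penalize network size
toy_fitness = lambda ms: sum(m[0] for m in ms) - 0.1 * sum(sum(m) for m in ms)
print(ccep_sketch([4, 6], toy_fitness))
```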
arXiv Detail & Related papers (2022-04-12T09:06:38Z)
- Interspace Pruning: Using Adaptive Filter Representations to Improve Training of Sparse CNNs [69.3939291118954]
Unstructured pruning is well suited to reduce the memory footprint of convolutional neural networks (CNNs)
Standard unstructured pruning (SP) reduces the memory footprint of CNNs by setting filter elements to zero.
We introduce interspace pruning (IP), a general tool to improve existing pruning methods.
arXiv Detail & Related papers (2022-03-15T11:50:45Z)
- Selective Network Linearization for Efficient Private Inference [49.937470642033155]
We propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy.
The results demonstrate up to $4.25\%$ more accuracy (iso-ReLU count at 50K) or $2.2\times$ less latency (iso-accuracy at 70%) than the current state of the art.
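One hedged way to picture selective ReLU linearization is a per-position gate that interpolates between ReLU and the identity; the gate parameterization below is an assumption for illustration, not the paper's exact algorithm, which learns the gates with a gradient-based procedure and a budget on the remaining ReLU count.

```python
import numpy as np

def gated_relu(x, gate):
    """Sketch: gate = 1 keeps the ReLU at that position, gate = 0 replaces it
    with the identity (a linear op, cheap under private inference). During
    training the gates would be relaxed to [0, 1] and penalized so that
    gradient descent learns which ReLUs can be linearized; at inference
    they are rounded to {0, 1}."""
    return gate * np.maximum(x, 0.0) + (1.0 - gate) * x

# toy usage: half the positions keep their ReLU, half become linear
x = np.random.randn(4, 4)
gate = (np.random.rand(4, 4) > 0.5).astype(float)
print(gated_relu(x, gate))
```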
arXiv Detail & Related papers (2022-02-04T19:00:24Z)
- HALP: Hardware-Aware Latency Pruning [25.071902504529465]
Hardware-Aware Latency Pruning (HALP)
HALP formulates structural pruning as a global resource allocation optimization problem.
We examine HALP on both classification and detection tasks, over varying networks, on ImageNet and VOC datasets.
arXiv Detail & Related papers (2021-10-20T22:34:51Z)
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT)
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
- EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning [82.54669314604097]
EagleEye is a simple yet efficient evaluation component based on adaptive batch normalization.
It unveils a strong correlation between different pruned structures and their final settled accuracy.
This module is also general to plug-in and improve some existing pruning algorithms.
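The adaptive-batch-normalization evaluation can be sketched as re-estimating the per-channel BN statistics of a candidate pruned sub-network from a few batches before scoring it, instead of reusing the parent network's stale statistics. In the sketch below, `subnet_features` is a hypothetical iterable of pre-BN feature batches produced by the candidate sub-network.

```python
import numpy as np

def adaptive_bn_stats(subnet_features, n_batches=10):
    """Re-estimate per-channel mean and variance from a handful of batches of
    shape (N, C, H, W); these refreshed statistics are then used to evaluate
    the pruned sub-network."""
    means, vars_ = [], []
    for i, feat in enumerate(subnet_features):
        if i >= n_batches:
            break
        means.append(feat.mean(axis=(0, 2, 3)))   # per-channel batch mean
        vars_.append(feat.var(axis=(0, 2, 3)))    # per-channel batch variance
    return np.mean(means, axis=0), np.mean(vars_, axis=0)

def batchnorm(feat, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply batch norm with the recalibrated statistics."""
    return gamma * (feat - mean[None, :, None, None]) / np.sqrt(
        var[None, :, None, None] + eps) + beta

# toy usage: recalibrate on random "batches", then normalize one of them
batches = [np.random.randn(8, 16, 4, 4) for _ in range(10)]
mu, var = adaptive_bn_stats(iter(batches))
print(batchnorm(batches[0], mu, var).mean())  # roughly zero-mean after recalibration
```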
arXiv Detail & Related papers (2020-07-06T01:32:31Z)