SNP: Structured Neuron-level Pruning to Preserve Attention Scores
- URL: http://arxiv.org/abs/2404.11630v1
- Date: Thu, 18 Apr 2024 03:21:28 GMT
- Title: SNP: Structured Neuron-level Pruning to Preserve Attention Scores
- Authors: Kyunghwan Shim, Jaewoong Yun, Shinkook Choi
- Abstract summary: Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs).
We propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP).
Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors.
- Score: 2.4204190488008046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder deployment on resource-constrained devices. Conventional pruning approaches can compress and accelerate the MSA module only through head pruning, even though a head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers with the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. The proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, DeiT-Small with SNP runs 3.1$\times$ faster than the original model, and it is 21.94\% faster than DeiT-Tiny while achieving 1.12\% higher accuracy. Additionally, SNP combines successfully with conventional head or block pruning approaches. SNP with head pruning can compress DeiT-Base by 80\% in parameters and computational cost, achieving inference speedups of 3.85$\times$ on an RTX 3090 and 4.93$\times$ on a Jetson Nano.
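As a rough illustration of the idea, the sketch below scores each query/key dimension of one head by its contribution to the attention logits and prunes the least informative pairs jointly, handling value neurons separately. It assumes PyTorch; the function name, shapes, and the value-side criterion are illustrative, not the authors' implementation.

```python
import torch

def snp_prune_head(W_q, W_k, W_v, X, keep_ratio=0.5):
    """Illustrative neuron-level pruning of a single attention head.

    W_q, W_k, W_v: (d_head, d_model) projection weights of one head.
    X: (n_tokens, d_model) calibration activations.
    """
    Q, K = X @ W_q.T, X @ W_k.T                       # (n_tokens, d_head)
    # The attention logit Q @ K^T decomposes as a sum over dimensions,
    # so each dimension's term measures how much it shapes the scores.
    contrib = torch.stack([(Q[:, i:i + 1] @ K[:, i:i + 1].T).abs().sum()
                           for i in range(Q.shape[1])])
    n_keep = max(1, int(keep_ratio * Q.shape[1]))
    keep = contrib.topk(n_keep).indices.sort().values
    # Query and key neurons are graphically connected: pruning the same
    # indices in both preserves the dominant terms of the attention scores.
    W_q_p, W_k_p = W_q[keep], W_k[keep]
    # Value neurons do not affect the scores and can be pruned on their
    # own; row norm stands in here for the paper's redundancy criterion.
    v_keep = W_v.norm(dim=1).topk(n_keep).indices.sort().values
    return W_q_p, W_k_p, W_v[v_keep]
```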
Related papers
- Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons [27.289945121113277]
We introduce DemP, a method that controls the proliferation of dead neurons to dynamically induce sparsity.
Experiments on CIFAR10 and ImageNet datasets demonstrate superior accuracy-sparsity tradeoffs.
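For intuition, here is a toy sketch of the underlying mechanism: identifying neurons whose ReLU activations have saturated to zero on calibration data and removing them. The names and tolerance are assumptions, and DemP's actual regularisation-driven procedure is more involved.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def alive_mask(layer: nn.Linear, X: torch.Tensor, tol: float = 1e-6):
    """Mark neurons that fire at least once on calibration inputs X."""
    act = torch.relu(X @ layer.weight.T + layer.bias)  # (n_samples, out)
    return (act > tol).any(dim=0)

def drop_dead(layer: nn.Linear, alive: torch.Tensor) -> nn.Linear:
    """Rebuild the layer with only the surviving (non-dead) neurons."""
    idx = alive.nonzero(as_tuple=True)[0]
    new = nn.Linear(layer.in_features, len(idx))
    new.weight.data = layer.weight.data[idx].clone()
    new.bias.data = layer.bias.data[idx].clone()
    return new
```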
arXiv Detail & Related papers (2024-03-12T14:28:06Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
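A toy version of the search such a pass performs: enumerate slice sizes that fit in on-chip memory and keep the one a simple cost model rates fastest. The cost model and parameters here are invented placeholders, not TSO's measured NPU model.

```python
def best_slice(out_channels: int, n_cores: int,
               sram_bytes: int, bytes_per_channel: int) -> int:
    """Pick an output-channel slice size minimizing a toy runtime proxy."""
    candidates = [s for s in range(1, out_channels + 1)
                  if s * bytes_per_channel <= sram_bytes]  # must fit on-chip
    def cost(slice_sz: int) -> int:
        n_slices = -(-out_channels // slice_sz)   # ceil division
        waves = -(-n_slices // n_cores)           # scheduling rounds on the cores
        return waves * slice_sz                   # proxy for execution time
    return min(candidates, key=cost)
```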
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Versatile Neural Processes for Learning Implicit Neural Representations [57.090658265140384]
We propose Versatile Neural Processes (VNP), which largely increases the capability of approximating functions.
Specifically, we introduce a bottleneck encoder that produces fewer but more informative context tokens, relieving the high computational cost.
We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals.
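A minimal sketch of a bottleneck encoder in this spirit: a small set of learned latent tokens cross-attends to a large context, so downstream cost scales with the latent count rather than the context size. Module and parameter names are assumptions, not the VNP code.

```python
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Compress N context tokens into a few informative latent tokens."""
    def __init__(self, d_model: int = 128, n_latents: int = 16, n_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, N, d_model) with N >> n_latents
        q = self.latents.expand(context.size(0), -1, -1)
        out, _ = self.attn(q, context, context)   # (B, n_latents, d_model)
        return out
```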
arXiv Detail & Related papers (2023-01-21T04:08:46Z) - Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration [138.24994198567794]
iTPN is built around elaborated designs, including the first pre-trained feature pyramid upon a vision transformer (ViT).
Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss.
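For a rough picture of a feature pyramid on top of a ViT, the generic sketch below reshapes token maps from several blocks into spatial maps at multiple strides; the grid size and scales are assumptions, and iTPN's integral pre-training is not captured here.

```python
import torch
import torch.nn.functional as F

def vit_feature_pyramid(block_tokens, grid: int = 14,
                        scales=(4.0, 2.0, 1.0, 0.5)):
    """block_tokens: list of (B, grid*grid, C) token maps from chosen ViT blocks."""
    pyramid = []
    for tokens, s in zip(block_tokens, scales):
        B, N, C = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, C, grid, grid)
        pyramid.append(F.interpolate(fmap, scale_factor=s,
                                     mode="bilinear", align_corners=False))
    return pyramid  # e.g. strides 4, 8, 16, 32 for a 224x224 input
```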
arXiv Detail & Related papers (2022-11-23T06:56:12Z) - Receding Neuron Importances for Structured Pruning [11.375436522599133]
Structured pruning efficiently compresses networks by identifying and removing unimportant neurons.
We introduce a simple BatchNorm variation with bounded scaling parameters, based on which we design a novel regularisation term that suppresses only neurons with low importance.
We show that neural networks trained this way can be pruned to a larger extent and with less deterioration.
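A sketch of what such a design could look like, assuming the bound comes from a sigmoid and the regulariser only pushes on scales already below a threshold; both choices are illustrative stand-ins for the paper's actual formulation.

```python
import torch
import torch.nn as nn

class BoundedBN(nn.BatchNorm2d):
    """BatchNorm whose per-channel scale is bounded to (0, 1) via a sigmoid."""
    def __init__(self, num_features: int):
        super().__init__(num_features, affine=False)
        self.s = nn.Parameter(torch.zeros(num_features))      # free parameter
        self.shift = nn.Parameter(torch.zeros(num_features))

    def scale(self) -> torch.Tensor:
        return torch.sigmoid(self.s)                          # bounded scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = super().forward(x)
        return x * self.scale()[None, :, None, None] + self.shift[None, :, None, None]

def receding_penalty(bn: BoundedBN, tau: float = 0.5) -> torch.Tensor:
    """Regularise only low-importance channels, leaving strong ones untouched."""
    g = bn.scale()
    return (g * (g < tau)).sum()
```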
arXiv Detail & Related papers (2022-04-13T14:08:27Z) - Weight, Block or Unit? Exploring Sparsity Tradeoffs for Speech Enhancement on Tiny Neural Accelerators [4.1070979067056745]
We explore network sparsification strategies with the aim of compressing neural speech enhancement (SE) down to an optimal configuration for a new generation of low-power, microcontroller-based neural accelerators (microNPUs).
We examine three unique sparsity structures: weight pruning, block pruning, and unit pruning, and discuss their benefits and drawbacks when applied to SE; a toy rendering of the three structures follows below.
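As a quick illustration, the three structures can be expressed as magnitude-based masks over a linear layer's weight. Function names and the block shape are mine, and the paper's grouping over conv kernels differs in detail.

```python
import torch

def weight_mask(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured: prune individual weights by magnitude."""
    k = max(1, int(sparsity * W.numel()))
    thresh = W.abs().flatten().kthvalue(k).values
    return (W.abs() > thresh).float()

def block_mask(W: torch.Tensor, sparsity: float, bh: int = 4, bw: int = 4):
    """Block: prune bh x bw tiles by their summed magnitude."""
    O, I = W.shape
    blocks = W.abs().reshape(O // bh, bh, I // bw, bw).sum(dim=(1, 3))
    k = max(1, int(sparsity * blocks.numel()))
    thresh = blocks.flatten().kthvalue(k).values
    keep = (blocks > thresh).float()
    return keep.repeat_interleave(bh, 0).repeat_interleave(bw, 1)

def unit_mask(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unit: prune whole output neurons (rows) by their L1 norm."""
    norms = W.abs().sum(dim=1)
    k = max(1, int(sparsity * norms.numel()))
    thresh = norms.kthvalue(k).values
    return (norms > thresh).float()[:, None].expand_as(W)
```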
arXiv Detail & Related papers (2021-11-03T17:06:36Z) - GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove the channels whose importance scores are smallest.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on/off state of each channel.
Experiments on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance.
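A rough sketch of a differentiable polarizing gate placed before a conv layer; the particular gate function g = theta^2 / (theta^2 + eps) is an assumption used for illustration, not necessarily GDP's exact form.

```python
import torch
import torch.nn as nn

class PolarGate(nn.Module):
    """Per-channel gate that polarises towards 0 or 1 yet stays differentiable."""
    def __init__(self, channels: int, eps: float = 1e-3):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); gates near zero mark channels that can be removed.
        g = self.theta ** 2 / (self.theta ** 2 + self.eps)
        return x * g[None, :, None, None]
```

After training, channels whose gate settles near zero can be dropped together with the matching input channels of the following convolution.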
arXiv Detail & Related papers (2021-09-06T03:17:10Z) - 1$\times$N Block Pattern for Network Sparsity [90.43191747596491]
We propose a novel concept of $1\times N$ block sparsity pattern (block pruning) to break this limitation.
Our pattern obtains about a 3.0\% top-1 accuracy improvement over filter pruning on MobileNet-V2.
It also saves 56.04 ms of inference time on a Cortex-A7 CPU compared with weight pruning.
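A toy rendering of the pattern on a (out, in) weight matrix: weights are grouped into 1 x N blocks along the input dimension and whole blocks are pruned by L1 norm. The paper's exact grouping over consecutive conv kernels differs slightly.

```python
import torch

def one_by_n_mask(W: torch.Tensor, sparsity: float, N: int = 4) -> torch.Tensor:
    """Keep/drop 1 x N blocks of W (shape (O, I), I divisible by N) by L1 norm."""
    O, I = W.shape
    blocks = W.abs().reshape(O, I // N, N).sum(dim=2)   # (O, I/N) block scores
    k = max(1, int(sparsity * blocks.numel()))
    thresh = blocks.flatten().kthvalue(k).values
    keep = (blocks > thresh).float()                    # 1 = keep the block
    return keep.repeat_interleave(N, dim=1)             # back to (O, I)
```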
arXiv Detail & Related papers (2021-05-31T05:50:33Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often carry a large number of parameters and require heavy computation.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with several innovations.
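To make the coarse-to-fine scheme concrete, here is a generic refinement step: upsample the coarse flow, warp the second feature map with it, and predict a residual flow. This is the standard pyramid recipe, not FastFlowNet's specific modules; the `predictor` is a placeholder network.

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp feat (B,C,H,W) by a pixel-unit flow (B,2,H,W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(feat)             # (2, H, W)
    coords = grid[None] + flow
    coords = torch.stack((2 * coords[:, 0] / (W - 1) - 1,    # normalise x
                          2 * coords[:, 1] / (H - 1) - 1),   # normalise y
                         dim=-1)                             # (B, H, W, 2)
    return F.grid_sample(feat, coords, align_corners=True)

def refine(flow_coarse, feat1, feat2, predictor):
    """One pyramid level: upsample coarse flow, warp, estimate residual flow."""
    flow_up = 2.0 * F.interpolate(flow_coarse, scale_factor=2,
                                  mode="bilinear", align_corners=False)
    feat2_w = warp(feat2, flow_up)
    return flow_up + predictor(torch.cat([feat1, feat2_w, flow_up], dim=1))
```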
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - Network Automatic Pruning: Start NAP and Take a Nap [94.14675930881366]
We propose NAP, a unified and automatic pruning framework for both fine-grained and structured pruning.
It can find out unimportant components of a network and automatically decide appropriate compression ratios for different layers.
Despite being simple to use, NAP outperforms previous pruning methods by large margins.
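A deliberately simplified illustration of automatic per-layer ratios: rank all weights under one global importance score and let each layer's compression ratio fall out of a single global threshold. NAP's actual importance criterion is more sophisticated; magnitude is used here only as a stand-in.

```python
import torch

def auto_ratios(layer_weights, global_sparsity: float = 0.8):
    """Derive per-layer pruning ratios from one global magnitude threshold."""
    scores = torch.cat([W.abs().flatten() for W in layer_weights])
    k = max(1, int(global_sparsity * scores.numel()))
    thresh = scores.kthvalue(k).values
    return [(W.abs() <= thresh).float().mean().item()  # fraction pruned per layer
            for W in layer_weights]
```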
arXiv Detail & Related papers (2021-01-17T07:09:19Z) - SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning [10.981433334942476]
We present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access.
Experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, and 347x speedups and 1.4x, 3.2x, 1193x, and 4059x energy savings.
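A software-level sketch of cascade token pruning, the part of the co-design that accumulates how much attention each token receives and drops the lowest-scoring tokens as depth grows; this toy PyTorch version only mimics the algorithmic idea, not the hardware.

```python
import torch

def prune_tokens(x, cum_score, attn_probs, keep_ratio: float = 0.75):
    """x: (B, N, C) tokens; attn_probs: (B, H, N, N) this layer's attention.

    cum_score: (B, N) running importance carried across layers.
    """
    cum_score = cum_score + attn_probs.sum(dim=(1, 2))  # attention received per token
    n_keep = max(1, int(keep_ratio * x.size(1)))
    idx = cum_score.topk(n_keep, dim=1).indices.sort(dim=1).values
    x = torch.gather(x, 1, idx[..., None].expand(-1, -1, x.size(2)))
    return x, torch.gather(cum_score, 1, idx)
```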
arXiv Detail & Related papers (2020-12-17T18:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.