Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices
- URL: http://arxiv.org/abs/2407.19644v1
- Date: Mon, 29 Jul 2024 01:59:06 GMT
- Title: Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices
- Authors: Hayun Lee, Dongkun Shin
- Abstract summary: Block-wise pruning is promising due to its low accuracy drop tradeoff for speedup gains.
Unaligned block pruning (UBP) addresses this by allowing blocks to be selected at arbitrary positions.
We propose a pseudo-optimal yet fast block selection algorithm called Block Expansion and Division.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the recent proliferation of on-device AI, there is an increasing need to run computationally intensive DNNs directly on mobile devices. However, the limited computing and memory resources of these devices necessitate effective pruning techniques. Block-wise pruning is promising due to its low accuracy drop tradeoff for speedup gains, but it requires block positions to be aligned with block size, hindering optimal position selection to minimize model accuracy drop. Unaligned block pruning (UBP) addresses this by allowing blocks to be selected at arbitrary positions, yet its practical use is limited by a time-consuming optimal block selection algorithm and lack of efficient inference kernels. In this paper, we propose a pseudo-optimal yet fast block selection algorithm called Block Expansion and Division (BED), which can be integrated into an iterative model training process. Additionally, we introduce an efficient inference kernel implementation for mobile devices, enabling a UBP-based model to achieve similar latency to a DNN model compressed by aligned block pruning. We demonstrate the superiority of our techniques on a real mobile phone with MobileNet and ResNet models.
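The difference between aligned and unaligned block selection described in the abstract can be illustrated with a small sketch. It is not the paper's BED algorithm or its inference kernel: it assumes a single 1-D row of weights, scores each block by the sum of absolute weights (an assumed importance measure), and uses a plain exact dynamic program for the unaligned case. The function names `block_scores`, `aligned_select`, and `unaligned_select` are hypothetical.

```python
from typing import List, Tuple


def block_scores(weights: List[float], block: int) -> List[float]:
    """Score of the block starting at each position: sum of |w| (assumed metric)."""
    n = len(weights)
    return [sum(abs(w) for w in weights[s:s + block]) for s in range(n - block + 1)]


def aligned_select(weights: List[float], block: int, keep: int) -> Tuple[float, List[int]]:
    """Keep the `keep` best blocks whose start index is a multiple of `block`."""
    if keep * block > len(weights):
        raise ValueError("not enough room for the requested blocks")
    scores = block_scores(weights, block)
    starts = range(0, len(weights) - block + 1, block)
    best = sorted(starts, key=lambda s: scores[s], reverse=True)[:keep]
    return sum(scores[s] for s in best), sorted(best)


def unaligned_select(weights: List[float], block: int, keep: int) -> Tuple[float, List[int]]:
    """Keep the `keep` best non-overlapping blocks at arbitrary start positions.

    Exact dynamic program: best[i][k] is the highest total score obtainable
    with k blocks placed entirely inside the first i weights.
    """
    n = len(weights)
    if keep * block > n:
        raise ValueError("not enough room for the requested blocks")
    scores = block_scores(weights, block)
    neg = float("-inf")
    best = [[neg] * (keep + 1) for _ in range(n + 1)]
    take = [[False] * (keep + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, keep + 1):
            best[i][k] = best[i - 1][k]          # weight i-1 is not a block end
            if i >= block and best[i - block][k - 1] != neg:
                cand = best[i - block][k - 1] + scores[i - block]
                if cand > best[i][k]:            # a block ends at weight i-1
                    best[i][k] = cand
                    take[i][k] = True
    # Walk back through the table to recover the chosen start positions.
    starts: List[int] = []
    i, k = n, keep
    while k > 0 and i > 0:
        if take[i][k]:
            starts.append(i - block)
            i -= block
            k -= 1
        else:
            i -= 1
    return best[n][keep], sorted(starts)


if __name__ == "__main__":
    row = [1.0, 9.0, 8.0, 1.0, 1.0, 7.0, 6.0, 1.0]
    print("aligned  :", aligned_select(row, block=2, keep=2))
    print("unaligned:", unaligned_select(row, block=2, keep=2))
```

On this toy row the large weights straddle the aligned block boundaries, so the unaligned selection retains more total magnitude than the aligned one for the same number of kept blocks. The dynamic program costs O(n · keep) per row, which hints at why a faster approximate selection is attractive inside an iterative training loop.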
Related papers
- BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices [14.536949788395837]
Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden.
We develop a BFP-based bitwidth-aware analytical modeling framework (called "BitQ") for the best BFP implementation of DNN inference on embedded platforms.
arXiv Detail & Related papers (2024-09-25T17:03:49Z) - Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can exhibit human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - LegoDNN: Block-grained Scaling of Deep Neural Networks for Mobile Vision [27.74191483754982]
We present LegoDNN, a block-grained scaling solution for running multi-DNN workloads in mobile vision systems.
LegoDNN guarantees short model training times by only extracting and training a small number of common blocks.
We show that LegoDNN provides 1,296x to 279,936x more options in model sizes without increasing training time.
arXiv Detail & Related papers (2021-12-18T06:04:03Z) - Architecture Aware Latency Constrained Sparse Neural Networks [35.50683537052815]
In this paper, we design an architecture aware latency constrained sparse framework to prune and accelerate CNN models.
We also propose a novel sparse convolution algorithm for efficient computation.
Our system-algorithm co-design framework can achieve a much better frontier between network accuracy and latency on resource-constrained mobile devices.
arXiv Detail & Related papers (2021-09-01T03:41:31Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization [56.3111706960878]
High-end mobile platforms serve as primary computing devices for a wide range of Deep Neural Network (DNN) applications.
Constrained computation and storage resources on these devices pose significant challenges for real-time inference executions.
We propose a set of hardware-friendly structured model pruning and compiler optimization techniques to accelerate DNN executions on mobile devices.
arXiv Detail & Related papers (2020-04-22T03:18:23Z) - A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z) - An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices [58.62801151916888]
We introduce a new sparsity dimension, namely pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly.
Our approach on the new pattern-based sparsity naturally fits into compiler optimization for highly efficient DNN execution on mobile platforms.
arXiv Detail & Related papers (2020-01-20T16:17:36Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)