Related papers: HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference

URL: http://arxiv.org/abs/2602.06069v1
Date: Mon, 02 Feb 2026 18:17:45 GMT
Title: HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference
Authors: Dinesh Gopalan, Ratul Ali
Abstract summary: Hybrid Quantization and Pruning (HQP) framework designed to achieve synergistic model acceleration. HQP framework achieves a peak performance gain of 3.12 times inference speedup and a 55 percent model size reduction.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from a highly efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. This pruning is strictly conditional, enforcing an adherence to a maximum permissible accuracy drop (Δ_max) before the model proceeds to 8-bit post-training quantization. This rigorous coordination is critical, as it ensures the resultant sparse model structure is maximally robust to quantization error and hardware-specific kernel optimization. Exhaustive evaluation across heterogeneous NVIDIA Jetson edge platforms, utilizing resource-efficient architectures like MobileNetV3 and ResNet-18, demonstrates that the HQP framework achieves a peak performance gain of 3.12 times inference speedup and a 55 percent model size reduction, while rigorously containing the accuracy drop below the 1.5 percent constraint. A comprehensive comparative analysis against conventional single-objective compression techniques validates the HQP framework as a superior, hardware-agnostic solution for deploying ultra-low-latency AI in resource-limited edge infrastructures.
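The pipeline the abstract describes (Fisher-based filter sensitivity, pruning gated by an accuracy budget, then 8-bit quantization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the diagonal empirical-Fisher saliency 0.5 * F * w^2, the one-shot (non-recomputed) scores, the caller-supplied evaluate callback, and the symmetric per-tensor int8 scheme are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Filter = List[float]


def fisher_filter_scores(weights: List[Filter],
                         per_sample_grads: List[List[Filter]]) -> List[float]:
    """Assumed saliency: diagonal empirical Fisher F_i = mean(g_i^2),
    summed per filter as 0.5 * F_i * w_i^2."""
    n = len(per_sample_grads)
    scores = []
    for f, filt in enumerate(weights):
        total = 0.0
        for i, w in enumerate(filt):
            fisher = sum(g[f][i] ** 2 for g in per_sample_grads) / n
            total += 0.5 * fisher * w * w
        scores.append(total)
    return scores


def prune_under_budget(weights: List[Filter],
                       scores: List[float],
                       evaluate: Callable[[List[Filter]], float],
                       max_drop: float) -> List[int]:
    """Remove filters in order of increasing score; stop as soon as the
    accuracy drop relative to the unpruned model would exceed max_drop.
    Returns the indices of the filters that are kept."""
    baseline = evaluate(weights)
    kept = list(range(len(weights)))
    for f in sorted(kept, key=lambda k: scores[k]):
        if len(kept) == 1:
            break  # never remove the last filter
        trial = [k for k in kept if k != f]
        accuracy = evaluate([weights[k] for k in trial])
        if baseline - accuracy > max_drop:
            break  # budget would be violated; keep the previous structure
        kept = trial
    return kept


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127].
    Dequantize with q * scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale
```

In use, evaluate would run the pruned model on a held-out set and return accuracy as a fraction, and max_drop=0.015 would correspond to the 1.5 percent constraint quoted in the abstract. Quantization runs only on the filters returned by prune_under_budget, which mirrors the prune-then-quantize ordering described above. The abstract calls the sensitivity metric dynamic, so a closer version would presumably recompute the scores after each removal.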

Related papers

Constrained Dynamic Gaussian Splatting [47.982650444869336]
Constrained Dynamic Gaussian Splatting (CDGS) is a novel framework that formulates dynamic scene reconstruction as a budget-constrained optimization problem. We show that CDGS delivers optimal rendering quality under varying capacity limits, achieving over 3x compression compared to state-of-the-art methods.
arXiv Detail & Related papers (2026-02-03T13:53:29Z)
Deep Unfolded Fractional Optimization for Maximizing Robust Throughput in 6G Networks [5.855866479962828]
6G wireless communication networks aim to leverage artificial intelligence tools for efficient and robust network optimization. This paper considers a multi-antenna base station serving multiple users simultaneously through transmit beamforming in downlink mode. To account for robustness, this work proposes an uncertainty-injected deep unfolded fractional programming framework for weighted sum rate (WSR) case.
arXiv Detail & Related papers (2026-01-27T09:56:38Z)
NOVAK: Unified adaptive optimizer for deep neural networks [0.0]
NOVAK is a gradient-based optimization algorithm that integrates adaptive moment estimation, rectified learning-rate scheduling, decoupled weight regularization, multiple variants of Nesterov momentum, and lookahead synchronization into a unified, performance-oriented framework.
arXiv Detail & Related papers (2026-01-11T13:03:57Z)
A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs [7.577235739757108]
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations.
arXiv Detail & Related papers (2025-11-21T10:55:44Z)
SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution [0.8122270502556375]
Real-world image super-resolution must handle complex degradations and inherent reconstruction ambiguities. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. We propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance.
arXiv Detail & Related papers (2025-10-22T06:06:01Z)
Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions. We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values. PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
arXiv Detail & Related papers (2025-08-27T15:59:36Z)
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
Steepest Descent Density Control for Compact 3D Gaussian Splatting [72.54055499344052]
3D Gaussian Splatting (3DGS) has emerged as a powerful real-time, high-resolution novel view. We propose a theoretical framework that demystifies and improves density control in 3DGS. We introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud.
arXiv Detail & Related papers (2025-05-08T18:41:38Z)
GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG. Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization. Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Sharpness-aware Quantization for Deep Neural Networks [45.150346855368]
Sharpness-Aware Quantization (SAQ) is a novel method to explore the effect of Sharpness-Aware Minimization (SAM) on model compression. We show that SAQ improves the generalization performance of the quantized models, yielding the SOTA results in uniform quantization.
arXiv Detail & Related papers (2021-11-24T05:16:41Z)
A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices. Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data. We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.