Related papers: Quantizing Whisper-small: How design choices affect ASR performance

URL: http://arxiv.org/abs/2511.08093v1
Date: Wed, 12 Nov 2025 01:39:12 GMT
Title: Quantizing Whisper-small: How design choices affect ASR performance
Authors: Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal
Abstract summary: We present a unified, cross-library evaluation of post-training quantization on Whisper-small. Our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
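
The abstract separates quantization granularity from bit-width. The following is a minimal sketch of how those two choices affect reconstruction error, using generic symmetric round-to-nearest quantization in plain Python. It is not the API of PyTorch, Optimum-Quanto, HQQ, or bitsandbytes, and it does not reproduce the paper's experiments. The function names (quantize_symmetric, dequantize, mean_abs_error) and the toy weight matrix are assumptions made for illustration only.

```python
import random


def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale.

    Illustrative round-to-nearest scheme, not any specific library's method.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and their scale."""
    return [x * scale for x in q]


def mean_abs_error(weights, bits, per_row):
    """Mean absolute round-trip error for a 2-D weight matrix.

    per_row=True uses one scale per row (per-channel granularity);
    per_row=False uses one scale for the whole matrix (per-tensor).
    """
    if per_row:
        groups = weights
    else:
        groups = [[v for row in weights for v in row]]
    total, count = 0.0, 0
    for group in groups:
        q, scale = quantize_symmetric(group, bits)
        for original, restored in zip(group, dequantize(q, scale)):
            total += abs(original - restored)
            count += 1
    return total / count


random.seed(0)
# Hypothetical toy matrix: row 0 has a 10x larger spread to mimic an outlier channel.
weights = [
    [random.gauss(0.0, 10.0 if i == 0 else 1.0) for _ in range(64)]
    for i in range(8)
]

for bits in (8, 4):
    for per_row in (False, True):
        label = "per-row" if per_row else "per-tensor"
        err = mean_abs_error(weights, bits, per_row)
        print(f"int{bits} {label}: mean abs error = {err:.4f}")
```

In this toy setup, per-row scales keep the outlier row from inflating the step size used for the other rows, and lower bit-widths enlarge the step size everywhere. Whether the same pattern holds for Whisper-small is a question for the paper's measurements, not for this sketch.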

Related papers

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models [0.5964436882344729]
We introduce a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization. The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections. We demonstrate the performance gains of our proposed model by incorporating it with two frontier methods known as OmniQuant and TesseraQ.
arXiv Detail & Related papers (2026-01-11T15:14:05Z)
BAQ: Efficient Bit Allocation Quantization for Large Language Models [8.427223431012454]
Post-training model quantization is a widely adopted technique for reducing memory and computational costs of large language models. Most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. We propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy.
arXiv Detail & Related papers (2025-06-06T01:27:01Z)
Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning [39.56908863102256]
Low-bit post-training quantization impairs mathematical reasoning by up to 69.81% in harder settings. We address two deployment-critical questions with process-level precision. In our settings, as few as 332 curated examples and 3-5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information as large as 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals [10.860081994662645]
Post-training quantization of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. We propose ResQ, a PTQ method that pushes the state of the art further. We demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks.
arXiv Detail & Related papers (2024-12-18T22:01:55Z)
QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning [16.50084447690437]
The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, QuantTune. Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, BERT-base, and OPT.
arXiv Detail & Related papers (2024-03-11T08:09:30Z)
Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective [74.48124653728422]
Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods in practice. We argue that oscillation is an overlooked problem in PTQ methods.
arXiv Detail & Related papers (2023-03-21T14:52:52Z)
Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data. Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models. We show negligible WER change as compared to the full-precision baseline models. Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.