Fixed-point quantization aware training for on-device keyword-spotting
- URL: http://arxiv.org/abs/2303.02284v1
- Date: Sat, 4 Mar 2023 01:06:16 GMT
- Title: Fixed-point quantization aware training for on-device keyword-spotting
- Authors: Sashank Macha, Om Oza, Alex Escott, Francesco Caliva, Robbie Armitano,
Santosh Kumar Cheekatmalla, Sree Hari Krishnan Parthasarathi, Yuzong Liu
- Abstract summary: We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models.
We combine our methodology with two quantization-aware-training (QAT) techniques.
We demonstrate that we can reduce execution time by 68% without compromising the KWS model's predictive performance.
- Score: 4.4488246947396695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fixed-point (FXP) inference has proven suitable for embedded devices with
limited computational resources, yet model training is still performed in
floating point (FLP). FXP training has not been fully explored, and the
non-trivial conversion from FLP to FXP incurs an unavoidable performance drop.
We propose a novel method to train and obtain FXP convolutional
keyword-spotting (KWS) models. We combine our methodology with two
quantization-aware-training (QAT) techniques for model parameters, squashed
weight distribution and absolute cosine regularization, and propose techniques
for extending QAT to transient variables, which previous paradigms neglect.
Experimental results on the Google Speech Commands v2 dataset show that we can
reduce model precision down to 4 bits with no loss in accuracy. Furthermore, on
an in-house KWS dataset, we show that our 8-bit FXP-QAT models achieve a 4-6%
relative improvement in false discovery rate at a fixed false-reject rate
compared to full-precision FLP models. We argue that during inference, FXP-QAT
eliminates q-format normalization and enables low-bit accumulators while
maximizing SIMD throughput, reducing user-perceived latency. We demonstrate
that we can reduce execution time by 68% without compromising the KWS model's
predictive performance or requiring architectural changes. Our work provides
novel findings that aid future research in this area and enable accurate and
efficient models.
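
To make the training-side recipe concrete, below is a minimal sketch of b-bit fixed-point fake quantization for QAT in plain NumPy. The tanh-based squashing, the Q0.f format choice, the |sin|-based pull-to-grid penalty, and all function names are illustrative assumptions, not the paper's exact formulations; the paper's squashed weight distribution and absolute cosine regularization follow their own published forms.

```python
import numpy as np

def squash_weights(w, alpha=3.0):
    # Illustrative "squashed weight distribution": map unbounded FLP weights
    # into (-1, 1) with tanh so they fit a signed fixed-point range.
    return np.tanh(alpha * w)

def fake_quantize_fxp(w, n_bits=8, frac_bits=7):
    # Quantize-dequantize onto a signed Qm.f fixed-point grid (e.g. Q0.7 for
    # 8 bits). In QAT the forward pass sees the rounded weights; gradients are
    # typically passed through unchanged (straight-through estimator).
    scale = 2 ** frac_bits                       # grid step is 1 / 2**frac_bits
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale

def grid_penalty(w, frac_bits=7):
    # Illustrative periodic regularizer that is zero whenever a weight sits
    # exactly on the fixed-point grid and positive in between, so minimizing
    # it pulls weights toward representable values. The paper's absolute
    # cosine regularization has the same intent, but its exact form may differ.
    step = 1.0 / 2 ** frac_bits
    return float(np.mean(np.abs(np.sin(np.pi * w / step))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.5, size=1000)         # stand-in for conv weights
    w_sq = squash_weights(w)
    w_q = fake_quantize_fxp(w_sq, n_bits=4, frac_bits=3)
    print("max |quantization error|:", np.max(np.abs(w_sq - w_q)))
    print("grid penalty before/after:",
          grid_penalty(w_sq, frac_bits=3), grid_penalty(w_q, frac_bits=3))
```

Per the abstract, the same treatment is extended to transient variables so that inference can stay in integer arithmetic, avoiding q-format renormalization and permitting low-bit accumulators; that is where the reported 68% execution-time reduction comes from.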
Related papers
- P4Q: Learning to Prompt for Quantization in Visual-language Models [38.87018242616165]
We propose a method that balances fine-tuning and quantization, named "Prompt for Quantization" (P4Q).
Our method can effectively reduce the gap between image features and text features caused by low-bit quantization.
Our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4× while achieving 66.94% Top-1 accuracy.
arXiv Detail & Related papers (2024-09-26T08:31:27Z) - Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z) - Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models [5.69541128149828]
Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision respectively.
In this study, we propose a lightweight quantization-aware fine-tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight-quantized LLMs.
We show that ov-freeze results in near-floating-point performance, i.e., less than 0.7% accuracy loss on Commonsense Reasoning benchmarks.
arXiv Detail & Related papers (2024-03-26T23:51:44Z) - EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models [21.17675493267517]
Post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches to compress and accelerate diffusion models.
We introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency.
Our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency.
arXiv Detail & Related papers (2023-10-05T02:51:53Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
The original FLOP count can be reduced by up to 40% with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT).
Our method consistently outperforms state-of-the-art quantized BERT models under 2-, 3-, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but also incur huge computation costs.
We explore accelerating large-model inference via conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to eliminate floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit".
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.