Enhancing Quantised End-to-End ASR Models via Personalisation
- URL: http://arxiv.org/abs/2309.09136v1
- Date: Sun, 17 Sep 2023 02:35:21 GMT
- Title: Enhancing Quantised End-to-End ASR Models via Personalisation
- Authors: Qiuming Zhao and Guangzhi Sun and Chao Zhang and Mingxing Xu and
Thomas Fang Zheng
- Abstract summary: We propose a novel strategy of personalisation for a quantised model (PQM).
PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT.
Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora.
- Score: 12.971231464928806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent end-to-end automatic speech recognition (ASR) models have become
increasingly large, making them particularly challenging to deploy on
resource-constrained devices. Model quantisation is an effective solution, but it
sometimes causes the word error rate (WER) to increase. In this paper, a novel
strategy of personalisation for a quantised model (PQM) is proposed, which
combines speaker adaptive training (SAT) with model quantisation to improve the
performance of heavily compressed models. Specifically, PQM uses a 4-bit
NormalFloat Quantisation (NF4) approach for model quantisation and low-rank
adaptation (LoRA) for SAT. Experiments have been performed on the LibriSpeech
and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size and
1% additional speaker-specific parameters, 15.1% and 23.3% relative WER
reductions were achieved on quantised Whisper and Conformer-based
attention-based encoder-decoder ASR models respectively, compared to the
original full precision models.
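The two ingredients PQM combines can be sketched together: quantile-based NormalFloat-style quantisation of a frozen weight matrix, plus a low-rank LoRA adapter added on top for speaker adaptation. The following is a minimal illustrative sketch, not the paper's implementation: the code book here is derived from normal quantiles and omits details of the published NF4 format (such as its exact-zero level and block-wise scaling).

```python
import numpy as np
from statistics import NormalDist

def nf_levels(k=16):
    """Quantile-based code book for k-level NormalFloat-style quantisation.
    (Simplified: the real NF4 code book also pins an exact zero.)"""
    nd = NormalDist()
    q = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(x) for x in q)
    return np.array([x / m for x in q])

def quantise(w, levels):
    """Absmax-scaled nearest-level quantisation: returns 4-bit codes and a scale."""
    scale = np.abs(w).max()
    codes = np.abs(w[..., None] / scale - levels).argmin(-1)
    return codes.astype(np.uint8), scale

def dequantise(codes, scale, levels):
    return levels[codes] * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)).astype(np.float32)   # a frozen full-precision weight
levels = nf_levels(16)
codes, scale = quantise(W, levels)
W_q = dequantise(codes, scale, levels)

# LoRA-style speaker adapter over the frozen quantised weight:
# only A and B (about 1% of the parameters at realistic sizes) are trained.
r, alpha = 2, 4
A = rng.normal(scale=0.01, size=(r, 8))          # trainable
B = np.zeros((8, r))                             # zero-init => adapter starts as a no-op
W_eff = W_q + (alpha / r) * (B @ A)
```

With B initialised to zero, the adapted weight equals the quantised weight at the start of speaker adaptive training, so adaptation begins from the compressed model's behaviour.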
Related papers
- PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models [8.99127212785609]
This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs)
Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs.
Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models.
arXiv Detail & Related papers (2024-09-20T20:52:56Z) - A Model for Every User and Budget: Label-Free and Personalized
Mixed-Precision Quantization [23.818922559567994]
We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain.
MyQASR generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning.
Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
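The mixed-precision idea can be illustrated with a hypothetical greedy sketch (not the actual myQASR algorithm): given per-layer sizes and some sensitivity proxy measured on unlabelled target-domain samples, demote the least sensitive layers to lower bit-widths until a memory budget is met, with no fine-tuning involved.

```python
def allocate_bits(sizes, sensitivity, budget_bits, choices=(8, 6, 4)):
    """Greedy mixed-precision allocation: start every layer at the highest
    precision, then repeatedly demote the least sensitive demotable layer
    one step until the total footprint (in bits) fits the budget."""
    bits = {name: choices[0] for name in sizes}

    def total():
        return sum(sizes[n] * bits[n] for n in sizes)

    # demote the most robust (lowest-sensitivity) layers first
    order = sorted(sizes, key=lambda n: sensitivity[n])
    while total() > budget_bits:
        for n in order:
            i = choices.index(bits[n])
            if i + 1 < len(choices):
                bits[n] = choices[i + 1]
                break
        else:
            break  # budget unreachable even at minimum precision
    return bits

# toy usage: the decoder is measured as less sensitive, so it is demoted first
sizes = {"enc": 1_000_000, "dec": 500_000}
sensitivity = {"enc": 0.9, "dec": 0.2}
plan = allocate_bits(sizes, sensitivity, budget_bits=9_000_000)
```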
arXiv Detail & Related papers (2023-07-24T10:03:28Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes [54.83802872236367]
We propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.
The proposed large-medium model is 30% smaller and reduces power consumption by 33%, compared to the baseline cascaded encoder model.
The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss.
arXiv Detail & Related papers (2022-04-13T04:15:51Z) - 4-bit Conformer with Native Quantization Aware Training for Speech
Recognition [13.997832593421577]
We propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference.
We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique.
For the first time, the viability of 4-bit quantization is investigated and demonstrated on a practical ASR system trained with large-scale datasets.
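The core mechanism of quantization aware training is simulated ("fake") quantisation in the forward pass: weights are quantised and immediately dequantised, so the loss sees quantisation error while gradients flow through via the straight-through estimator. A minimal sketch of the forward-pass operator (an illustration, not the paper's native-integer implementation):

```python
import numpy as np

def fake_quant(w, bits=4):
    """Simulated quantisation used during QAT: quantise-then-dequantise so
    training sees the quantisation error. For 4 bits the symmetric grid is
    the 15 levels scale * {-7, ..., 7}."""
    n = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit symmetric
    scale = np.abs(w).max() / n
    return np.round(w / scale) * scale

w = np.array([0.5, -1.0, 0.25])
w_q = fake_quant(w, bits=4)            # snapped to multiples of 1/7
```

In a real QAT setup the backward pass treats `fake_quant` as the identity (the straight-through estimator), so the full-precision shadow weights keep receiving gradients.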
arXiv Detail & Related papers (2022-03-29T23:57:15Z) - A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z) - An Efficient Deep Learning Model for Automatic Modulation Recognition
Based on Parameter Estimation and Transformation [3.3941243094128035]
This letter proposes an efficient DL-AMR model based on phase parameter estimation and transformation.
Our model is more competitive in training time and test time than the benchmark models with similar recognition accuracy.
arXiv Detail & Related papers (2021-10-11T03:28:28Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
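Q-ASR itself is an integer-only end-to-end scheme; the sketch below shows only the basic symmetric weight-quantisation step, which needs no calibration data (the "zero-shot" aspect for weights) and already gives the 4x storage reduction from fp32 to int8.

```python
import numpy as np

def int8_quantise(w):
    """Symmetric per-tensor int8 quantisation. The scale is derived from the
    weights alone, so no labelled or unlabelled data is required."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = int8_quantise(w)
w_hat = q.astype(np.float32) * scale   # dequantised reconstruction
```

The reconstruction error is bounded by half a quantisation step (scale / 2), which is what keeps the WER degradation small in practice.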
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.