A Model for Every User and Budget: Label-Free and Personalized
Mixed-Precision Quantization
- URL: http://arxiv.org/abs/2307.12659v2
- Date: Sun, 11 Feb 2024 12:07:55 GMT
- Title: A Model for Every User and Budget: Label-Free and Personalized
Mixed-Precision Quantization
- Authors: Edward Fish, Umberto Michieli, Mete Ozay
- Abstract summary: We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain.
myQASR generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning.
Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
- Score: 23.818922559567994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Automatic Speech Recognition (ASR) have produced large
AI models, which become impractical for deployment on mobile devices. Model
quantization is effective at producing compressed general-purpose models; however,
such models may only be deployed to a restricted sub-domain of interest. We
show that ASR models can be personalized during quantization while relying on
just a small set of unlabelled samples from the target domain. To this end, we
propose myQASR, a mixed-precision quantization method that generates tailored
quantization schemes for diverse users under any memory requirement with no
fine-tuning. myQASR automatically evaluates the quantization sensitivity of
network layers by analysing the full-precision activation values. We are then
able to generate a personalised mixed-precision quantization scheme for any
pre-determined memory budget. Results for large-scale ASR models show how
myQASR improves performance for specific genders, languages, and speakers.
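To make the sensitivity-then-allocation idea concrete, here is a minimal, hypothetical PyTorch sketch (the function names, the mean-absolute-activation statistic, and the greedy allocator are assumptions for illustration, not the authors' implementation): it records a full-precision activation statistic per layer on a few unlabelled target-domain batches, then lowers the bit-width of the least sensitive layers until a memory budget is met.

```python
import torch

def activation_sensitivity(model, unlabelled_batches, layer_names):
    """Accumulate a simple per-layer activation statistic (mean absolute
    value) over a few unlabelled target-domain batches; a hypothetical
    proxy for quantization sensitivity."""
    stats = {name: 0.0 for name in layer_names}

    def make_hook(name):
        def hook(_module, _inputs, output):
            stats[name] += output.detach().abs().mean().item()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n in layer_names]
    with torch.no_grad():
        for batch in unlabelled_batches:   # no labels required
            model(batch)
    for h in handles:
        h.remove()
    return stats

def assign_bitwidths(stats, param_counts, budget_bytes, bits=(8, 6, 4, 2)):
    """Greedy mixed-precision allocation: the least sensitive layers are
    pushed to lower precision first until the memory budget is met."""
    order = sorted(stats, key=stats.get)          # least sensitive first
    scheme = {name: bits[0] for name in stats}

    def size(s):
        return sum(param_counts[n] * b // 8 for n, b in s.items())

    for name in order:
        for b in bits[1:]:
            if size(scheme) <= budget_bytes:
                return scheme
            scheme[name] = b
    return scheme
```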
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce UNCURL, an adaptive task-aware pruning technique that reduces the number of experts per MoE layer in an offline manner post-training.
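As a rough, hypothetical illustration of offline task-aware expert pruning (the names and the usage-count criterion are assumptions, not the UNCURL procedure): rank each layer's experts by how often the router selects them on task-specific data and keep only the most used ones.

```python
import torch

def prune_experts(router_logits_per_layer, keep_per_layer):
    """Offline, task-aware expert pruning sketch (illustrative names).
    router_logits_per_layer: list of [num_tokens, num_experts] tensors
    collected on task-specific data; keep_per_layer: experts to retain."""
    keep_indices = []
    for logits in router_logits_per_layer:
        # How often each expert wins the routing decision on this task.
        usage = torch.bincount(logits.argmax(dim=-1),
                               minlength=logits.shape[-1])
        keep_indices.append(usage.topk(keep_per_layer).indices.sort().values)
    return keep_indices  # per-layer experts to keep; drop the rest offline
```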
arXiv Detail & Related papers (2024-09-02T22:35:03Z) - Enhancing Quantised End-to-End ASR Models via Personalisation [12.971231464928806]
We propose a novel strategy of personalisation for a quantised model (PQM).
PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for speaker adaptive training (SAT).
Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora.
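A minimal sketch of the two ingredients named above, NF4 quantisation and LoRA adapters, using the Hugging Face transformers and peft libraries; the checkpoint name and LoRA hyper-parameters are placeholders, not the PQM setup.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# NF4 quantisation of the backbone (placeholder checkpoint name).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small", quantization_config=bnb_config
)

# Low-rank adapters for speaker adaptive training (illustrative settings).
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```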
arXiv Detail & Related papers (2023-09-17T02:35:21Z) - Precision-Recall Divergence Optimization for Generative Modeling with
GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
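For reference, the generic $f$-divergence that the PR-divergences instantiate (the specific generator $f$ attached to a given precision-recall trade-off is defined in the paper) is

```latex
D_f(P \,\|\, Q) = \int f\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}Q,
\qquad f \ \text{convex},\ f(1) = 0.
```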
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
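A toy PyTorch sketch of module-wise mixed-precision fake quantization (illustrative only; the modules, bit choices, and straight-through estimator are assumptions, not the MQAT procedure):

```python
import torch
import torch.nn as nn

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

class ModuleWiseQuantWrapper(nn.Module):
    """Wraps a linear layer and quantizes its weight to a module-specific
    bit-width on every forward pass (straight-through-estimator sketch)."""
    def __init__(self, layer, bits):
        super().__init__()
        self.layer, self.bits = layer, bits

    def forward(self, x):
        w = self.layer.weight
        w_q = w + (fake_quantize(w, self.bits) - w).detach()  # STE
        return nn.functional.linear(x, w_q, self.layer.bias)

# Illustrative per-module bit assignment, e.g. backbone 8-bit, head 4-bit.
backbone = ModuleWiseQuantWrapper(nn.Linear(128, 128), bits=8)
head = ModuleWiseQuantWrapper(nn.Linear(128, 10), bits=4)
```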
arXiv Detail & Related papers (2023-03-12T21:01:54Z) - Distributional Learning of Variational AutoEncoder: Application to
Synthetic Data Generation [0.7614628596146602]
We propose a new approach that expands the model capacity without sacrificing the computational advantages of the VAE framework.
Our VAE model's decoder is composed of an infinite mixture of asymmetric Laplace distributions.
We apply the proposed model to synthetic data generation, and particularly, our model demonstrates superiority in easily adjusting the level of data privacy.
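As a small sketch of the building block mentioned above, the negative log-likelihood of a single asymmetric Laplace component (standard form; the paper's infinite-mixture decoder composes many of these) can be written as:

```python
import torch

def asymmetric_laplace_nll(y, mu, sigma, tau):
    """Negative log-likelihood of an asymmetric Laplace distribution with
    location mu, scale sigma (tensors) and asymmetry tau in (0, 1); the
    quantile (check) loss rho_tau appears in the exponent of the density."""
    tau = torch.as_tensor(tau)
    u = (y - mu) / sigma
    check = torch.where(u >= 0, tau * u, (tau - 1.0) * u)   # rho_tau(u)
    return torch.log(sigma) - torch.log(tau * (1.0 - tau)) + check
```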
arXiv Detail & Related papers (2023-02-22T11:26:50Z) - Vertical Layering of Quantized Neural Networks for Heterogeneous
Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
In theory, we can obtain a network of any precision for on-demand service while only needing to train and maintain one model.
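The encapsulation idea can be sketched as follows (a hypothetical nesting scheme, not the paper's exact representation): store the weights once at the highest precision and serve lower-precision models by keeping only the most significant bits.

```python
import torch

def slice_precision(w_int8, scale, bits):
    """Derive a lower-precision weight tensor from a stored 8-bit one by
    keeping the `bits` most significant bits (illustrative nesting scheme)."""
    shift = 8 - bits
    w_low = torch.div(w_int8, 2 ** shift, rounding_mode="floor")  # drop low bits
    return w_low * (scale * (2 ** shift))   # dequantize at the coarser step

# One stored 8-bit model can then serve, e.g., 8-, 6- and 4-bit requests on demand.
```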
arXiv Detail & Related papers (2022-12-10T15:57:38Z) - AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing latency, power consumption, and model size without significantly harming performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
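A rough sketch of subgroup-wise quantization with pruning (the group sizes, bit choices, and the 0-bit-means-pruned convention are assumptions for illustration, not the framework's search procedure):

```python
import torch

def subgroup_quantize(weight, group_bits):
    """Quantize each contiguous group of rows to its own bit-width;
    a group assigned 0 bits is pruned (zeroed). Illustrative sketch."""
    groups = weight.chunk(len(group_bits), dim=0)
    out = []
    for g, bits in zip(groups, group_bits):
        if bits == 0:
            out.append(torch.zeros_like(g))          # pruned subgroup
            continue
        qmax = 2 ** (bits - 1) - 1
        scale = g.abs().max().clamp(min=1e-8) / qmax
        out.append((g / scale).round().clamp(-qmax, qmax) * scale)
    return torch.cat(out, dim=0)

w = torch.randn(12, 64)                      # e.g. one row group per attention head
w_q = subgroup_quantize(w, group_bits=[8, 4, 0, 4])
```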
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating
and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
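A simplified sample-level precision/recall check in this spirit, based on a common k-nearest-neighbour radius rule (a stand-in for illustration, not the paper's exact fidelity/diversity/generalization metric):

```python
import torch

def knn_precision_recall(real, fake, k=5):
    """Fraction of fake samples inside the k-NN radius of some real sample
    (precision) and vice versa (recall). Simplified sketch only."""
    def covered(a, b):
        # Radius of each point in `a`: distance to its k-th nearest neighbour in `a`.
        d_aa = torch.cdist(a, a)
        radius = d_aa.topk(k + 1, largest=False).values[:, -1]   # skip self (d = 0)
        d_ba = torch.cdist(b, a)
        return (d_ba <= radius.unsqueeze(0)).any(dim=1).float().mean().item()
    return covered(real, fake), covered(fake, real)   # precision, recall

precision, recall = knn_precision_recall(torch.randn(500, 16), torch.randn(400, 16))
```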
arXiv Detail & Related papers (2021-02-17T18:25:30Z) - Generative Design of Hardware-aware DNNs [6.144349819246314]
We propose a new approach to autonomous quantization and hardware-aware tuning.
A generative model, AQGAN, takes a target accuracy as the condition and generates a suite of quantization configurations.
We evaluate our model on five of the widely-used efficient models on the ImageNet dataset.
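The conditioning mechanism can be sketched with a toy conditional generator (entirely illustrative; the network size and per-layer bit-width encoding are assumptions):

```python
import torch
import torch.nn as nn

class ConditionalConfigGenerator(nn.Module):
    """Toy conditional generator: noise + target accuracy -> per-layer
    bit-width logits (illustrative stand-in for a GAN generator)."""
    def __init__(self, num_layers=20, bit_choices=4, noise_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, num_layers * bit_choices),
        )
        self.num_layers, self.bit_choices = num_layers, bit_choices

    def forward(self, z, target_acc):
        logits = self.net(torch.cat([z, target_acc], dim=-1))
        return logits.view(-1, self.num_layers, self.bit_choices)

gen = ConditionalConfigGenerator()
configs = gen(torch.randn(8, 32), torch.full((8, 1), 0.75)).argmax(dim=-1)
```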
arXiv Detail & Related papers (2020-06-06T20:39:25Z) - Regularized Autoencoders via Relaxed Injective Probability Flow [35.39933775720789]
Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference.
We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity.
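For context, the change-of-variables formula for an injective map, which such models build on (a standard result that the paper's relaxation modifies), gives the density on the image of $g$ as

```latex
p_X\big(g(z)\big) = p_Z(z)\,\det\!\big(J_g(z)^{\top} J_g(z)\big)^{-1/2},
\qquad g:\mathbb{R}^{d}\to\mathbb{R}^{D},\ d \le D,\ g \ \text{injective}.
```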
arXiv Detail & Related papers (2020-02-20T18:22:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences arising from its use.