Learning a Dual-Mode Speech Recognition Model via Self-Pruning
- URL: http://arxiv.org/abs/2207.11906v1
- Date: Mon, 25 Jul 2022 05:03:13 GMT
- Title: Learning a Dual-Mode Speech Recognition Model via Self-Pruning
- Authors: Chunxi Liu, Yuan Shangguan, Haichuan Yang, Yangyang Shi, Raghuraman
Krishnamoorthi, Ozlem Kalinli
- Abstract summary: This work aims to jointly learn a compact sparse on-device streaming ASR model, and a large dense server non-streaming model, in a single supernet.
We show that performing supernet training on both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only substantially improves the large non-streaming model, as shown in prior work, but also improves the compact sparse streaming model.
- Score: 18.248552732790852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is growing interest in unifying the streaming and full-context
automatic speech recognition (ASR) networks into a single end-to-end ASR model
to simplify model training and deployment for both use cases. In real-world ASR
applications, however, streaming ASR models typically operate under tighter
storage and computational constraints - e.g., on embedded devices - than
server-side full-context models. Motivated by the recent progress in
Omni-sparsity supernet training, where multiple subnetworks are jointly
optimized in one single model, this work aims to jointly learn a compact sparse
on-device streaming ASR model, and a large dense server non-streaming model, in
a single supernet. We further show that performing supernet training on both
wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only
substantially improves the large non-streaming model, as shown in prior work,
but also improves the compact sparse streaming model.
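As a concrete illustration of the training recipe the abstract describes, the sketch below runs one weight-shared encoder twice per batch: once as the large dense full-context (server) model, and once as the compact streaming model whose self-attention is restricted to past frames and whose weight matrices are temporarily zeroed by a magnitude-based pruning mask. This is only a minimal sketch under assumed PyTorch components; the class, function, and hyperparameter names are hypothetical, a frame-level cross-entropy stands in for the actual ASR loss, and the real system additionally uses wav2vec 2.0 pre-training and Omni-sparsity supernet training.

```python
# Minimal sketch of dual-mode supernet training with self-pruning.
# Everything here is an illustrative assumption, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoder(nn.Module):
    """One weight-shared encoder serving both streaming and full-context modes."""

    def __init__(self, dim=256, n_heads=4, n_layers=2, vocab=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, feats, causal=False):
        attn_mask = None
        if causal:
            # Streaming mode: each frame may only attend to past frames.
            t = feats.size(1)
            attn_mask = torch.triu(
                torch.ones(t, t, dtype=torch.bool, device=feats.device), diagonal=1
            )
        return self.proj(self.encoder(feats, mask=attn_mask))


@torch.no_grad()
def magnitude_masks(model, sparsity=0.5):
    """Self-pruning step: keep only the largest-magnitude weights per matrix."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() == 2:  # prune weight matrices, leave biases and norms dense
            k = max(1, int(p.numel() * sparsity))
            thresh = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > thresh).float()
    return masks


def dual_mode_step(model, feats, targets, optimizer, sparsity=0.5):
    """One joint update: dense full-context loss plus sparse streaming loss."""
    optimizer.zero_grad()

    # 1) Dense, full-context (server-side) pass and backward.
    loss_dense = F.cross_entropy(model(feats, causal=False).transpose(1, 2), targets)
    loss_dense.backward()

    # 2) Sparse, causal (on-device streaming) pass on temporarily masked weights.
    masks = magnitude_masks(model, sparsity)
    backups = {n: p.data.clone() for n, p in model.named_parameters() if n in masks}
    for n, p in model.named_parameters():
        if n in masks:
            p.data.mul_(masks[n])
    loss_sparse = F.cross_entropy(model(feats, causal=True).transpose(1, 2), targets)
    loss_sparse.backward()

    # 3) Restore the dense weights; gradients from both passes update the supernet.
    for n, p in model.named_parameters():
        if n in masks:
            p.data.copy_(backups[n])
    optimizer.step()
    return loss_dense.item(), loss_sparse.item()


if __name__ == "__main__":
    model = SharedEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    feats = torch.randn(2, 50, 256)          # (batch, frames, feature_dim)
    targets = torch.randint(0, 32, (2, 50))  # frame-level pseudo-labels
    print(dual_mode_step(model, feats, targets, opt))
```

Gradients from both passes accumulate into the same shared parameters before a single optimizer step, which is the sense in which the compact sparse streaming subnetwork and the large dense non-streaming model are jointly optimized in one supernet.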
Related papers
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and
Performance Optimization [54.113083217869516]
In this work, we first identify the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression
For On-device ASR Models [30.758876520227666]
TODM is a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job.
We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet.
Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, with up to 3% relative improvement in word error rate (WER).
arXiv Detail & Related papers (2023-09-05T04:47:55Z) - A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z) - Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition [7.450574974954803]
Limited memory bandwidth on smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to yield smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
arXiv Detail & Related papers (2022-10-01T02:23:00Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Self-Attention for Audio Super-Resolution [0.0]
We propose a network architecture for audio super-resolution that combines convolution and self-attention.
Attention-based Feature-Wise Linear Modulation (AFiLM) uses a self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model.
arXiv Detail & Related papers (2021-08-26T08:05:07Z) - Dual-mode ASR: Unify and Improve Streaming ASR with Full-context
Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.