A Unified Speaker Adaptation Approach for ASR
- URL: http://arxiv.org/abs/2110.08545v1
- Date: Sat, 16 Oct 2021 10:48:52 GMT
- Title: A Unified Speaker Adaptation Approach for ASR
- Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong
Chng, Bin Ma
- Abstract summary: We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
- Score: 37.76683818356052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have been used successfully in automatic speech
recognition (ASR), yielding state-of-the-art results. However, their
performance is still affected by speaker mismatch between training and test
data. Further finetuning a trained model with target speaker data is the most
natural approach to adaptation, but it requires substantial compute and may
cause catastrophic forgetting of the existing speakers. In this work, we propose a
unified speaker adaptation approach consisting of feature adaptation and model
adaptation. For feature adaptation, we employ a speaker-aware persistent memory
model which generalizes better to unseen test speakers by making use of speaker
i-vectors to form a persistent memory. For model adaptation, we use a novel
gradual pruning method to adapt to target speakers without changing the model
architecture, which, to the best of our knowledge, has never been explored in
ASR. Specifically, we gradually prune the less-contributing parameters of the
model encoder to a certain sparsity level and use the pruned parameters for
adaptation, while freezing the unpruned parameters to preserve the original
model's performance. We conduct experiments on the LibriSpeech dataset. Our proposed
approach brings a relative 2.74-6.52% word error rate (WER) reduction on
general speaker adaptation. On target speaker adaptation, our method
outperforms the baseline with up to 20.58% relative WER reduction and surpasses
the finetuning method by up to a relative 2.54%. Moreover, with extremely
low-resource adaptation data (e.g., a single utterance), our method improves
the WER by a relative 6.53% with only a few epochs of training.
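The model-adaptation recipe in the abstract (prune the least-contributing encoder parameters to a target sparsity, then update only those while freezing the rest) can be sketched minimally as follows. The magnitude criterion and the cubic sparsity schedule are standard assumptions from the gradual-pruning literature, used here only for illustration; the paper's actual contribution criterion may differ:

```python
import numpy as np

def cubic_sparsity(step, total_steps, final_sparsity):
    """Gradual schedule: sparsity ramps from 0 to final_sparsity along a
    cubic curve, a common choice in the gradual-pruning literature."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

def magnitude_prune_mask(w, sparsity):
    """Boolean mask over the smallest-magnitude entries of w: these are the
    'pruned' parameters, freed up to store speaker-specific updates."""
    k = int(sparsity * w.size)
    if k == 0:
        return np.zeros(w.shape, dtype=bool)
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) <= threshold

def adaptation_step(w, grad, mask, lr=0.1):
    """SGD step restricted to the pruned parameters; the unpruned ones are
    frozen, so the original model's behavior is preserved."""
    return w - lr * grad * mask
```

Because adaptation lives entirely in the overwritten pruned slots, the adapted model keeps the original architecture, and reverting to the source model only requires restoring those entries.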
Related papers
- ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for
Low-Resource TTS Adaptation [18.84413550077318]
We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers.
Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests.
This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
arXiv Detail & Related papers (2023-05-29T11:39:01Z)
- Differentially Private Adapters for Parameter Efficient Acoustic Modeling [24.72748979633543]
We introduce a noisy teacher-student ensemble into a conventional adaptation scheme.
We insert residual adapters between layers of the frozen pre-trained acoustic model.
Our solution reduces the number of trainable parameters by 97.5% using the residual adapters (RAs).
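A residual adapter in the sense used above is a small trainable bottleneck inserted between frozen layers of the pre-trained model. A minimal sketch, assuming the common bottleneck-with-ReLU form (the paper's exact adapter layout may differ):

```python
import numpy as np

def residual_adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, apply a nonlinearity, up-project,
    then add the input back, so the frozen layer's output passes through
    unchanged when the adapter is zero-initialized."""
    h = np.maximum(x @ w_down, 0.0)  # ReLU bottleneck
    return x + h @ w_up
```

Initializing `w_up` to zeros makes the adapter an exact identity at the start of training, so inserting it cannot degrade the frozen model before adaptation begins.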
arXiv Detail & Related papers (2023-05-19T00:36:43Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically found that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation method, where the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
- Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech [5.960279280033886]
We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to model fine-tuning.
We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
arXiv Detail & Related papers (2021-09-14T20:04:47Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.