Factorised Speaker-environment Adaptive Training of Conformer Speech
Recognition Systems
- URL: http://arxiv.org/abs/2306.14608v1
- Date: Mon, 26 Jun 2023 11:32:05 GMT
- Title: Factorised Speaker-environment Adaptive Training of Conformer Speech
Recognition Systems
- Authors: Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi
Wang, Shujie Hu, Mengzhe Geng, Xunying Liu
- Abstract summary: This paper proposes a novel factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models.
Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline.
Further analysis shows the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions.
- Score: 31.813788489512394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rich sources of variability in natural speech present significant challenges
to current data intensive speech recognition technologies. To model both
speaker and environment level diversity, this paper proposes a novel Bayesian
factorised speaker-environment adaptive training and test time adaptation
approach for Conformer ASR models. Speaker and environment level
characteristics are separately modeled using compact hidden output transforms,
which are then linearly or hierarchically combined to represent any
speaker-environment combination. Bayesian learning is further utilized to model
the adaptation parameter uncertainty. Experiments on the 300-hr WHAM noise
corrupted Switchboard data suggest that factorised adaptation consistently
outperforms the baseline and speaker label only adapted Conformers by up to
3.1% absolute (10.4% relative) word error rate reductions. Further analysis
shows the proposed method offers potential for rapid adaptation to unseen
speaker-environment conditions.
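The linear and hierarchical combination of speaker- and environment-level hidden output transforms described in the abstract can be sketched as follows. This is a minimal NumPy illustration only: the function names, shapes, and the interpolation weight `lam` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def linear_combine(h, A_spk, A_env, lam=0.5):
    """Linearly interpolate the speaker- and environment-level
    transforms, then apply the result to hidden outputs h."""
    A = lam * A_spk + (1.0 - lam) * A_env
    return h @ A

def hierarchical_combine(h, A_spk, A_env):
    """Apply the environment transform first, then the speaker
    transform on top (one possible hierarchical ordering)."""
    return (h @ A_env) @ A_spk

# Toy example: hidden dimension 4, one frame.
rng = np.random.default_rng(0)
h = rng.standard_normal((1, 4))
# Compact transforms, initialised near the identity.
A_spk = np.eye(4) + 0.01 * rng.standard_normal((4, 4))
A_env = np.eye(4) + 0.01 * rng.standard_normal((4, 4))

h_lin = linear_combine(h, A_spk, A_env)
h_hier = hierarchical_combine(h, A_spk, A_env)
```

Because any speaker transform can be paired with any environment transform, a system adapted this way needs only S + E compact transforms to cover all S x E speaker-environment combinations, rather than one transform per combination.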
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Hypernetworks for Personalizing ASR to Atypical Speech [7.486694572792521]
We propose a novel use of a meta-learned hypernetwork to generate highly individualized, utterance-level adaptations on-the-fly for a diverse set of atypical speech characteristics.
We show that hypernetworks generalize better to out-of-distribution speakers, while maintaining an overall relative WER reduction of 75.2% using 0.1% of the full parameter budget.
arXiv Detail & Related papers (2024-06-06T16:39:00Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech [37.6839508524855]
Adapting a speech emotion recognition system to a particular speaker is a hard problem, especially with deep neural networks (DNNs).
This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set.
We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches.
arXiv Detail & Related papers (2022-01-19T22:14:49Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the discrete symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
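The Bayesian treatment of adaptation-parameter uncertainty in this last entry (and in the main paper above) can be sketched, under a simple diagonal-Gaussian posterior assumption, as Monte Carlo averaging of predictions over sampled adaptation parameters. The names and shapes below are illustrative assumptions, not from either paper.

```python
import numpy as np

def bayesian_adapted_forward(h, mu, log_sigma, n_samples=8, rng=None):
    """Approximate the predictive output by averaging over adaptation
    transforms sampled from a diagonal Gaussian posterior N(mu, sigma^2),
    instead of committing to a single point estimate of the parameters."""
    rng = rng or np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    outputs = []
    for _ in range(n_samples):
        A = mu + sigma * rng.standard_normal(mu.shape)  # one posterior sample
        outputs.append(h @ A)
    return np.mean(outputs, axis=0)

# Toy example: a near-zero posterior variance recovers the point estimate.
h = np.ones((1, 3))
mu = np.eye(3)
out = bayesian_adapted_forward(h, mu, log_sigma=np.full((3, 3), -20.0))
```

Averaging over samples in this way makes adaptation more robust when only a small amount of target speaker or environment data is available, since the posterior variance captures how little the data constrains each parameter.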
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.