Senone-aware Adversarial Multi-task Training for Unsupervised Child to
Adult Speech Adaptation
- URL: http://arxiv.org/abs/2102.11488v1
- Date: Tue, 23 Feb 2021 04:49:27 GMT
- Title: Senone-aware Adversarial Multi-task Training for Unsupervised Child to
Adult Speech Adaptation
- Authors: Richeng Duan, Nancy F. Chen
- Abstract summary: We propose a feature adaptation approach to minimize acoustic mismatch at the senone (tied triphone states) level between adult and child speech.
We validate the proposed method on three tasks: child speech recognition, child pronunciation assessment, and child fluency score prediction.
- Score: 26.065719754453823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic modeling for child speech is challenging due to the high acoustic
variability caused by physiological differences in the vocal tract. The dearth
of publicly available datasets makes the task more challenging. In this work,
we propose a feature adaptation approach by exploiting adversarial multi-task
training to minimize acoustic mismatch at the senone (tied triphone states)
level between adult and child speech and leverage large amounts of transcribed
adult speech. We validate the proposed method on three tasks: child speech
recognition, child pronunciation assessment, and child fluency score
prediction. Empirical results indicate that our proposed approach consistently
outperforms competitive baselines, achieving 7.7% relative error reduction on
speech recognition and up to 25.2% relative gains on the evaluation tasks.
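The recipe above is a form of adversarial multi-task training: a shared encoder feeds both a senone classifier (trained on transcribed adult speech) and a domain discriminator whose gradients are reversed, pushing the encoder toward features that are senone-discriminative yet adult/child-invariant. Below is a minimal PyTorch sketch of that pattern, not the paper's exact configuration; layer sizes, the senone inventory, and the reversal weight lambda_ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

class SenoneAdversarialModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_senones=3000, lambda_=0.5):
        super().__init__()
        self.lambda_ = lambda_
        # Shared encoder producing the adapted representation.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Primary head: senone (tied triphone state) classification.
        self.senone_head = nn.Linear(hidden, num_senones)
        # Adversarial head: adult-vs-child domain discrimination.
        self.domain_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2),
        )

    def forward(self, feats):
        h = self.encoder(feats)
        senone_logits = self.senone_head(h)
        domain_logits = self.domain_head(GradReverse.apply(h, self.lambda_))
        return senone_logits, domain_logits

# One toy training step; senone labels are only available for adult frames.
model = SenoneAdversarialModel()
ce = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(8, 40)                        # acoustic frames
domains = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = adult, 1 = child
senones = torch.randint(0, 3000, (8,))            # senone targets (adult side)

senone_logits, domain_logits = model(feats)
adult = domains == 0
loss = ce(senone_logits[adult], senones[adult]) + ce(domain_logits, domains)
opt.zero_grad()
loss.backward()
opt.step()
```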
Related papers
- Evaluation of state-of-the-art ASR Models in Child-Adult Interactions [27.30130353688078]
Speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting.
We employ LoRA on the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting; a minimal LoRA sketch follows this entry.
arXiv Detail & Related papers (2024-09-24T14:42:37Z)
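For reference, one common way to realize the LoRA fine-tuning described in this entry is via HuggingFace peft on a Whisper checkpoint. The sketch below is a hedged illustration: the rank, target modules, and checkpoint name are typical defaults, not necessarily the paper's reported settings.

```python
# pip install transformers peft
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Attach low-rank adapters to the attention projections; the base weights
# stay frozen, so only a small fraction of parameters is trained.
config = LoraConfig(
    r=8,                      # adapter rank (assumed; tune to the data budget)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically around 1% of the full model

# From here, `model` trains with a standard seq2seq fine-tuning loop
# (e.g., transformers' Seq2SeqTrainer) on child-speech (audio, text) pairs.
```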
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset; a generic sketch of the frozen-encoder-plus-separator pattern follows this entry.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
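The pattern in this entry, a frozen foundation model with a lightweight separator applied to its encoder states, can be illustrated generically as below. This is a toy mask-estimator stand-in, not the Sidecar's actual architecture or insertion point, and the smaller whisper-base checkpoint is used only to keep the example light.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class ToySeparator(nn.Module):
    """Estimates one soft mask per talker over encoder hidden states."""
    def __init__(self, dim, num_talkers=2):
        super().__init__()
        self.num_talkers = num_talkers
        self.mask_net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim * num_talkers, kernel_size=1),
        )

    def forward(self, h):                        # h: (batch, time, dim)
        masks = self.mask_net(h.transpose(1, 2)).sigmoid()
        masks = masks.view(h.size(0), self.num_talkers, -1, masks.size(-1))
        # One masked copy of the encoder embedding per talker.
        return [h * masks[:, k].transpose(1, 2) for k in range(self.num_talkers)]

whisper = WhisperModel.from_pretrained("openai/whisper-base")
for p in whisper.parameters():                   # the foundation model is frozen
    p.requires_grad = False

separator = ToySeparator(dim=whisper.config.d_model)
feats = torch.randn(1, 80, 3000)                 # 30 s of log-mel input
enc = whisper.encoder(feats).last_hidden_state   # (1, 1500, d_model)
streams = separator(enc)                         # one embedding stream per talker
```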
- Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions [28.5211771482547]
We show that exemplary speech foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively.
Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
arXiv Detail & Related papers (2024-06-12T05:41:01Z)
- Use of Speech Impairment Severity for Dysarthric Speech Recognition [37.93801885333925]
This paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition.
Experiments conducted on UASpeech suggest that incorporating speech impairment severity benefits state-of-the-art hybrid DNN, E2E Conformer, and pre-trained Wav2vec 2.0 ASR systems.
arXiv Detail & Related papers (2023-05-18T02:42:59Z)
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv Detail & Related papers (2022-11-14T22:03:36Z)
- Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning [19.09026965041249]
A speaker- and age-invariant training approach based on adversarial multi-task learning is proposed.
The system was applied to the OGI speech corpora and achieved a 13% reduction in ASR word error rate (WER).
arXiv Detail & Related papers (2022-10-19T01:17:40Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction; a minimal speed-perturbation sketch follows this entry.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
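Speed perturbation, the best-performing augmentation in the entry above, is easy to reproduce with torchaudio's sox bindings: each utterance is rendered at a few rate factors, so one recording yields several tempo/pitch variants. A minimal sketch (the file path is a placeholder):

```python
import torchaudio

wav, sr = torchaudio.load("utt.wav")  # placeholder path

# Classic 3-way speed perturbation (0.9x / 1.0x / 1.1x). The sox `speed`
# effect changes tempo and pitch together; `rate` restores the original
# sample rate so downstream feature extraction is unchanged.
augmented = []
for factor in ("0.9", "1.0", "1.1"):
    aug, _ = torchaudio.sox_effects.apply_effects_tensor(
        wav, sr, effects=[["speed", factor], ["rate", str(sr)]]
    )
    augmented.append(aug)
```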
- A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion [50.040466658605524]
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC).
The poor quality of dysarthric speech can be greatly improved by statistical VC.
But since the normal speech utterances of a patient with dysarthria are nearly impossible to collect, previous work failed to recover the individuality of the patient.
arXiv Detail & Related papers (2021-06-02T18:41:03Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech; a toy sketch of these two modules follows this entry.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
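The two-module design in this entry, a prosody corrector plus a conversion model, maps naturally onto two small networks: one predicts typical duration and pitch from phoneme embeddings, and the other consumes the embeddings together with those prosody features to produce acoustic frames. A toy PyTorch sketch follows; all dimensions and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    """Infers typical per-phoneme duration and pitch from phoneme embeddings."""
    def __init__(self, emb_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.duration = nn.Linear(hidden, 1)  # typical frames per phoneme
        self.pitch = nn.Linear(hidden, 1)     # typical pitch (e.g., log-F0)

    def forward(self, phoneme_emb):            # (batch, phones, emb_dim)
        h = self.net(phoneme_emb)
        return self.duration(h), self.pitch(h)

class ConversionModel(nn.Module):
    """Generates acoustic frames from phoneme embeddings + typical prosody."""
    def __init__(self, emb_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(emb_dim + 2, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, phoneme_emb, duration, pitch):
        x = torch.cat([phoneme_emb, duration, pitch], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                     # (batch, phones, n_mels)

phoneme_emb = torch.randn(1, 12, 256)          # a 12-phoneme toy utterance
corrector, converter = ProsodyCorrector(), ConversionModel()
dur, f0 = corrector(phoneme_emb)
mel = converter(phoneme_emb, dur, f0)          # converted-speech features
```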
This list is automatically generated from the titles and abstracts of the papers on this site.