Speaker- and Age-Invariant Training for Child Acoustic Modeling Using
Adversarial Multi-Task Learning
- URL: http://arxiv.org/abs/2210.10231v1
- Date: Wed, 19 Oct 2022 01:17:40 GMT
- Title: Speaker- and Age-Invariant Training for Child Acoustic Modeling Using
Adversarial Multi-Task Learning
- Authors: Mostafa Shahin, Beena Ahmed, and Julien Epps
- Abstract summary: A speaker- and age-invariant training approach based on adversarial multi-task learning is proposed.
Applied to the OGI speech corpora, the system achieved a 13% reduction in ASR word error rate (WER).
- Score: 19.09026965041249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major challenges in acoustic modeling of child speech is the
rapid change in children's articulators as they grow, their differing growth
rates, and the resulting high acoustic variability within the same age group.
These acoustic variations, together with the scarcity of child speech
corpora, have impeded the development of a reliable speech recognition system
for children. In this paper, a speaker- and age-invariant training approach
based on adversarial multi-task learning is proposed. The system consists of
one generator shared network that learns to generate speaker- and age-invariant
features connected to three discrimination networks, for phoneme, age, and
speaker. The generator network is trained to minimize the
phoneme-discrimination loss and maximize the speaker- and age-discrimination
losses in an adversarial multi-task learning fashion. The generator network is
a Time Delay Neural Network (TDNN) architecture while the three discriminators
are feed-forward networks. Applied to the OGI speech corpora, the system
achieved a 13% reduction in ASR word error rate (WER).
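The generator's adversarial objective can be sketched in plain Python. This is a minimal illustration only, not the authors' implementation: it assumes a gradient-reversal-style mechanism for the min-max training, and the weighting factor `lam`, the function name, and the toy gradient values are all hypothetical.

```python
def generator_update(grad_phoneme, grad_speaker, grad_age, lam=0.5):
    """Combine per-task gradients w.r.t. the shared generator features.

    The generator descends the phoneme-discrimination loss but ascends the
    speaker- and age-discrimination losses; a gradient reversal flips the
    sign of the latter two, scaled by the (assumed) trade-off weight `lam`.
    """
    return [gp - lam * (gs + ga)
            for gp, gs, ga in zip(grad_phoneme, grad_speaker, grad_age)]

# Toy example: gradients of each head w.r.t. one 3-dim shared feature vector.
combined = generator_update([0.2, -0.1, 0.4], [0.6, 0.0, -0.2], [0.2, 0.4, 0.2])
```

Descending along `combined` improves phoneme discrimination while pushing the shared features toward being uninformative about speaker and age, which is the speaker- and age-invariance the paper targets.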
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z) - Leveraging Speaker Embeddings with Adversarial Multi-task Learning for
Age Group Classification [0.0]
We consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups.
Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios.
arXiv Detail & Related papers (2023-01-22T05:01:13Z) - Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural
Text-to-Speech [29.34041347120446]
A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech.
We propose a novel training algorithm for a multi-speaker neural text-to-speech (TTS) model based on multi-task adversarial training.
arXiv Detail & Related papers (2022-09-26T10:10:40Z) - Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z) - Speaker Identity Preservation in Dysarthric Speech Reconstruction by
Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA)
arXiv Detail & Related papers (2022-02-18T08:59:36Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z) - Senone-aware Adversarial Multi-task Training for Unsupervised Child to
Adult Speech Adaptation [26.065719754453823]
We propose a feature adaptation approach to minimize acoustic mismatch at the senone (tied triphone states) level between adult and child speech.
We validate the proposed method on three tasks: child speech recognition, child pronunciation assessment, and child fluency score prediction.
arXiv Detail & Related papers (2021-02-23T04:49:27Z) - Augmentation adversarial training for self-supervised speaker
recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.