Transfer Learning for Robust Low-Resource Children's Speech ASR with
Transformers and Source-Filter Warping
- URL: http://arxiv.org/abs/2206.09396v1
- Date: Sun, 19 Jun 2022 12:57:47 GMT
- Title: Transfer Learning for Robust Low-Resource Children's Speech ASR with
Transformers and Source-Filter Warping
- Authors: Jenthe Thienpondt and Kris Demuynck
- Abstract summary: We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
- Score: 11.584388304271029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) systems are known to exhibit difficulties
when transcribing children's speech. This can mainly be attributed to the
absence of large children's speech corpora to train robust ASR models and the
resulting domain mismatch when decoding children's speech with systems trained
on adult data. In this paper, we propose multiple enhancements to alleviate
these issues. First, we propose a data augmentation technique based on the
source-filter model of speech to close the domain gap between adult and
children's speech. This enables us to leverage the data availability of adult
speech corpora by making these samples perceptually similar to children's
speech. Second, using this augmentation strategy, we apply transfer learning on
a Transformer model pre-trained on adult data. This model follows the recently
introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several
cross-lingual adult speech corpora to learn general and robust acoustic
frame-level representations. Adopting this model for the ASR task using adult
data augmented with the proposed source-filter warping strategy and a limited
amount of in-domain children's speech significantly outperforms previous
state-of-the-art results on the PF-STAR British English Children's Speech
corpus with a 4.86% WER on the official test set.
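The source-filter warping idea can be illustrated with a deliberately crude sketch: warping the frequency axis of each frame's magnitude spectrum raises or lowers formant-like peaks, loosely mimicking the shorter vocal tract of a child. This is not the authors' exact method (which manipulates the source and filter components of speech separately); the function names `warp_magnitude` and `augment` are illustrative, not from the paper.

```python
import numpy as np

def warp_magnitude(mag, alpha):
    """Warp a magnitude spectrum along the frequency axis: a peak at
    bin k moves to roughly bin alpha * k, so alpha > 1 raises
    formant-like peaks (shorter, more child-like vocal tract)."""
    bins = np.arange(len(mag), dtype=float)
    return np.interp(bins / alpha, bins, mag, right=0.0)

def augment(signal, alpha, n_fft=1024, hop=256):
    """Frame-wise augmentation: warp each frame's magnitude spectrum,
    keep the original phase, and overlap-add the result."""
    window = np.hanning(n_fft)
    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        spec = np.fft.rfft(frame)
        warped = warp_magnitude(np.abs(spec), alpha)
        frame_out = np.fft.irfft(warped * np.exp(1j * np.angle(spec)))
        out[start:start + n_fft] += frame_out * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

With alpha slightly above 1, this pushes adult spectra toward child-like formant positions; the paper's actual source-filter warping additionally treats pitch (the source) independently of the spectral envelope (the filter), which simple axis warping cannot do.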
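For reference, the reported 4.86% figure is word error rate (WER): the word-level edit distance between the hypothesis and reference transcripts, normalized by the reference length. A minimal sketch (the function name `wer` is illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words, divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1,   # insertion
                           sub)                # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why low single-digit percentages on children's speech are a strong result.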
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Improving child speech recognition with augmented child-like speech [20.709414063132627]
State-of-the-art ASRs show suboptimal performance for child speech.
Cross-lingual child-to-child voice conversion significantly improved child ASR performance.
arXiv Detail & Related papers (2024-06-12T08:56:46Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv Detail & Related papers (2022-11-14T22:03:36Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
- Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech.
We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.