Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space
- URL: http://arxiv.org/abs/2409.12745v1
- Date: Thu, 19 Sep 2024 13:07:55 GMT
- Title: Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space
- Authors: Sebastião Quintas, Isabelle Ferrané, Thomas Pellegrini
- Abstract summary: We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features.
- Score: 10.875499903992782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of synthetic speech as data augmentation is gaining popularity in fields such as automatic speech recognition and speech classification. Although novel text-to-speech systems with voice cloning capabilities allow the use of a larger number of voices based on short audio segments, these systems are known to hallucinate and often produce bad data that will most likely have a negative impact on the downstream task. In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data, translating to better performance. Furthermore, despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguished when using self-supervised (WavLM) features, an aspect further explored with a CycleGAN to bridge the gap between the two types of speech material.
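The ASR-based filtering idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact acceptance criterion, so the version below assumes the simplest one: a synthetic clip is kept only if an ASR transcript of it matches the intended command after light text normalization. The names `normalize`, `filter_by_asr`, and the `transcribe` callable are hypothetical and stand in for whatever ASR system is actually used; the stub transcriber and file names are made up for illustration.

```python
import re
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    letters_only = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(letters_only.split())


def filter_by_asr(
    samples: List[Tuple[str, str]],
    transcribe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Keep (audio, label) pairs whose ASR transcript equals the label.

    `transcribe` is a hypothetical callable mapping an audio reference to
    text. Exact match after normalization is an assumed criterion, not
    necessarily the one used in the paper.
    """
    kept = []
    for audio, label in samples:
        if normalize(transcribe(audio)) == normalize(label):
            kept.append((audio, label))
    return kept


# Stub ASR outputs standing in for a real recognizer (illustrative only).
fake_transcripts = {
    "clip_001.wav": "Yes.",
    "clip_002.wav": "stomp",  # simulated TTS hallucination: rejected
    "clip_003.wav": " go ",
}
samples = [("clip_001.wav", "yes"), ("clip_002.wav", "stop"), ("clip_003.wav", "go")]
kept = filter_by_asr(samples, fake_transcripts.__getitem__)
print(kept)
```

Passing the recognizer in as a callable keeps the filter independent of any particular ASR toolkit; a softer criterion (for example an edit-distance threshold) could replace the equality check.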
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- Towards Selection of Text-to-speech Data to Augment ASR Training [20.115236045164355]
We train a neural network to measure the similarity of synthetic data to real speech.
We find that incorporating synthetic samples with considerable dissimilarity to real speech is crucial for boosting recognition performance.
arXiv Detail & Related papers (2023-05-30T17:24:28Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)