Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting
- URL: http://arxiv.org/abs/2408.10463v1
- Date: Tue, 20 Aug 2024 00:16:12 GMT
- Title: Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting
- Authors: Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang
- Abstract summary: The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations.
We propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data.
Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss.
- Score: 13.45344843458971
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and real negative speech data, without any real positive examples.
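To make the idea concrete, below is a minimal sketch of one common way to realize this kind of adversarial training: a DANN-style domain classifier placed behind a gradient reversal layer, so a shared encoder is rewarded for keyword accuracy but penalized whenever its features reveal whether an utterance is real or TTS. The abstract does not specify the exact mechanism; the names (GradReverse, AdversarialKWS), the layer sizes, and the weighting factor lam are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialKWS(nn.Module):
    # Hypothetical architecture for illustration only.
    def __init__(self, feat_dim=64, hidden=128, num_keywords=10, lam=0.1):
        super().__init__()
        self.lam = lam
        # Shared encoder: should be keyword-discriminative but
        # uninformative about the data source (real vs. TTS).
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.kws_head = nn.Linear(hidden, num_keywords)
        self.domain_head = nn.Linear(hidden, 2)  # real speech vs. TTS

    def forward(self, x):
        h = self.encoder(x)
        kws_logits = self.kws_head(h)
        # Gradient reversal: the domain head learns to tell real from TTS,
        # while the encoder is pushed to make that distinction impossible.
        domain_logits = self.domain_head(GradReverse.apply(h, self.lam))
        return kws_logits, domain_logits

def training_loss(model, feats, keyword_labels, domain_labels):
    kws_logits, domain_logits = model(feats)
    kws_loss = F.cross_entropy(kws_logits, keyword_labels)
    adv_loss = F.cross_entropy(domain_logits, domain_labels)
    return kws_loss + adv_loss  # trade-off handled by lam inside GradReverse
```
In this sketch, lam controls the trade-off between the two objectives: at lam = 0 the model reduces to plain KWS training, while larger values push the encoder harder toward source-invariant features.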
Related papers
- SpoofCeleb: Speech Deepfake Detection and SASV In The Wild [76.71096751337888]
SpoofCeleb is a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV).
We utilize source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.
SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions.
arXiv Detail & Related papers (2024-09-18T23:17:02Z) - Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting [18.456711824241978]
We propose datasource-aware disentangled learning with adversarial examples to improve KWS robustness.
Experimental results demonstrate that the proposed learning strategy improves the false reject rate by 40.31% at a 1% false accept rate.
Our best-performing system achieves 98.06% accuracy on the Google Speech Commands V1 dataset.
arXiv Detail & Related papers (2024-08-23T20:03:51Z) - Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model [13.45344843458971]
Keyword spotting models require a huge amount of training data to be accurate.
TTS models can generate large amounts of natural-sounding data, which can help reduce the cost and time of KWS model development.
We explore various strategies to mix TTS data and real human speech data, with a focus on minimizing real data use and maximizing diversity of TTS output.
arXiv Detail & Related papers (2024-07-26T17:24:50Z) - Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z) - Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining [18.19207291891767]
Keyword spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants.
KWS models typically rely on a large amount of labelled data, limiting their application to situations where such data is available.
Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data.
arXiv Detail & Related papers (2022-10-04T15:56:27Z) - Speech Augmentation Based Unsupervised Learning for Keyword Spotting [29.87252331166527]
We designed a CNN-Attention architecture to conduct the KWS task.
We also proposed an unsupervised learning method to improve the robustness of the KWS model.
In our experiments, with augmentation-based unsupervised learning, our KWS model achieves better performance than other unsupervised methods.
arXiv Detail & Related papers (2022-05-28T04:11:31Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z) - Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% (absolute) compared with i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)