Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model
- URL: http://arxiv.org/abs/2407.18879v1
- Date: Fri, 26 Jul 2024 17:24:50 GMT
- Title: Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model
- Authors: Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang
- Abstract summary: Keyword spotting models require a huge amount of training data to be accurate.
TTS models can generate large amounts of natural-sounding data, which can help reduce the cost and time of KWS model development.
We explore various strategies to mix TTS data and real human speech data, with a focus on minimizing real data use and maximizing diversity of TTS output.
- Score: 13.45344843458971
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores the use of TTS synthesized training data for the KWS (keyword spotting) task while minimizing development cost and time. Keyword spotting models require a huge amount of training data to be accurate, and obtaining such training data can be costly. In the current state of the art, TTS models can generate large amounts of natural-sounding data, which can help reduce the cost and time of KWS model development. Still, TTS-generated data can lack diversity compared to real data. To maximize KWS model accuracy under the constraint of limited resources and current TTS capability, we explored various strategies for mixing TTS data and real human speech data, with a focus on minimizing real data use and maximizing the diversity of TTS output. Our experimental results indicate that relatively small amounts of real audio data with speaker diversity (100 speakers, 2k utterances) combined with large amounts of TTS synthesized data can achieve reasonably high accuracy (within 3x the error rate of the baseline), compared to the baseline trained with 3.8M real positive utterances.
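As a rough illustration of the mixing strategies described in the abstract, the sketch below assembles each training epoch from a small real-speech pool (on the order of 100 speakers, 2k utterances) and a much larger TTS pool at a fixed ratio. The function name, epoch size, and 10% real fraction are illustrative assumptions, not the authors' actual configuration.

```python
import random

def build_mixed_epoch(real_utts, tts_utts, epoch_size=100_000,
                      real_fraction=0.1, seed=0):
    """Assemble one training epoch that mixes real and TTS utterances.

    real_utts / tts_utts: lists of (audio_path, label) pairs.
    real_fraction: share of each epoch drawn from the (small) real pool.
    The small real pool is sampled with replacement so it is revisited
    many times; the large TTS pool supplies the remainder.
    """
    rng = random.Random(seed)
    n_real = int(epoch_size * real_fraction)
    n_tts = epoch_size - n_real

    epoch = [rng.choice(real_utts) for _ in range(n_real)]  # real, with replacement
    epoch += [rng.choice(tts_utts) for _ in range(n_tts)]   # TTS, with replacement
    rng.shuffle(epoch)
    return epoch

# Example: epoch = build_mixed_epoch(real_pool, tts_pool, real_fraction=0.1)
```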
Related papers
- Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting [13.45344843458971]
The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations.
We propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data.
Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss.
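The summary above adds an adversarial loss alongside the usual KWS loss so the model does not latch onto TTS-specific artifacts. A common way to realize this idea is a gradient-reversal layer feeding a real-vs-TTS domain classifier; the PyTorch sketch below shows that pattern with illustrative module names and a hypothetical lambda weight, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialKWS(nn.Module):
    """Shared encoder with a keyword head and an adversarial real-vs-TTS head."""
    def __init__(self, encoder, kws_head, feat_dim, lambd=0.1):
        super().__init__()
        self.encoder = encoder                      # shared acoustic encoder
        self.kws_head = kws_head                    # keyword classifier head
        self.domain_head = nn.Linear(feat_dim, 2)   # real vs. TTS classifier
        self.lambd = lambd                          # adversarial gradient weight

    def forward(self, features):
        h = self.encoder(features)
        kws_logits = self.kws_head(h)
        # Gradient reversal: the domain head learns to distinguish real from
        # TTS audio, while the encoder is pushed to make them indistinguishable.
        domain_logits = self.domain_head(GradReverse.apply(h, self.lambd))
        return kws_logits, domain_logits

# Total loss (weights illustrative):
#   loss = cross_entropy(kws_logits, keyword_labels) \
#        + cross_entropy(domain_logits, is_tts_labels)
```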
arXiv Detail & Related papers (2024-08-20T00:16:12Z) - Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks.
This paper presents a technical report on continually pre-training Llama-3 (8B).
It significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
arXiv Detail & Related papers (2024-07-26T13:55:21Z) - Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments [8.103855990028842]
We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS.
We found that increasing TTS phrase diversity and utterance sampling monotonically improves model performance.
Our experiments are based on English and single-word utterances, but the findings generalize to i18n languages.
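To illustrate the phrase-diversity and utterance-sampling idea mentioned above, the sketch below wraps a generic synthesize(text, voice) call (a placeholder, not any specific TTS API) and generates several carrier phrases and voices per keyword. The phrase templates and counts are illustrative assumptions.

```python
import itertools
import random

def synthesize(text, voice):
    """Placeholder for a TTS call; returns synthesized audio for `text`."""
    raise NotImplementedError("plug in an actual TTS system here")

def generate_keyword_utterances(keyword, voices, n_per_phrase=5, seed=0):
    """Generate diverse TTS utterances containing `keyword`.

    Diversity comes from (a) varying the carrier phrase around the keyword
    and (b) sampling multiple voices per phrase. Templates are illustrative.
    """
    rng = random.Random(seed)
    templates = [
        "{kw}",
        "hey {kw}",
        "{kw}, what's the weather",
        "okay {kw}, set a timer",
    ]
    utterances = []
    for template, _ in itertools.product(templates, range(n_per_phrase)):
        phrase = template.format(kw=keyword)
        voice = rng.choice(voices)
        utterances.append((synthesize(phrase, voice), phrase, voice))
    return utterances
```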
arXiv Detail & Related papers (2024-07-23T21:05:44Z) - EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
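The summary above describes iteratively shrinking the gap between synthetic and real data by extrapolating from the small model's errors. A schematic of that loop might look like the following, where llm_generate and train_small_model are placeholders standing in for the large language model and the downstream trainer, not the paper's actual prompts or API.

```python
def llm_generate(instruction, examples, n):
    """Placeholder: prompt a large language model to return n new labeled
    examples conditioned on `instruction` and the given `examples`."""
    raise NotImplementedError

def synthesize_step_by_step(task_instruction, seed_examples, real_dev_set,
                            train_small_model, n_rounds=3, n_per_round=1000):
    """Schematic of an iterative synthesis loop in the spirit of S3.

    Each round: train the small model on the current synthetic data, find
    the dev examples it gets wrong, and ask the LLM for new data targeted
    at those errors. Function names and round counts are illustrative.
    """
    synthetic = llm_generate(task_instruction, seed_examples, n_per_round)
    for _ in range(n_rounds):
        model = train_small_model(synthetic)
        errors = [(x, y) for x, y in real_dev_set if model(x) != y]
        if not errors:
            break
        # Extrapolate from the error cases to new synthetic examples.
        synthetic += llm_generate(task_instruction, errors, n_per_round)
    return synthetic
```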
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Comparative Analysis of Transfer Learning in Deep Learning
Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset [10.119929769316565]
This thesis is rooted in the pressing need to find TTS models that require less training time, fewer data samples, yet yield high-quality voice output.
The research evaluates TTS state-of-the-art model transfer learning capabilities through a thorough technical analysis.
It then conducts a hands-on experimental analysis to compare models' performance in a constrained dataset.
arXiv Detail & Related papers (2023-10-08T03:08:25Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost, while still maintaining 95% of model quality compared to baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - Improving Label-Deficient Keyword Spotting Through Self-Supervised
Pretraining [18.19207291891767]
Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants.
KWS models typically rely on a large amount of labelled data, limiting their application to situations where such data is available.
Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data.
arXiv Detail & Related papers (2022-10-04T15:56:27Z) - Improving Neural Machine Translation by Denoising Training [95.96569884410137]
We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation.
We update the model parameters with source- and target-side denoising tasks at the early stage and then tune the model normally.
Experiments show DoT consistently improves the neural machine translation performance across 12 bilingual and 16 multilingual directions.
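As a rough illustration of the two-phase recipe above (denoising first, then normal training), the sketch below pairs a simple token-level noising function with the phase switch. The noise types, rates, and the train_step/batches placeholders are illustrative assumptions, not the paper's exact method.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, rng=None):
    """Corrupt a token sequence by random token dropping and local shuffling."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > drop_prob]
    for i in range(0, len(kept), shuffle_window):
        window = kept[i:i + shuffle_window]
        rng.shuffle(window)
        kept[i:i + shuffle_window] = window
    return kept

def denoising_then_translation(train_step, batches, n_denoise_steps):
    """Phase 1: reconstruct clean source/target sentences from noised copies.
    Phase 2: switch to ordinary source-to-target translation training.

    `train_step(inputs, targets)` is a placeholder for one optimizer update;
    `batches` yields (src_tokens, tgt_tokens) pairs.
    """
    for step, (src, tgt) in enumerate(batches):
        if step < n_denoise_steps:
            train_step(add_noise(src), src)   # source-side denoising
            train_step(add_noise(tgt), tgt)   # target-side denoising
        else:
            train_step(src, tgt)              # normal NMT training
```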
arXiv Detail & Related papers (2022-01-19T00:11:38Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs of sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
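The finding above concerns how prunable end-to-end TTS models are. As a generic illustration (global magnitude pruning in PyTorch, not the paper's specific procedure), the sketch below zeroes out a given fraction of the smallest-magnitude weights.

```python
import torch

def magnitude_prune(model, sparsity=0.8):
    """Zero out the smallest-magnitude weights across all weight matrices.

    `sparsity` is the fraction of weights set to zero; masks are returned so
    pruned positions can be kept at zero during any subsequent fine-tuning.
    """
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_vals = torch.cat([p.detach().abs().flatten() for p in weights])
    k = int(sparsity * all_vals.numel())
    threshold = all_vals.kthvalue(k).values if k > 0 else torch.tensor(0.0)

    masks = []
    with torch.no_grad():
        for p in weights:
            mask = (p.abs() > threshold).to(p.dtype)
            p.mul_(mask)        # apply the pruning mask in place
            masks.append(mask)
    return masks
```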
arXiv Detail & Related papers (2021-10-04T02:03:28Z)