Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments
- URL: http://arxiv.org/abs/2407.16840v1
- Date: Tue, 23 Jul 2024 21:05:44 GMT
- Title: Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments
- Authors: Pai Zhu, Dhruuv Agarwal, Jacob W. Bartel, Kurt Partridge, Hyun Jin Park, Quan Wang
- Abstract summary: We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS.
We found that increasing TTS phrase diversity and utterance sampling monotonically improves model performance.
Our experiments are based on English and single-word utterances, but the findings generalize to i18n languages.
- Score: 8.103855990028842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the challenges in developing a high-quality custom keyword spotting (KWS) model is the lengthy and expensive process of collecting training data covering a wide range of languages, phrases, and speaking styles. We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS in different resource settings. With no real data, we found that increasing TTS phrase diversity and utterance sampling monotonically improves model performance, as evaluated by EER and AUC metrics over 11k utterances of the Speech Commands dataset. In low-resource settings, with 50k real utterances as a baseline, we found that using optimal amounts of TTS data can improve EER by 30.1% and AUC by 46.7%. Furthermore, we mix TTS data with varying amounts of real data and interpolate the real data needed to achieve various quality targets. Our experiments are based on English and single-word utterances, but the findings generalize to i18n languages and other keyword types.
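The paper does not include a reference implementation, but the EER/AUC evaluation it describes is standard. Below is a minimal sketch of how such an evaluation could be computed from per-utterance keyword detection scores, assuming one score and one binary label per utterance; the function name `evaluate_kws` and the synthetic data are illustrative, not from the paper, and the ROC utilities come from scikit-learn.

```python
# Minimal sketch of the EER/AUC evaluation described in the abstract.
# Assumptions (not from the paper): one detection score per utterance,
# label 1 = utterance contains the keyword, 0 = it does not.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_kws(scores, labels):
    """Return (EER, AUC) for binary keyword detection scores."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr  # false-reject rate at each threshold
    # EER: the operating point where false accepts == false rejects.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, auc

# Hypothetical usage on synthetic scores over 11k utterances,
# mirroring the evaluation-set size quoted in the abstract.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=11_000)
scores = labels + rng.normal(scale=0.8, size=11_000)
eer, auc = evaluate_kws(scores, labels)
print(f"EER={eer:.3f}  AUC={auc:.3f}")
```

This sketch uses ROC AUC (higher is better) as one common convention; the EER estimate is taken at the threshold where the false-accept and false-reject rates cross. Lower EER indicates better detection, and the paper's percentage figures are relative improvements in these metrics as TTS data is added.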
Related papers
- ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams [16.172599163455693]
First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language.
Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models.
Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs.
arXiv Detail & Related papers (2024-10-23T14:18:25Z)
- Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS [0.0]
This research introduces a comprehensive Bahasa text-to-speech dataset and a novel TTS model, EnGen-TTS.
The proposed EnGen-TTS model performs better than established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13.
This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications.
arXiv Detail & Related papers (2024-10-09T07:01:05Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages [0.0]
Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models.
Our study introduces a novel, entirely artificially generated benchmarking dataset tailored for speech recognition.
SpokeN-100 consists of the numbers 0 to 99 spoken by 32 different speakers in four different languages.
arXiv Detail & Related papers (2024-03-14T12:07:37Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset [4.542831770689362]
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
arXiv Detail & Related papers (2021-04-17T05:49:57Z)
- Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio [8.510792628268824]
Bootstrapping speech recognition on limited data resources has long been an area of active research.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, post-ASR text-to-text mapping, and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.