Training Wake Word Detection with Synthesized Speech Data on Confusion
Words
- URL: http://arxiv.org/abs/2011.01460v1
- Date: Tue, 3 Nov 2020 04:06:04 GMT
- Title: Training Wake Word Detection with Synthesized Speech Data on Confusion
Words
- Authors: Yan Jia, Zexin Cai, Murong Ma, Zeqing Zhao, Xuyang Wang, Junjie Wang,
Ming Li
- Abstract summary: We investigate two data augmentation setups for training end-to-end KWS systems.
One involves synthesized data from a multi-speaker speech synthesis system.
The other adds random noise to the acoustic features.
- Score: 10.97664190706851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Confusing words are commonly encountered in real-life keyword spotting
applications and cause severe performance degradation, owing to complex spoken
terms and the many kinds of words that sound similar to the predefined
keywords. To enhance the wake word detection system's robustness in such
scenarios, we investigate two data augmentation setups for training end-to-end
KWS systems. One involves synthesized data from a multi-speaker speech
synthesis system; the other adds random noise to the acoustic features.
Experimental results show that these augmentations improve the system's
robustness. Moreover, by augmenting the training set with synthetic data
generated by the multi-speaker text-to-speech system, we achieve a significant
improvement in the confusing-words scenario.
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z) - Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose a SummAttacker, which is an efficient approach to generating adversarial samples based on language models.
arXiv Detail & Related papers (2023-06-01T19:04:17Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
A recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems on several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z) - Improving Natural-Language-based Audio Retrieval with Transfer Learning
and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the used augmentations strategies reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z) - Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Acoustic Word Embedding System for Code-Switching Query-by-example
Spoken Term Detection [17.54377669932433]
We propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection.
We combine audio data in two languages for training instead of using only a single language.
arXiv Detail & Related papers (2020-05-24T15:27:56Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z) - Adversarial Feature Learning and Unsupervised Clustering based Speech
Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z) - Phonetic Feedback for Speech Enhancement With and Without Parallel
Speech Data [19.66983830788521]
Phonetic feedback is rare in speech enhancement research, even though it carries valuable top-down information.
We use the technique of mimic loss to provide phonetic feedback to an off-the-shelf enhancement system.
We show phonetic feedback can improve a state-of-the-art neural enhancement system trained with parallel speech data.
arXiv Detail & Related papers (2020-03-03T20:06:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.