Speech Enhancement for Wake-Up-Word detection in Voice Assistants
- URL: http://arxiv.org/abs/2101.12732v1
- Date: Fri, 29 Jan 2021 18:44:05 GMT
- Title: Speech Enhancement for Wake-Up-Word detection in Voice Assistants
- Authors: David Bonet, Guillermo Cámbara, Fernando López, Pablo Gómez,
Carlos Segura, Jordi Luque
- Abstract summary: Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises.
- Score: 60.103753056973815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very
important task for voice assistants. A very common issue of voice assistants is
that they get easily activated by background noise like music, TV or background
speech that accidentally triggers the device. In this paper, we propose a
Speech Enhancement (SE) model adapted to the task of WUW detection that aims at
increasing the recognition rate and reducing the false alarms in the presence
of these types of noises. The SE model is a fully-convolutional denoising
auto-encoder at waveform level and is trained using a log-Mel Spectrogram and
waveform reconstruction losses together with the BCE loss of a simple WUW
classification network. A new database has been purposely prepared for the task
of recognizing the WUW in challenging conditions containing negative samples
that are very phonetically similar to the keyword. The database is extended
with public databases and an exhaustive data augmentation to simulate different
noises and environments. The results obtained by concatenating the SE with
simple and state-of-the-art WUW detectors show that the SE does not have a
negative impact on the recognition rate in quiet environments while increasing
the performance in the presence of noise, especially when the SE and WUW
detector are trained jointly end-to-end.
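The abstract describes a training objective that combines waveform and log-Mel reconstruction losses with the BCE loss of a WUW classifier. A minimal sketch of such a combined loss is shown below; the L1 distance for the reconstruction terms and the unit weights are illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy for the WUW classification head.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

def l1_loss(a, b):
    # Mean absolute error between reference and enhanced signals/features.
    return float(np.mean(np.abs(a - b)))

def combined_se_wuw_loss(clean_wave, enhanced_wave,
                         clean_logmel, enhanced_logmel,
                         wuw_label, wuw_prob,
                         w_wave=1.0, w_mel=1.0, w_bce=1.0):
    # Weighted sum of three terms, mirroring the training setup sketched
    # in the abstract: waveform reconstruction, log-Mel spectrogram
    # reconstruction, and WUW classification (BCE). The weights and the
    # choice of L1 distance are assumptions for illustration.
    return (w_wave * l1_loss(clean_wave, enhanced_wave)
            + w_mel * l1_loss(clean_logmel, enhanced_logmel)
            + w_bce * bce_loss(wuw_label, wuw_prob))
```

In joint end-to-end training, gradients from the BCE term would flow back through the WUW classifier into the denoising auto-encoder, pushing the enhanced waveform toward representations that help detection rather than toward reconstruction alone.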
Related papers
- VoxWatch: An open-set speaker recognition benchmark on VoxCeleb [10.84962993456577]
Open-set speaker identification (OSI) deals with determining if a test speech sample belongs to a speaker from a set of pre-enrolled individuals (in-set) or if it is from an out-of-set speaker.
As the size of the in-set speaker population grows, the out-of-set scores become larger, leading to increased false alarm rates.
We present the first public benchmark for OSI, developed using the VoxCeleb dataset.
arXiv Detail & Related papers (2023-06-30T23:11:38Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models [13.456066434598155]
We address the problem of detecting speech directed to a device that does not contain a specific wake-word.
Specifically, we focus on audio coming from a touch-based invocation.
arXiv Detail & Related papers (2022-03-30T01:27:39Z)
- Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection [2.7393821783237184]
In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio.
We propose an implicit acoustic echo cancellation framework where a neural network is trained to exploit the additional information from a reference microphone channel.
We show a 56% reduction in false-reject rate for the DDD task during device playback conditions.
arXiv Detail & Related papers (2021-11-20T17:21:16Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieve improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: (a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and (b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.