Implicit Acoustic Echo Cancellation for Keyword Spotting and
Device-Directed Speech Detection
- URL: http://arxiv.org/abs/2111.10639v1
- Date: Sat, 20 Nov 2021 17:21:16 GMT
- Title: Implicit Acoustic Echo Cancellation for Keyword Spotting and
Device-Directed Speech Detection
- Authors: Samuele Cornell, Thomas Balestri, Thibaud Sénéchal
- Abstract summary: In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio.
We propose an implicit acoustic echo cancellation framework where a neural network is trained to exploit the additional information from a reference microphone channel.
We show a 56% reduction in false-reject rate for the DDD task during device playback conditions.
- Score: 2.7393821783237184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many speech-enabled human-machine interaction scenarios, user speech can
overlap with the device playback audio. In these instances, the performance of
tasks such as keyword-spotting (KWS) and device-directed speech detection (DDD)
can degrade significantly. To address this problem, we propose an implicit
acoustic echo cancellation (iAEC) framework where a neural network is trained
to exploit the additional information from a reference microphone channel to
learn to ignore the interfering signal and improve detection performance. We
study this framework for the tasks of KWS and DDD on, respectively, an
augmented version of Google Speech Commands v2 and a real-world Alexa device
dataset. Notably, we show a 56% reduction in false-reject rate for the DDD
task during device playback conditions. We also show comparable or superior
performance over a strong end-to-end neural echo cancellation + KWS baseline
for the KWS task with an order of magnitude less computational requirements.
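
Since the abstract describes the mechanism only at a high level, a minimal sketch of the idea may help: rather than running an explicit echo canceller, the detector ingests the microphone channel and the device playback reference side by side and is trained end-to-end on the detection label alone. Everything concrete below (the `ImplicitAECDetector` name, log-mel inputs, layer sizes) is an illustrative assumption, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ImplicitAECDetector(nn.Module):
    """Sketch of implicit AEC for KWS/DDD: no explicit echo-cancellation
    front end; the network sees the playback reference as an extra input
    and learns to ignore the interference it explains. All sizes are
    illustrative assumptions, not the paper's configuration."""

    def __init__(self, n_mels=40, hidden=128, num_classes=2):
        super().__init__()
        self.encoder = nn.GRU(
            input_size=2 * n_mels,   # mic features || playback-reference features
            hidden_size=hidden,
            num_layers=2,
            batch_first=True,
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, mic_feats, ref_feats):
        # Both inputs: (batch, time, n_mels). Frame-wise concatenation lets
        # the recurrent encoder correlate the two channels over time.
        x = torch.cat([mic_feats, ref_feats], dim=-1)
        out, _ = self.encoder(x)
        return self.classifier(out[:, -1])   # decision from the last frame

# Usage: train with ordinary detection labels on playback-corrupted audio;
# no AEC loss or separate canceller is needed, hence "implicit".
model = ImplicitAECDetector()
logits = model(torch.randn(8, 100, 40), torch.randn(8, 100, 40))
```
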
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
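
As a rough illustration of the DSU idea above: continuous self-supervised encoder outputs are quantized with k-means, and the resulting unit IDs serve as tokens for the language model. The codebook size and repeat-collapsing step below are common choices assumed here, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(features, num_units=500):
    # features: (num_frames, feature_dim) encoder outputs pooled over a corpus
    return KMeans(n_clusters=num_units, n_init=10).fit(features)

def to_discrete_units(codebook, utterance_feats):
    units = codebook.predict(utterance_feats)   # (num_frames,) unit IDs
    # Collapse consecutive repeats, a common step before feeding an LLM.
    keep = np.insert(np.diff(units) != 0, 0, True)
    return units[keep]
```
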
- Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
arXiv Detail & Related papers (2024-03-27T20:52:30Z)

- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
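
A brief sketch of what circular-harmonic-domain features look like for a uniform circular array; the paper's exact CH-DOA features are not reproduced here, and the array geometry and harmonic order are assumptions for illustration.

```python
import numpy as np

def circular_harmonic_coeffs(frame_stft, mic_azimuths, max_order=2):
    """Project one frame of multichannel STFT onto circular harmonics:
    B_n(f) = (1/M) * sum_m p_m(f) * exp(-1j * n * phi_m).
    frame_stft: (num_mics, num_freqs) complex spectra.
    mic_azimuths: (num_mics,) microphone angles on the circle, in radians."""
    orders = np.arange(-max_order, max_order + 1)
    basis = np.exp(-1j * np.outer(orders, mic_azimuths))   # (orders, mics)
    return basis @ frame_stft / len(mic_azimuths)          # (orders, num_freqs)

# Example: 8-mic uniform circular array, 257 frequency bins; DOA-related
# features can then be derived from ratios between harmonic orders.
mics = np.linspace(0, 2 * np.pi, 8, endpoint=False)
coeffs = circular_harmonic_coeffs(np.random.randn(8, 257) + 0j, mics)
```
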
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)

- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate the design of a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy based on the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
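
The LTH-IF strategy above alternates pruning with fine-tuning; a generic sketch of that loop using PyTorch's pruning utilities follows. The pruning amount, number of rounds, and layer selection are illustrative assumptions, not the paper's settings.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_prune_finetune(model, finetune_fn, rounds=5, amount=0.2):
    """Lottery-ticket-style iterative magnitude pruning: prune a fraction of
    the smallest weights, fine-tune, and repeat. `finetune_fn` is a
    caller-supplied training loop (assumed here)."""
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                prune.l1_unstructured(module, name="weight", amount=amount)
        finetune_fn(model)   # retrain the surviving weights, masks held fixed
    # Make the pruning permanent by folding the masks into the weights.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            prune.remove(module, "weight")
    return model
```
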
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)

- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)

- EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III [7.047338765733677]
This paper describes an online algorithm for the most recent phase of the Fearless Steps challenge.
The proposed algorithm can be trained both in a supervised and unsupervised manner.
Experiments show competitive accuracy on both the development and evaluation datasets, with a real-time factor of about 0.002 on a single-CPU machine.
arXiv Detail & Related papers (2021-06-21T12:55:51Z)

- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in noisy conditions.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)

- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
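
The online speech distortion module above can be pictured as a random augmentation applied on the fly during training; a small sketch follows, with distortion types, probabilities, and ranges chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def distort(wave):
    """Randomly contaminate a mono waveform (num_samples,) with disturbances,
    in the spirit of an online speech distortion module. All probabilities
    and parameter ranges below are illustrative assumptions."""
    if torch.rand(1).item() < 0.5:   # additive noise at a random SNR
        snr_db = 5.0 + 15.0 * torch.rand(1).item()
        noise = torch.randn_like(wave)
        scale = wave.norm() / (noise.norm() * 10 ** (snr_db / 20) + 1e-8)
        wave = wave + scale * noise
    if torch.rand(1).item() < 0.5:   # hard amplitude clipping
        wave = wave.clamp(-0.5, 0.5)
    if torch.rand(1).item() < 0.5:   # crude reverberation with a decaying IR
        ir = torch.randn(256) * torch.exp(-torch.linspace(0.0, 8.0, 256))
        wave = F.conv1d(wave.view(1, 1, -1), ir.view(1, 1, -1),
                        padding=255).view(-1)[: wave.numel()]
    return wave
```
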
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.