EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III
- URL: http://arxiv.org/abs/2106.11075v1
- Date: Mon, 21 Jun 2021 12:55:51 GMT
- Title: EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III
- Authors: Omid Ghahabi, Volker Fischer
- Abstract summary: This paper describes the EML online algorithm for the most recent phase of the Fearless Steps challenge.
The proposed algorithm can be trained both in a supervised and unsupervised manner.
Experiments show a competitive accuracy on both development and evaluation datasets with a real-time factor of about 0.002 using a single CPU machine.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech Activity Detection (SAD), locating speech segments within an audio
recording, is a core component of most speech technology applications. Robust SAD is
usually more difficult in noisy conditions with varying signal-to-noise ratios
(SNR). The Fearless Steps challenge has recently provided such data from the
NASA Apollo-11 mission for different speech processing tasks including SAD.
Most audio recordings are degraded by different kinds and levels of noise
varying within and between channels. This paper describes the EML online
algorithm for the most recent phase of this challenge. The proposed algorithm
can be trained both in a supervised and unsupervised manner and assigns speech
and non-speech labels at runtime approximately every 0.1 sec. The experimental
results show a competitive accuracy on both development and evaluation datasets
with a real-time factor of about 0.002 using a single CPU machine.
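The frame-wise online labeling described above can be illustrated with a minimal energy-based sketch. This is NOT the EML algorithm from the paper; the energy threshold, sample rate, and the 0.1 s decision interval are assumptions chosen only to demonstrate the idea of emitting a speech/non-speech label roughly every 0.1 seconds as audio arrives.

```python
import math

def online_sad(samples, sample_rate=8000, frame_sec=0.1, energy_threshold=0.01):
    """Assign a speech/non-speech label to each ~0.1 s frame of audio.

    A toy stand-in for an online SAD front end: each frame is labeled as
    soon as its samples are available, using mean short-time energy.
    """
    frame_len = int(sample_rate * frame_sec)
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean short-time energy
        labels.append("speech" if energy > energy_threshold else "non-speech")
    return labels

# Example: 0.2 s of near-silence followed by 0.2 s of a 300 Hz tone
sr = 8000
silence = [0.001] * int(0.2 * sr)
tone = [0.5 * math.sin(2 * math.pi * 300 * t / sr) for t in range(int(0.2 * sr))]
print(online_sad(silence + tone, sample_rate=sr))
# ['non-speech', 'non-speech', 'speech', 'speech']
```

The real-time factor quoted in the abstract is simply processing time divided by audio duration, so an RTF of about 0.002 means one hour of audio is processed in roughly 7 seconds on a single CPU.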
Related papers
- A contrastive-learning approach for auditory attention detection
We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
arXiv Detail & Related papers (2024-10-24T03:13:53Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Double Mixture: Towards Continual Event Detection from Speech
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Towards Event Extraction from Speech with Contextual Clues
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z)
- DiffSED: Sound Event Detection with Denoising Diffusion
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio.
We propose an implicit acoustic echo cancellation framework where a neural network is trained to exploit the additional information from a reference microphone channel.
We show a 56% reduction in false-reject rate for the device-directed speech detection (DDD) task during device playback conditions.
arXiv Detail & Related papers (2021-11-20T17:21:16Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in noisy conditions.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review
Five state-of-the-art GCI detection algorithms are compared using six different databases.
The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy.
It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy.
arXiv Detail & Related papers (2019-12-28T14:12:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.