Evaluating robustness of You Only Hear Once (YOHO) Algorithm on noisy audios in the VOICe Dataset
- URL: http://arxiv.org/abs/2111.01205v1
- Date: Mon, 1 Nov 2021 18:58:50 GMT
- Title: Evaluating robustness of You Only Hear Once (YOHO) Algorithm on noisy audios in the VOICe Dataset
- Authors: Soham Tiwari, Kshitiz Lakhotia, Manjunath Mulimani
- Abstract summary: Sound event detection (SED) in machine listening entails identifying the different sounds in an audio file and determining the start and end time of each sound event.
In this paper, we explore the performance of the YOHO algorithm on the VOICe dataset containing audio files with noise at different signal-to-noise ratios (SNR).
YOHO could outperform or at least match the best-performing SED algorithms reported in the VOICe dataset paper and make inferences in less time.
- Score: 8.48671341519897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound event detection (SED) in machine listening entails identifying the
different sounds in an audio file and determining the start and end time of a
particular sound event in the audio. SED finds use in various applications such
as audio surveillance, speech recognition, and context-based indexing and
retrieval of data in a multimedia database. However, in real-life scenarios,
audio from various sources is seldom free of interfering noise or
disturbance. In this paper, we test the performance of the You Only Hear Once
(YOHO) algorithm on noisy audio data. Inspired by the You Only Look Once (YOLO)
algorithm in computer vision, the YOHO algorithm can match the performance of
the various state-of-the-art algorithms on datasets such as Music Speech
Detection, TUT Sound Event, and Urban-SED, but at lower
inference times. In this paper, we explore the performance of the YOHO
algorithm on the VOICe dataset containing audio files with noise at different
signal-to-noise ratios (SNR). YOHO could outperform or at least match the
best-performing SED algorithms reported in the VOICe dataset paper and make
inferences in less time.
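
The VOICe dataset referenced in the abstract consists of sound-event audio mixed with noise at different signal-to-noise ratios. As a rough, self-contained illustration of what evaluating at a given SNR entails (this is not code from the paper or the VOICe toolkit; the helper name and the toy signals are made up for the example), the sketch below scales a noise waveform so that the mixture reaches a requested SNR in dB:

```python
import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `signal` so the mixture has the requested SNR in dB.

    Illustrative sketch only: assumes 1-D arrays at the same sample rate and
    that `noise` is at least as long as `signal`.
    """
    noise = noise[: len(signal)]               # trim noise to the signal length
    p_signal = np.mean(signal ** 2)            # average signal power
    p_noise = np.mean(noise ** 2) + 1e-12      # average noise power (avoid /0)
    # Scale the noise so that 10*log10(p_signal / p_scaled_noise) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: a 0 dB mixture (signal and noise carry equal power)
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
interference = rng.standard_normal(16000)                    # stand-in noise source
noisy = mix_at_snr(clean, interference, snr_db=0.0)
```

Running a trained SED model on such mixtures at several SNR settings is, in outline, the kind of robustness evaluation the abstract describes.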
Related papers
- A contrastive-learning approach for auditory attention detection [11.28441753596964]
We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
arXiv Detail & Related papers (2024-10-24T03:13:53Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection [0.0]
We present a novel approach called You Only Hear Once (YOHO).
We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification (a toy decoding sketch appears after this list).
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network.
arXiv Detail & Related papers (2021-09-01T12:50:16Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new AVSOL problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in the presence of noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese Structure with a Pairwise Presence Matrix [11.54047475139282]
State-of-the-art sound event retrieval models have focused on single-label audio recordings.
We propose different Deep Learning architectures with a Siamese structure and a Pairwise Presence Matrix.
The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings.
arXiv Detail & Related papers (2020-02-20T21:33:07Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
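
The YOHO entry in the list above recasts acoustic boundary detection as regression rather than frame-based classification. The toy sketch below shows one way such regression outputs could be decoded into sound events with absolute start and end times; the output layout, shapes, and names are assumptions for illustration, not the authors' implementation (in particular, merging of detections that span adjacent time bins is omitted):

```python
import numpy as np

def decode_events(output: np.ndarray, bin_duration: float, threshold: float = 0.5):
    """Decode YOHO-style regression outputs into (class, start, end) events.

    Assumed layout (hypothetical): `output` has shape (num_bins, num_classes, 3),
    where the last axis holds (presence score, start offset, end offset), with
    the offsets expressed as fractions of the time bin.
    """
    events = []
    num_bins, num_classes, _ = output.shape
    for b in range(num_bins):
        for c in range(num_classes):
            presence, rel_start, rel_end = output[b, c]
            if presence >= threshold:
                start = float((b + rel_start) * bin_duration)  # absolute start (s)
                end = float((b + rel_end) * bin_duration)      # absolute end (s)
                events.append((c, round(start, 3), round(end, 3)))
    return events

# Example: two 1-second bins, one class, a confident detection in the first bin only
toy = np.array([[[0.9, 0.10, 0.80]],   # bin 0: event from 0.10 s to 0.80 s
                [[0.1, 0.00, 0.00]]])  # bin 1: below the presence threshold
print(decode_events(toy, bin_duration=1.0))  # -> [(0, 0.1, 0.8)]
```

Predicting a few numbers per time bin rather than a label for every frame is, per the summaries above, what allows YOHO to reach lower inference times than frame-based approaches.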
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.