You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and
Sound Event Detection
- URL: http://arxiv.org/abs/2109.00962v1
- Date: Wed, 1 Sep 2021 12:50:16 GMT
- Title: You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and
Sound Event Detection
- Authors: Satvik Venkatesh, David Moffat, Eduardo Reck Miranda
- Abstract summary: We present a novel approach called You Only Hear Once (YOHO).
We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio segmentation and sound event detection are crucial topics in machine
listening that aim to detect acoustic classes and their respective boundaries.
These tasks are useful for audio-content analysis, speech recognition, audio-indexing,
and music information retrieval. In recent years, most research articles adopt
segmentation-by-classification. This technique divides audio into small frames
and individually performs classification on these frames. In this paper, we
present a novel approach called You Only Hear Once (YOHO), which is inspired by
the YOLO algorithm popularly adopted in Computer Vision. We convert the
detection of acoustic boundaries into a regression problem instead of
frame-based classification. This is done by having separate output neurons to
detect the presence of an audio class and predict its start and end points.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art
Convolutional Recurrent Neural Network on multiple datasets. As YOHO is purely
a convolutional neural network and has no recurrent layers, it is faster during
inference. In addition, as this approach is more end-to-end and predicts
acoustic boundaries directly, it is significantly quicker during
post-processing and smoothing.
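For intuition, here is a minimal sketch (in plain Python/NumPy, not the authors' code) of how the regression-style outputs described above could be decoded into labelled events. It assumes the network divides a recording into coarse time bins and, for each bin and class, emits a presence score plus start and end positions expressed as fractions of the bin; the bin layout, threshold, and merging rule are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

# Hypothetical YOHO-style output grid: shape (num_bins, num_classes, 3), where the
# last axis holds [presence, rel_start, rel_end] for each class in each time bin.
# The encoding (fractions of a bin) and the 0.5 threshold are assumptions made for
# illustration only.

def decode_outputs(outputs, bin_duration, class_names, threshold=0.5):
    """Turn the regression grid into (class, start_sec, end_sec) events."""
    events = []
    num_bins, num_classes, _ = outputs.shape
    for b in range(num_bins):
        bin_start = b * bin_duration
        for c in range(num_classes):
            presence, rel_start, rel_end = outputs[b, c]
            if presence >= threshold:
                events.append((class_names[c],
                               bin_start + rel_start * bin_duration,
                               bin_start + rel_end * bin_duration))
    return events

def merge_adjacent(events, gap=0.0):
    """Simple smoothing: merge same-class events that touch or overlap."""
    merged = []
    for name, start, end in sorted(events, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == name and start <= merged[-1][2] + gap:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([name, start, end])
    return [tuple(e) for e in merged]

# Toy example: two 0.8 s bins and two classes ("speech", "music").
outputs = np.array([
    [[0.9, 0.1, 1.0], [0.2, 0.0, 0.0]],  # bin 0: speech detected from 0.08 s to 0.8 s
    [[0.8, 0.0, 0.5], [0.7, 0.4, 1.0]],  # bin 1: speech continues, music starts
])
for name, start, end in merge_adjacent(decode_outputs(outputs, 0.8, ["speech", "music"])):
    print(f"{name}: {start:.2f}-{end:.2f} s")
# music: 1.12-1.60 s
# speech: 0.08-1.20 s
```

Because the boundaries come straight out of the regression head, the only post-processing needed in a sketch like this is the merge step, which is consistent with the abstract's claim of faster post-processing and smoothing.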
Related papers
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating-point operations by more than half for off-the-shelf audio neural networks (a minimal sketch of such a pooling front-end appears after this list).
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively improves the estimation of the different speakers.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- Evaluating robustness of You Only Hear Once (YOHO) Algorithm on noisy audios in the VOICe Dataset [8.48671341519897]
Sound event detection (SED) in machine listening entails identifying the different sounds in an audio file along with the start and end times of each sound event.
In this paper, we explore the performance of the YOHO algorithm on the VOICe dataset, which contains audio files with noise at different signal-to-noise ratios (SNRs).
YOHO could outperform or at least match the best-performing SED algorithms reported in the VOICe dataset paper, while making inferences in less time.
arXiv Detail & Related papers (2021-11-01T18:58:50Z)
- Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g. at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via noise-contrastive estimation (NCE).
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
- Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning [14.60531205031547]
We present a contrastive learning framework that derives from the segment-level search objective.
In the segment-level search task, where conventional audio fingerprinting systems typically fail, our system shows promising results while using 10x smaller storage.
arXiv Detail & Related papers (2020-10-22T17:44:40Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who can't perceive events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
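As referenced in the SimPFs entry above, here is a minimal sketch of what a simple non-parametric pooling front-end might look like. The mel-spectrogram input and the fixed mean-pooling factor are assumed details for illustration, not taken from the paper; the point is only that shrinking the time axis before the classifier is where the FLOP savings come from.

```python
import numpy as np

# Hypothetical non-parametric pooling front-end in the spirit of SimPFs (not the
# authors' implementation): mean-pool a (time, mels) spectrogram along time by a
# fixed factor before it reaches the classifier, cutting downstream compute
# roughly by that factor.

def mean_pool_time(spectrogram: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average consecutive groups of `factor` frames along the time axis."""
    frames, mels = spectrogram.shape
    usable = frames - frames % factor                     # drop the ragged tail
    return spectrogram[:usable].reshape(-1, factor, mels).mean(axis=1)

spec = np.random.rand(400, 64)        # toy input: 400 frames x 64 mel bands
print(mean_pool_time(spec, 2).shape)  # (200, 64): half the frames fed onward
```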
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.