Binaural Signal Representations for Joint Sound Event Detection and
Acoustic Scene Classification
- URL: http://arxiv.org/abs/2209.05900v1
- Date: Tue, 13 Sep 2022 11:29:00 GMT
- Title: Binaural Signal Representations for Joint Sound Event Detection and
Acoustic Scene Classification
- Authors: Daniel Aleksander Krause, Annamaria Mesaros
- Abstract summary: Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis.
Considering shared information between sound events and acoustic scenes, performing both tasks jointly is a natural part of a complex machine listening system.
In this paper, we investigate the usefulness of several spatial audio features in training a joint deep neural network (DNN) model performing SED and ASC.
- Score: 3.300149824239397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound event detection (SED) and acoustic scene classification (ASC) are two
widely researched audio tasks that constitute an important part of research on
acoustic scene analysis. Considering shared information between sound events
and acoustic scenes, performing both tasks jointly is a natural part of a
complex machine listening system. In this paper, we investigate the usefulness
of several spatial audio features in training a joint deep neural network (DNN)
model performing SED and ASC. Experiments are performed for two different
datasets containing binaural recordings and synchronous sound event and
acoustic scene labels to analyse the differences between performing SED and ASC
separately or jointly. The presented results show that the use of specific
binaural features, mainly the Generalized Cross Correlation with Phase
Transform (GCC-PHAT) and sines and cosines of phase differences, results in a
better-performing model in both the separate and joint tasks, as compared with
baseline methods based on log-mel energies only.
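As a rough illustration of the features compared in the abstract, the sketch below computes per-channel log-mel energies, GCC-PHAT, and sines and cosines of the inter-channel phase differences from a binaural pair. The STFT settings, mel resolution, and lag window are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the binaural features named in the abstract, assuming
# typical STFT/mel settings (the paper's exact configuration may differ).
import numpy as np
import librosa

def binaural_features(x_left, x_right, sr=24000, n_fft=1024, hop=512,
                      n_mels=64, max_lag=32):
    # Per-channel complex STFTs: shape (1 + n_fft // 2, n_frames)
    XL = librosa.stft(x_left, n_fft=n_fft, hop_length=hop)
    XR = librosa.stft(x_right, n_fft=n_fft, hop_length=hop)

    # Baseline feature: log-mel energies per channel
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    logmel_l = np.log(mel_fb @ np.abs(XL) ** 2 + 1e-10)
    logmel_r = np.log(mel_fb @ np.abs(XR) ** 2 + 1e-10)

    # GCC-PHAT: whiten the cross-spectrum, transform back to the lag
    # domain, and keep a small window of lags around zero (the range of
    # plausible inter-aural time differences)
    cross = XL * np.conj(XR)
    cross /= np.abs(cross) + 1e-10
    cc = np.fft.irfft(cross, n=n_fft, axis=0)        # (n_fft, n_frames)
    gcc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]], axis=0)

    # Sines and cosines of the inter-channel phase differences
    ipd = np.angle(XL) - np.angle(XR)
    return logmel_l, logmel_r, gcc, np.sin(ipd), np.cos(ipd)
```

The PHAT weighting discards cross-spectrum magnitude, so the lag-domain peak (the implied inter-aural delay) is robust to spectral coloration, while the sin/cos encoding sidesteps the 2-pi wrap-around of raw phase differences. Note that the maps have different second dimensions (mel bands, lags, linear frequency bins); before stacking them as network input channels they would typically be mel-filtered, cropped, or interpolated to a common size.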
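For the joint model itself, one plausible shape (an assumption for illustration; the paper does not necessarily use this exact architecture) is a shared convolutional-recurrent trunk with two heads: frame-level multi-label SED and clip-level single-label ASC.

```python
# Sketch of a joint SED + ASC network: shared CRNN trunk, two task heads.
# Layer sizes and the 5-channel input (the aligned feature maps from the
# sketch above) are illustrative assumptions.
import torch
import torch.nn as nn

class JointSedAsc(nn.Module):
    def __init__(self, n_channels=5, n_bins=64, n_events=10, n_scenes=6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(n_channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),  # pool frequency only, keep time frames
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (n_bins // 16), 128,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(256, n_events)  # per-frame event activity
        self.asc_head = nn.Linear(256, n_scenes)  # per-clip scene logits

    def forward(self, x):              # x: (batch, channels, time, bins)
        h = self.trunk(x)              # (batch, 64, time, bins // 16)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.gru(h)             # (batch, time, 256)
        sed = torch.sigmoid(self.sed_head(h))   # multi-label, per frame
        asc = self.asc_head(h.mean(dim=1))      # single-label, per clip
        return sed, asc
```

Training would sum a frame-wise binary cross-entropy on the SED output and a clip-wise categorical cross-entropy on the ASC logits; dropping either head recovers the corresponding single-task model, which is how the separate and joint settings can be compared.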
Related papers
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Robust, General, and Low Complexity Acoustic Scene Classification
Systems and An Effective Visualization for Presenting a Sound Scene Context [53.80051967863102]
We present a comprehensive analysis of Acoustic Scene Classification (ASC).
We propose an inception-based and low footprint ASC model, referred to as the ASC baseline.
Next, we improve the ASC baseline by proposing a novel deep neural network architecture.
arXiv Detail & Related papers (2022-10-16T19:07:21Z) - Joint Direction and Proximity Classification of Overlapping Sound Events
from Binaural Audio [7.050270263489538]
We aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings.
Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes.
We propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of appearing sources.
arXiv Detail & Related papers (2021-07-26T08:48:46Z) - DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic
Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
arXiv Detail & Related papers (2021-06-29T09:18:30Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Cyclic Co-Learning of Sounding Object Visual Grounding and Sound
Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
In this paper, we show that the proposed framework outperforms the compared recent approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z) - Investigations on Audiovisual Emotion Recognition in Noisy Conditions [43.40644186593322]
We present an investigation on two emotion datasets with superimposed noise at different signal-to-noise ratios.
The results show a significant performance decrease when a model trained on clean audio is applied to noisy data.
arXiv Detail & Related papers (2021-03-02T17:45:16Z) - Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z) - Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese
Structure with a Pairwise Presence Matrix [11.54047475139282]
State-of-the-art sound event retrieval models have focused on single-label audio recordings.
We propose different Deep Learning architectures with a Siamese-structure and a Pairwise Presence Matrix.
The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings.
arXiv Detail & Related papers (2020-02-20T21:33:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.