Joint Direction and Proximity Classification of Overlapping Sound Events
from Binaural Audio
- URL: http://arxiv.org/abs/2107.12033v1
- Date: Mon, 26 Jul 2021 08:48:46 GMT
- Title: Joint Direction and Proximity Classification of Overlapping Sound Events
from Binaural Audio
- Authors: Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros
- Abstract summary: We aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings.
Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes.
We propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of appearing sources.
- Score: 7.050270263489538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound source proximity and direction estimation are of great interest in many
practical applications, since they provide significant information for acoustic
scene analysis. As both tasks share complementary qualities, ensuring efficient
interaction between these two is crucial for a complete picture of an aural
environment. In this paper, we aim to investigate several ways of performing
joint proximity and direction estimation from binaural recordings, both defined
as coarse classification problems based on Deep Neural Networks (DNNs).
Considering the limitations of binaural audio, we propose two methods of
splitting the sphere into angular areas in order to obtain a set of directional
classes. For each method we study different model types to acquire information
about the direction-of-arrival (DoA). Finally, we propose various ways of
combining the proximity and direction estimation problems into a joint task
providing temporal information about the onsets and offsets of the appearing
sources. Experiments are performed for a synthetic reverberant binaural dataset
consisting of up to two overlapping sound events.
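The idea of treating direction and proximity as coarse classes and merging them into one joint, frame-wise task can be sketched as follows. This is only an illustrative sketch, not the authors' method: the abstract does not give the actual angular split, the proximity threshold, or the frame length, so the four horizontal sectors, the 1.5 m threshold, the 0.1 s hop, and all function names below are assumptions.

```python
# Hypothetical coarse classes; the paper's actual sphere split is not given here.
DIRECTIONS = ("front", "left", "back", "right")
PROXIMITY = ("near", "far")


def direction_class(azimuth_deg: float) -> int:
    """Map an azimuth (0 = front, counter-clockwise positive) to a 90-degree sector."""
    # Shift by 45 degrees so that "front" covers [-45, 45).
    return int(((azimuth_deg + 45.0) % 360.0) // 90.0)


def proximity_class(distance_m: float, threshold_m: float = 1.5) -> int:
    """Binary proximity label; the 1.5 m threshold is an assumed value."""
    return 0 if distance_m < threshold_m else 1


def joint_class(azimuth_deg: float, distance_m: float) -> int:
    """Combine both labels into a single joint class index."""
    return direction_class(azimuth_deg) * len(PROXIMITY) + proximity_class(distance_m)


def joint_name(index: int) -> str:
    """Human-readable name of a joint class index."""
    d, p = divmod(index, len(PROXIMITY))
    return f"{DIRECTIONS[d]}-{PROXIMITY[p]}"


def frame_targets(events, n_frames: int, hop_s: float = 0.1):
    """Build multi-hot frame-wise targets from (onset_s, offset_s, azimuth_deg, distance_m).

    Overlapping events simply activate several classes in the same frame,
    which also encodes onsets and offsets along the time axis.
    """
    n_classes = len(DIRECTIONS) * len(PROXIMITY)
    targets = [[0] * n_classes for _ in range(n_frames)]
    for onset_s, offset_s, azimuth_deg, distance_m in events:
        c = joint_class(azimuth_deg, distance_m)
        first = int(round(onset_s / hop_s))
        last = min(n_frames, int(round(offset_s / hop_s)))
        for t in range(first, last):
            targets[t][c] = 1
    return targets


if __name__ == "__main__":
    # Two overlapping events: one close in front, one far to the left.
    events = [(0.0, 0.5, 0.0, 0.5), (0.3, 0.8, 90.0, 3.0)]
    for t, row in enumerate(frame_targets(events, n_frames=10)):
        active = [joint_name(i) for i, v in enumerate(row) if v]
        print(t, active)
```

A single joint class per direction-proximity pair is only one way to combine the two problems; separate output branches for direction and proximity would be another, and the abstract states that several such combinations are compared.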
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Sound event localization and classification using WASN in Outdoor Environment [2.234738672139924]
Methods for sound event localization and classification typically rely on a single microphone array.
We propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of the sound source.
arXiv Detail & Related papers (2024-03-29T11:44:14Z) - Sound Event Detection and Localization with Distance Estimation [4.139846693958608]
3D SELD is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA).
We study two ways of integrating distance estimation within the SELD core.
Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
arXiv Detail & Related papers (2024-03-18T14:34:16Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Binaural Signal Representations for Joint Sound Event Detection and
Acoustic Scene Classification [3.300149824239397]
Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis.
Considering shared information between sound events and acoustic scenes, performing both tasks jointly is a natural part of a complex machine listening system.
In this paper, we investigate the usefulness of several spatial audio features in training a joint deep neural network (DNN) model performing SED and ASC.
arXiv Detail & Related papers (2022-09-13T11:29:00Z) - End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z) - Metric-based multimodal meta-learning for human movement identification
via footstep recognition [3.300376360949452]
We describe a novel metric-based learning approach that introduces a multimodal framework.
We learn general-purpose representations from low multisensory data obtained from omnipresent sensing systems.
Our results employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity.
arXiv Detail & Related papers (2021-11-15T18:46:14Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)