BAST: Binaural Audio Spectrogram Transformer for Binaural Sound
Localization
- URL: http://arxiv.org/abs/2207.03927v1
- Date: Fri, 8 Jul 2022 14:27:52 GMT
- Title: BAST: Binaural Audio Spectrogram Transformer for Binaural Sound
Localization
- Authors: Sheng Kuang, Kiki van der Heijden, Siamak Mehrkanoon
- Abstract summary: We propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberant environments.
Our model, with subtraction-based interaural integration and a hybrid loss, achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 across all azimuths.
- Score: 3.5665681694253903
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate sound localization in reverberant environments is essential for
human auditory perception. Recently, Convolutional Neural Networks (CNNs) have
been used to model the binaural human auditory pathway. However, CNNs struggle
to capture global acoustic features. To address this issue, we propose a novel
end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the
sound azimuth in both anechoic and reverberant environments. Two implementation
variants are explored: BAST-SP and BAST-NSP, corresponding to the BAST model
with shared and non-shared parameters, respectively. Our model with
subtraction-based interaural integration and a hybrid loss achieves an angular
distance of 1.29 degrees and a Mean Square Error of 1e-3 across all azimuths,
significantly surpassing the CNN-based model. An exploratory analysis of BAST's
performance on the left and right hemifields and in anechoic and reverberant
environments shows its generalization ability, as well as the feasibility of
binaural Transformers for sound localization. Furthermore, an analysis of the
attention maps provides additional insight into the localization process in a
natural reverberant environment.
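The abstract names two concrete mechanisms: subtraction-based interaural integration of the two channel encodings, and a hybrid loss combining angular distance with mean square error. The PyTorch sketch below illustrates both; the encoder depth, embedding sizes, time pooling, and the (cos, sin) azimuth parameterization are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract:
# subtraction-based interaural integration and a hybrid
# (MSE + angular distance) loss. All module names, sizes, and the
# (cos, sin) azimuth parameterization are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinauralAzimuthNet(nn.Module):
    def __init__(self, n_mels=128, d_model=192, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Two non-shared branches (BAST-NSP style); BAST-SP would
        # instead route both channels through one shared encoder.
        # nn.TransformerEncoder deep-copies `layer`, so the two
        # branches below have independent weights.
        self.left = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.right = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 2)  # predicts (cos az, sin az)

    def forward(self, spec_l, spec_r):
        # spec_l, spec_r: (batch, frames, n_mels) log-mel spectrograms.
        h_l = self.left(self.embed(spec_l))
        h_r = self.right(self.embed(spec_r))
        h = h_l - h_r          # subtraction interaural integration
        h = h.mean(dim=1)      # mean-pool over time frames
        return self.head(h)

def hybrid_loss(pred, target_az, alpha=0.5):
    # pred: (batch, 2) raw (cos, sin) estimate; target_az: (batch,) radians.
    target = torch.stack([torch.cos(target_az), torch.sin(target_az)], dim=-1)
    mse = F.mse_loss(pred, target)
    cos_sim = (F.normalize(pred, dim=-1) * target).sum(-1)
    angular = torch.arccos(cos_sim.clamp(-1 + 1e-7, 1 - 1e-7)).mean()
    return alpha * mse + (1 - alpha) * angular

# Toy usage with random inputs.
model = BinauralAzimuthNet()
spec_l, spec_r = torch.randn(4, 64, 128), torch.randn(4, 64, 128)
loss = hybrid_loss(model(spec_l, spec_r), torch.rand(4) * 2 * torch.pi)
```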
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene.
Driven by the unique properties of the RIR, we design a temporal correlation module and a multi-scale energy decay criterion.
Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by the two channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experimental results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
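As a hedged illustration of the decomposition this summary describes, the sketch below takes the channel mean as the common part (one natural choice; BinauralGrad's exact formulation may differ) and keeps per-channel residuals as the specific parts, so the two synthesis stages have well-defined targets.

```python
import torch

def decompose(binaural):
    # binaural: (2, T) waveform. The common part is taken as the
    # channel mean (an illustrative choice); the specific parts are
    # the per-channel residuals, so common + specific reconstructs
    # the input exactly.
    common = binaural.mean(dim=0, keepdim=True)  # (1, T)
    specific = binaural - common                 # (2, T)
    return common, specific

# A two-stage model would generate `common` first and `specific`
# conditioned on it; here we just check exact reconstruction.
x = torch.randn(2, 16000)
c, s = decompose(x)
assert torch.allclose(c + s, x)
```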
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- Joint Direction and Proximity Classification of Overlapping Sound Events from Binaural Audio [7.050270263489538]
We aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings.
Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes.
We propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of appearing sources.
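To make the "angular areas" idea concrete, the sketch below splits the sphere into a uniform grid of azimuth sectors and elevation bands and maps a direction to a class index. The paper proposes two specific partitioning schemes that are not detailed in this summary, so the uniform grid here is only an illustrative stand-in.

```python
import numpy as np

def direction_class(azimuth_deg, elevation_deg, n_az=8, n_el=3):
    # Split azimuth [0, 360) into n_az equal sectors and elevation
    # [-90, 90] into n_el equal bands; return a single class index.
    az = azimuth_deg % 360.0
    az_bin = int(az // (360.0 / n_az))
    el = float(np.clip(elevation_deg, -90.0, 90.0 - 1e-6))
    el_bin = int((el + 90.0) // (180.0 / n_el))
    return el_bin * n_az + az_bin  # in [0, n_az * n_el - 1]

# Example: a source at 95 deg azimuth, 10 deg elevation.
print(direction_class(95.0, 10.0))  # -> 10 with the default 8 x 3 grid
```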
arXiv Detail & Related papers (2021-07-26T08:48:46Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
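The HRIR relationship mentioned above is, in its standard form, a convolution of the mono source signal with a direction-specific left/right filter pair. The sketch below shows that generic operation; it is not PseudoBinaural's full spherical-harmonic pipeline, and the random stand-in signals are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    # Convolve a mono signal with the HRIR pair for one direction;
    # the left/right filters encode the interaural time and level
    # differences for that direction.
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right])  # (2, len(mono) + len(hrir) - 1)

# Toy usage; real HRIRs come from a measured database (e.g. CIPIC).
mono = np.random.randn(16000)
binaural = binauralize(mono, np.random.randn(256), np.random.randn(256))
```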
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.