Visually Informed Binaural Audio Generation without Binaural Audios
- URL: http://arxiv.org/abs/2104.06162v1
- Date: Tue, 13 Apr 2021 13:07:33 GMT
- Title: Visually Informed Binaural Audio Generation without Binaural Audios
- Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
- Abstract summary: We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received binaural audios.
Our binaural-recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
- Score: 130.80178993441413
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Stereophonic audio, especially binaural audio, plays an essential role in
immersive viewing environments. Recent research has explored generating
visually guided stereophonic audios supervised by multi-channel audio
collections. However, due to the requirement of professional recording devices,
existing datasets are limited in scale and variety, which impedes the
generalization of supervised methods in real-world scenarios. In this work, we
propose PseudoBinaural, an effective pipeline that is free of binaural
recordings. The key insight is to carefully build pseudo visual-stereo pairs
with mono data for training. Specifically, we leverage spherical harmonic
decomposition and head-related impulse response (HRIR) to identify the
relationship between spatial locations and received binaural audios. Then in
the visual modality, corresponding visual cues of the mono data are manually
placed at sound source positions to form the pairs. Compared to
fully-supervised paradigms, our binaural-recording-free pipeline shows great
stability in cross-dataset evaluation and achieves comparable performance under
subjective preference. Moreover, combined with binaural recordings, our method
is able to further boost the performance of binaural audio generation under
supervised settings.
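To make the pseudo-pair construction concrete, below is a minimal, hypothetical sketch of the audio side: a mono clip is assigned a direction, projected onto first-order spherical-harmonic (ambisonic) channels, and rendered binaurally by convolving with the HRIR pair for that direction. The ACN/SN3D ambisonic convention and the random stand-ins for measured HRIRs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def encode_first_order_ambisonics(mono, azimuth, elevation):
    """Project a mono source at (azimuth, elevation), in radians, onto the
    first-order spherical-harmonic channels W, Y, Z, X (ACN order, SN3D)."""
    w = mono
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source binaurally by convolving it with the HRIR pair
    measured at the source's direction, then trimming to the input length."""
    left = fftconvolve(mono, hrir_left)[: len(mono)]
    right = fftconvolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])

# Build one pseudo stereo clip: spatialize a mono source at a chosen
# direction; the matching "visual" side would paste the source's image
# crop at the same position in the frame.
mono = np.random.randn(16000)    # 1 s of audio at 16 kHz
hrir_l = np.random.randn(256)    # stand-in for a measured left HRIR
hrir_r = np.random.randn(256)    # stand-in for a measured right HRIR
pseudo_stereo = binauralize(mono, hrir_l, hrir_r)
```

In the paper's setting, several mono sources are spatialized and mixed this way, so the network never needs a real binaural recording for supervision.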
Related papers
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by the left and right channels and a specific part that differs in each channel (see the sketch after this entry).
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
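As a hypothetical illustration of the decomposition that the BinauralGrad summary above refers to, the sketch below takes the common part to be the per-sample mean of the two channels and the specific part to be each channel's residual; this is only an assumed reading of the paper's formulation, not the authors' code.

```python
import numpy as np

def decompose(binaural):
    """Split a (2, T) binaural signal into a common part shared by both
    channels and a per-channel specific residual."""
    common = binaural.mean(axis=0)   # modeled by the first diffusion stage
    specific = binaural - common     # modeled by the second diffusion stage
    return common, specific

def recompose(common, specific):
    """Exact inverse: adding the residual back recovers both channels."""
    return common + specific
```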
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)