Exploiting Audio-Visual Consistency with Partial Supervision for Spatial
Audio Generation
- URL: http://arxiv.org/abs/2105.00708v1
- Date: Mon, 3 May 2021 09:34:11 GMT
- Title: Exploiting Audio-Visual Consistency with Partial Supervision for Spatial
Audio Generation
- Authors: Yan-Bo Lin and Yu-Chiang Frank Wang
- Abstract summary: We propose an audio spatialization framework to convert a monaural video into a one exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
- Score: 45.526051369551915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans perceive a rich auditory experience through the distinct sounds heard by each ear. Videos recorded with binaural audio in particular simulate how humans receive ambient sound. However, a large number of videos contain only monaural audio, which degrades the user experience due to the lack of ambient information.
To address this issue, we propose an audio spatialization framework to convert
a monaural video into a binaural one by exploiting the relationship between audio
and visual components. By preserving the left-right consistency in both audio
and visual modalities, our learning strategy can be viewed as a self-supervised
learning technique, and alleviates the dependency on a large amount of video
data with ground truth binaural audio data during training. Experiments on
benchmark datasets confirm the effectiveness of our proposed framework in both
semi-supervised and fully supervised scenarios, with ablation studies and
visualizations further supporting the use of our model for audio spatialization.
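To make the left-right consistency idea above concrete: mirroring a video frame horizontally should correspond to swapping the left and right audio channels, which gives a training signal even for clips without ground-truth binaural audio. The sketch below is a minimal, hypothetical PyTorch illustration of such a consistency loss; `MonoToBinauralNet`, its layer sizes, and the input shapes are placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class MonoToBinauralNet(nn.Module):
    """Hypothetical stand-in for an audio spatialization model.

    Takes a mono spectrogram and per-frame visual features, and predicts
    a two-channel (left/right) spectrogram.
    """
    def __init__(self, audio_dim=257, visual_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.visual_proj = nn.Linear(visual_dim, 256)
        self.head = nn.Linear(512, 2 * audio_dim)  # left + right frequency bins

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, T, F); visual_feat: (B, T, visual_dim)
        a = self.audio_proj(mono_spec)
        v = self.visual_proj(visual_feat)
        out = self.head(torch.cat([a, v], dim=-1))
        left, right = out.chunk(2, dim=-1)
        return left, right

def left_right_consistency_loss(model, mono_spec, visual_feat, visual_feat_flipped):
    """Self-supervised loss: a horizontally flipped view should produce
    binaural channels that are swapped relative to the original view."""
    left, right = model(mono_spec, visual_feat)
    left_flip, right_flip = model(mono_spec, visual_feat_flipped)
    # The flipped view's left channel should match the original right channel,
    # and vice versa.
    return (nn.functional.l1_loss(left_flip, right)
            + nn.functional.l1_loss(right_flip, left))
```

In a semi-supervised setting, a term like this would be combined with an ordinary reconstruction loss on the subset of videos that do provide ground-truth binaural audio.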
Related papers
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
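As a rough illustration of the masked auto-encoding idea in this entry, the sketch below hides a random subset of time frames in a two-channel spectrogram and trains a small encoder-decoder to reconstruct them from the visible audio plus per-frame visual features. All module names, shapes, and the frame-level masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

def random_frame_mask(spec, mask_ratio=0.5):
    """Randomly hide whole time frames of a multi-channel spectrogram.

    spec: (B, C, T, F) with C audio channels. Returns the masked input
    and a boolean mask of shape (B, T) marking the hidden frames.
    """
    b, c, t, f = spec.shape
    mask = torch.rand(b, t, device=spec.device) < mask_ratio
    masked = spec.clone()
    masked[mask.unsqueeze(1).unsqueeze(-1).expand_as(spec)] = 0.0
    return masked, mask

class AudioVisualMAE(nn.Module):
    """Hypothetical encoder-decoder that reconstructs masked audio frames
    from the visible audio together with per-frame visual features."""
    def __init__(self, channels=2, freq_bins=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_in = nn.Linear(channels * freq_bins, hidden)
        self.visual_in = nn.Linear(visual_dim, hidden)
        self.decoder = nn.Linear(2 * hidden, channels * freq_bins)

    def forward(self, masked_spec, visual_feat):
        # masked_spec: (B, C, T, F) -> (B, T, C*F); visual_feat: (B, T, visual_dim)
        b, c, t, f = masked_spec.shape
        a = self.audio_in(masked_spec.permute(0, 2, 1, 3).reshape(b, t, c * f))
        v = self.visual_in(visual_feat)
        recon = self.decoder(torch.cat([a, v], dim=-1))
        return recon.view(b, t, c, f).permute(0, 2, 1, 3)

def masked_reconstruction_loss(model, spec, visual_feat):
    masked, mask = random_frame_mask(spec)
    recon = model(masked, visual_feat)
    # Score the reconstruction only on the frames that were hidden.
    m = mask.unsqueeze(1).unsqueeze(-1).expand_as(spec)
    return nn.functional.mse_loss(recon[m], spec[m])
```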
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene
Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
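The coordinate transformation described here can be pictured with a small geometric sketch: given the listener's position, the direction the listener faces, and the sound source position, compute the direction and distance of the source relative to the listener's facing direction, so the learned acoustic field is parameterized around the source rather than in absolute scene coordinates. The NumPy function below is an assumed, simplified version; AV-NeRF's actual module may differ.

```python
import numpy as np

def source_relative_direction(listener_pos, listener_forward, source_pos):
    """Express the sound source direction relative to the listener's view.

    listener_pos, source_pos: (3,) world coordinates.
    listener_forward: (3,) vector, the direction the listener faces.
    Returns (azimuth, elevation, distance) of the source in the listener's
    local frame, a source-centric parameterization an acoustic field could
    take as input.
    """
    rel = source_pos - listener_pos
    distance = np.linalg.norm(rel)
    rel_unit = rel / (distance + 1e-8)

    # Build a simple listener frame: forward (x), left (y), up (z).
    up = np.array([0.0, 0.0, 1.0])
    forward = listener_forward / np.linalg.norm(listener_forward)
    left = np.cross(up, forward)
    left /= np.linalg.norm(left) + 1e-8
    up = np.cross(forward, left)

    # Project the source direction onto the listener frame.
    x, y, z = rel_unit @ forward, rel_unit @ left, rel_unit @ up
    azimuth = np.arctan2(y, x)  # 0 when the source is straight ahead
    elevation = np.arcsin(np.clip(z, -1.0, 1.0))
    return azimuth, elevation, distance
```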
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)