Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video
- URL: http://arxiv.org/abs/2111.10882v1
- Date: Sun, 21 Nov 2021 19:26:45 GMT
- Title: Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video
- Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman
- Abstract summary: We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
- Score: 94.42811508809994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binaural audio provides human listeners with an immersive spatial sound
experience, but most existing videos lack binaural audio recordings. We propose
an audio spatialization method that draws on visual information in videos to
convert their monaural (single-channel) audio to binaural audio. Whereas
existing approaches leverage visual features extracted directly from video
frames, our approach explicitly disentangles the geometric cues present in the
visual stream to guide the learning process. In particular, we develop a
multi-task framework that learns geometry-aware features for binaural audio
generation by accounting for the underlying room impulse response, the visual
stream's coherence with the sound source(s) positions, and the consistency in
geometry of the sounding objects over time. Furthermore, we introduce a new
large video dataset with realistic binaural audio simulated for real-world
scanned environments. On two datasets, we demonstrate the efficacy of our
method, which achieves state-of-the-art results.
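The multi-task framework described in the abstract can be pictured as a shared audio-visual backbone feeding a binaural-generation head plus geometry-aware auxiliary heads whose losses are summed. The sketch below is a minimal illustration under assumed shapes, encoders, and loss weights (MLP encoders, flattened spectrograms, L2 losses); it is not the authors' released implementation.

```python
# Minimal sketch of a geometry-aware multi-task objective for mono-to-binaural
# generation. Encoder choices, head shapes, and loss weights are assumptions
# made for exposition only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryAwareBinauralizer(nn.Module):
    def __init__(self, feat_dim=512, spec_bins=257, spec_frames=64, rir_len=1024):
        super().__init__()
        # Shared encoders (stand-ins for the CNN backbones a real system would use).
        self.visual_encoder = nn.Sequential(nn.Linear(3 * 224 * 224, feat_dim), nn.ReLU())
        self.audio_encoder = nn.Sequential(nn.Linear(spec_bins * spec_frames, feat_dim), nn.ReLU())
        fused = 2 * feat_dim
        # Head 1: left-right difference spectrogram (real + imaginary parts, flattened).
        self.binaural_head = nn.Linear(fused, spec_bins * spec_frames * 2)
        # Head 2: room impulse response, capturing room geometry/acoustics.
        self.rir_head = nn.Linear(fused, rir_len)
        # Head 3: 2-D sound-source position in the frame (audio-visual coherence).
        self.position_head = nn.Linear(fused, 2)

    def forward(self, frames, mono_spec):
        v = self.visual_encoder(frames.flatten(1))
        a = self.audio_encoder(mono_spec.flatten(1))
        z = torch.cat([v, a], dim=1)
        return self.binaural_head(z), self.rir_head(z), self.position_head(z), v

def multitask_loss(model, frames_t, frames_t1, mono_spec,
                   diff_target, rir_target, pos_target,
                   w_rir=0.1, w_pos=0.1, w_geo=0.05):
    """Binaural loss plus three geometry-aware auxiliary losses (targets are flattened)."""
    diff_pred, rir_pred, pos_pred, v_t = model(frames_t, mono_spec)
    _, _, _, v_t1 = model(frames_t1, mono_spec)          # frames from the next time step
    loss_binaural = F.mse_loss(diff_pred, diff_target)   # main task: difference signal
    loss_rir = F.mse_loss(rir_pred, rir_target)          # room impulse response
    loss_pos = F.mse_loss(pos_pred, pos_target)          # source position in the image
    loss_geo = F.mse_loss(v_t, v_t1)                     # temporal geometric consistency
    return loss_binaural + w_rir * loss_rir + w_pos * loss_pos + w_geo * loss_geo
```

In the common difference-signal formulation, the mono input is the sum of the two channels and the target is their difference, so the predicted difference together with the known mono mixture is enough to reconstruct both left and right channels.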
Related papers
- SOAF: Scene Occlusion-aware Neural Acoustic Field [9.651041527067907]
We propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation.
Our approach derives a prior for the sound energy field using distance-aware parametric sound-propagation modelling.
We extract features from the local acoustic field centred around the receiver, sampled on a Fibonacci sphere, to generate audio for novel views (a minimal sampling sketch appears after this list).
arXiv Detail & Related papers (2024-07-02T13:40:56Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship between the audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference (a minimal HRIR-based binauralization sketch appears after this list).
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
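Two of the entries above lean on concrete geometric tools: SOAF samples the local acoustic field on a Fibonacci sphere, and PseudoBinaural renders spatial audio by pairing mono sources with head-related impulse responses (HRIRs). The sketch below illustrates both operations in isolation; the placeholder HRIRs and signal lengths are assumptions for demonstration, not data from either paper.

```python
# Illustrative sketches of Fibonacci-sphere sampling and HRIR-based
# binauralization. The HRIR pair is a random placeholder; a real pipeline
# would load measured responses from an HRIR database.
import numpy as np
from scipy.signal import fftconvolve

def fibonacci_sphere(n_points):
    """Return n_points spread roughly uniformly over the unit sphere."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    i = np.arange(n_points)
    z = 1.0 - 2.0 * (i + 0.5) / n_points        # evenly spaced heights in (-1, 1)
    r = np.sqrt(1.0 - z * z)                    # circle radius at each height
    theta = golden_angle * i
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

def binauralize(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair -> 2-channel signal."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=0)

# Usage with placeholder data: a direction would normally index into an HRIR set.
directions = fibonacci_sphere(642)              # candidate source directions on the sphere
mono = np.random.randn(16000)                   # 1 s of synthetic mono audio at 16 kHz
hrir_l = 0.01 * np.random.randn(256)            # placeholder HRIRs, not measured data
hrir_r = 0.01 * np.random.randn(256)
binaural = binauralize(mono, hrir_l, hrir_r)    # shape (2, 16255)
```

In a recording-free setup along the lines of PseudoBinaural, the chosen source direction selects the HRIR pair, and the convolved two-channel output serves as pseudo ground truth for training a mono-to-binaural model.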