Exploiting Audio-Visual Consistency with Partial Supervision for Spatial
Audio Generation
- URL: http://arxiv.org/abs/2105.00708v1
- Date: Mon, 3 May 2021 09:34:11 GMT
- Title: Exploiting Audio-Visual Consistency with Partial Supervision for Spatial
Audio Generation
- Authors: Yan-Bo Lin and Yu-Chiang Frank Wang
- Abstract summary: We propose an audio spatialization framework to convert a monaural video into a one exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
- Score: 45.526051369551915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans perceive a rich auditory experience through the distinct sounds heard by each ear. Videos recorded with binaural audio in particular simulate how humans receive ambient sound. However, a large number of videos contain only monaural audio, which degrades the user experience due to the lack of ambient information.
To address this issue, we propose an audio spatialization framework to convert
a monaural video into a binaural one by exploiting the relationship between audio
and visual components. By preserving the left-right consistency in both audio
and visual modalities, our learning strategy can be viewed as a self-supervised
learning technique, and alleviates the dependency on a large amount of video
data with ground truth binaural audio data during training. Experiments on
benchmark datasets confirm the effectiveness of our proposed framework in both
semi-supervised and fully supervised scenarios, with ablation studies and
visualizations further supporting the use of our model for audio spatialization.
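To make the left-right consistency idea above concrete: mirroring a video frame horizontally should correspond to swapping the left and right audio channels, which gives a training signal even for clips without ground-truth binaural audio. The sketch below is a minimal, hypothetical PyTorch illustration of such a consistency loss; `MonoToBinauralNet`, its layer sizes, and the input shapes are placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class MonoToBinauralNet(nn.Module):
    """Hypothetical stand-in for an audio spatialization model.

    Takes a mono spectrogram and per-frame visual features, and predicts
    a two-channel (left/right) spectrogram.
    """
    def __init__(self, audio_dim=257, visual_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.visual_proj = nn.Linear(visual_dim, 256)
        self.head = nn.Linear(512, 2 * audio_dim)  # left + right frequency bins

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, T, F); visual_feat: (B, T, visual_dim)
        a = self.audio_proj(mono_spec)
        v = self.visual_proj(visual_feat)
        out = self.head(torch.cat([a, v], dim=-1))
        left, right = out.chunk(2, dim=-1)
        return left, right

def left_right_consistency_loss(model, mono_spec, visual_feat, visual_feat_flipped):
    """Self-supervised loss: a horizontally flipped view should produce
    binaural channels that are swapped relative to the original view."""
    left, right = model(mono_spec, visual_feat)
    left_flip, right_flip = model(mono_spec, visual_feat_flipped)
    # The flipped view's left channel should match the original right channel,
    # and vice versa.
    return (nn.functional.l1_loss(left_flip, right)
            + nn.functional.l1_loss(right_flip, left))
```

In a semi-supervised setting, a term like this would be combined with an ordinary reconstruction loss on the subset of videos that do provide ground-truth binaural audio.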
Related papers
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
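As a rough illustration of the masked auto-encoding idea in this entry, the sketch below hides a random subset of time frames in a two-channel spectrogram and trains a small encoder-decoder to reconstruct them from the visible audio plus per-frame visual features. All module names, shapes, and the frame-level masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

def random_frame_mask(spec, mask_ratio=0.5):
    """Randomly hide whole time frames of a multi-channel spectrogram.

    spec: (B, C, T, F) with C audio channels. Returns the masked input
    and a boolean mask of shape (B, T) marking the hidden frames.
    """
    b, c, t, f = spec.shape
    mask = torch.rand(b, t, device=spec.device) < mask_ratio
    masked = spec.clone()
    masked[mask.unsqueeze(1).unsqueeze(-1).expand_as(spec)] = 0.0
    return masked, mask

class AudioVisualMAE(nn.Module):
    """Hypothetical encoder-decoder that reconstructs masked audio frames
    from the visible audio together with per-frame visual features."""
    def __init__(self, channels=2, freq_bins=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_in = nn.Linear(channels * freq_bins, hidden)
        self.visual_in = nn.Linear(visual_dim, hidden)
        self.decoder = nn.Linear(2 * hidden, channels * freq_bins)

    def forward(self, masked_spec, visual_feat):
        # masked_spec: (B, C, T, F) -> (B, T, C*F); visual_feat: (B, T, visual_dim)
        b, c, t, f = masked_spec.shape
        a = self.audio_in(masked_spec.permute(0, 2, 1, 3).reshape(b, t, c * f))
        v = self.visual_in(visual_feat)
        recon = self.decoder(torch.cat([a, v], dim=-1))
        return recon.view(b, t, c, f).permute(0, 2, 1, 3)

def masked_reconstruction_loss(model, spec, visual_feat):
    masked, mask = random_frame_mask(spec)
    recon = model(masked, visual_feat)
    # Score the reconstruction only on the frames that were hidden.
    m = mask.unsqueeze(1).unsqueeze(-1).expand_as(spec)
    return nn.functional.mse_loss(recon[m], spec[m])
```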
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene
Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
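The coordinate transformation described here can be pictured with a small geometric sketch: given the listener's position, the direction the listener faces, and the sound source position, compute the direction and distance of the source relative to the listener's facing direction, so the learned acoustic field is parameterized around the source rather than in absolute scene coordinates. The NumPy function below is an assumed, simplified version; AV-NeRF's actual module may differ.

```python
import numpy as np

def source_relative_direction(listener_pos, listener_forward, source_pos):
    """Express the sound source direction relative to the listener's view.

    listener_pos, source_pos: (3,) world coordinates.
    listener_forward: (3,) vector, the direction the listener faces.
    Returns (azimuth, elevation, distance) of the source in the listener's
    local frame, a source-centric parameterization an acoustic field could
    take as input.
    """
    rel = source_pos - listener_pos
    distance = np.linalg.norm(rel)
    rel_unit = rel / (distance + 1e-8)

    # Build a simple listener frame: forward (x), left (y), up (z).
    up = np.array([0.0, 0.0, 1.0])
    forward = listener_forward / np.linalg.norm(listener_forward)
    left = np.cross(up, forward)
    left /= np.linalg.norm(left) + 1e-8
    up = np.cross(forward, left)

    # Project the source direction onto the listener frame.
    x, y, z = rel_unit @ forward, rel_unit @ left, rel_unit @ up
    azimuth = np.arctan2(y, x)  # 0 when the source is straight ahead
    elevation = np.arcsin(np.clip(z, -1.0, 1.0))
    return azimuth, elevation, distance
```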
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)