Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation
- URL: http://arxiv.org/abs/2311.07630v1
- Date: Mon, 13 Nov 2023 09:53:14 GMT
- Title: Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation
- Authors: Zhaojian Li, Bin Zhao and Yuan Yuan
- Abstract summary: We propose a visually guided generative adversarial approach for generating stereo audio from mono audio.
A metric to measure the spatial perception of audio is proposed for the first time.
The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics.
- Score: 18.607236792587614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binaural stereo audio is recorded by imitating the way the human ear receives
sound, which provides people with an immersive listening experience. Existing
approaches leverage autoencoders and directly exploit visual spatial
information to synthesize binaural stereo, resulting in a limited
representation of visual guidance. For the first time, we propose a visually
guided generative adversarial approach for generating binaural stereo audio
from mono audio. Specifically, we develop a Stereo Audio Generation Model
(SAGM), which utilizes shared spatio-temporal visual information to separately
guide the generator and the discriminator. The shared visual information is
updated alternately during the generative adversarial stage, allowing the
generator and discriminator to deliver their respective guidance knowledge
while sharing the visual representation. The proposed method learns bidirectional
complementary visual information, which facilitates the expression of visual
guidance in generation. In addition, spatial perception is a crucial attribute
of binaural stereo audio, and thus the evaluation of stereo spatial perception
is essential. However, previous metrics failed to measure the spatial
perception of audio. To this end, a metric to measure the spatial perception of
audio is proposed for the first time. The proposed metric is capable of
measuring the magnitude and direction of spatial perception in the temporal
dimension. Further, given its function, it can to some extent substitute for
demanding user studies. The proposed method achieves state-of-the-art
performance on two datasets and five evaluation metrics.
Qualitative experiments and user studies demonstrate that the method generates
space-realistic stereo audio.
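The abstract does not give the metric's formula. As a rough illustration of what a time-resolved spatial-perception measure could look like, the hypothetical sketch below computes a per-frame interaural level difference (ILD) on a stereo waveform: the sign indicates direction (left vs. right) and the magnitude indicates the strength of the spatial cue. The function name `framewise_ild` and its parameters are assumptions made for illustration; this is not the metric proposed in the paper.

```python
import numpy as np

def framewise_ild(left, right, frame_len=1024, hop=512, eps=1e-8):
    """Per-frame interaural level difference (dB) of a stereo signal.

    Positive values mean the left channel is louder (source perceived to
    the left); negative values mean the right channel is louder. This is
    only an illustrative proxy for temporal spatial perception, not the
    metric proposed in the paper.
    """
    n_frames = 1 + max(0, len(left) - frame_len) // hop
    ild = np.empty(n_frames)
    for i in range(n_frames):
        seg_l = left[i * hop: i * hop + frame_len]
        seg_r = right[i * hop: i * hop + frame_len]
        rms_l = np.sqrt(np.mean(seg_l ** 2) + eps)
        rms_r = np.sqrt(np.mean(seg_r ** 2) + eps)
        ild[i] = 20.0 * np.log10(rms_l / rms_r)
    return ild

# Example: a source panned to the right yields mostly negative ILD values.
t = np.linspace(0, 1, 16000, endpoint=False)
mono = np.sin(2 * np.pi * 440 * t)
stereo_left, stereo_right = 0.3 * mono, 1.0 * mono
print(framewise_ild(stereo_left, stereo_right)[:5])
```

Comparing such per-frame curves between generated and ground-truth stereo would expose both a magnitude and a direction of spatial error over time, which is the general capability the abstract claims for its proposed metric.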
Related papers
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources.
By leveraging spatial guidance, our unified model achieves the objective of generating immersive and controllable spatial audio from text and image.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse responses (HRIR) to identify the relationship between spatial locations and the received audio (see the illustrative sketch after this list).
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
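For context on the PseudoBinaural-style pipeline listed above, the hypothetical sketch below shows only the basic idea of spatializing mono audio with head-related impulse responses: convolving the mono signal with a left and a right HRIR for a chosen direction yields a pseudo-binaural pair. The HRIR arrays and the `spatialize_mono` helper are placeholders assumed for illustration (real HRIRs come from a measured database such as CIPIC), and the spherical harmonic decomposition used in the actual paper is not shown.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_mono(mono, hrir_left, hrir_right):
    """Render a mono signal as pseudo-binaural audio for one fixed
    source direction by convolving it with the left/right HRIRs.
    """
    left = fftconvolve(mono, hrir_left, mode="full")[: len(mono)]
    right = fftconvolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right], axis=0)

# Placeholder HRIRs: a shorter delay and larger gain on the right channel
# crudely mimic a source located to the listener's right.
sr = 16000
mono = np.random.randn(sr).astype(np.float32)
hrir_left = np.zeros(128, dtype=np.float32); hrir_left[20] = 0.4
hrir_right = np.zeros(128, dtype=np.float32); hrir_right[5] = 0.9
binaural = spatialize_mono(mono, hrir_left, hrir_right)
print(binaural.shape)  # (2, 16000)
```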