Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating
Source Separation
- URL: http://arxiv.org/abs/2007.09902v1
- Date: Mon, 20 Jul 2020 06:20:26 GMT
- Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
- Abstract summary: We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
- Score: 96.18178553315472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stereophonic audio is an indispensable ingredient for enhancing the
human auditory experience. Recent research has explored using visual
information as guidance to generate binaural or ambisonic audio from mono
audio with stereo supervision. However, this fully supervised paradigm suffers
from an inherent drawback: recording stereophonic audio usually requires
delicate devices that are too expensive for wide accessibility. To overcome
this challenge, we propose to leverage the vastly available mono data to
facilitate the generation of stereophonic audio. Our key observation is that
the task of visually indicated audio separation also maps independent audio
sources to their corresponding visual positions, which shares a similar
objective with stereophonic audio generation. We integrate both stereo
generation and source separation into a unified framework, Sep-Stereo, by
considering source separation as a particular type of audio spatialization.
Specifically, a novel associative pyramid network architecture is carefully
designed for audio-visual feature fusion. Extensive experiments demonstrate
that our framework improves stereophonic audio generation while performing
accurate sound separation with a shared backbone.
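The shared objective between stereo generation and separation can be illustrated with a toy NumPy sketch. This is a hypothetical mask-based view (oracle masks, made-up spectrogram shapes), not the authors' implementation: both tasks apply two predicted masks to a mono mixture spectrogram, yielding either left/right channels or two separated sources "panned" fully to one side each.

```python
import numpy as np

def spatialize(mix_mag, left_mask, right_mask):
    """Apply two ratio masks to a mono mixture spectrogram.

    For stereo generation, the masks recover the left/right channels;
    for separation, each 'channel' is instead one isolated source.
    """
    return left_mask * mix_mag, right_mask * mix_mag

# Toy mixture: two sources occupying disjoint frequency bands.
freqs, frames = 4, 3
src_a = np.zeros((freqs, frames)); src_a[:2] = 1.0   # low-band source
src_b = np.zeros((freqs, frames)); src_b[2:] = 2.0   # high-band source
mix = src_a + src_b

# Oracle ratio masks stand in for the masks a real model would
# predict from fused audio-visual features.
mask_a = np.divide(src_a, mix, out=np.zeros_like(mix), where=mix > 0)
mask_b = np.divide(src_b, mix, out=np.zeros_like(mix), where=mix > 0)

left, right = spatialize(mix, mask_a, mask_b)
assert np.allclose(left, src_a) and np.allclose(right, src_b)
```

With oracle masks the two outputs recover the two sources exactly; the point of the shared backbone is that the same masking head can serve both tasks.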
Related papers
- Cross-modal Generative Model for Visual-Guided Binaural Stereo
Generation [18.607236792587614]
We propose a visually guided generative adversarial approach for generating stereo audio from mono audio.
A metric to measure the spatial perception of audio is proposed for the first time.
The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics.
arXiv Detail & Related papers (2023-11-13T09:53:14Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention [19.41528806102547]
Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality.
Recording binaural audio requires a specialized setup with a dummy human head that has microphones in the left and right ears.
Recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene.
We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio.
arXiv Detail & Related papers (2021-11-15T19:07:39Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework that converts a monaural video into a binaural one by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance in subjective preference tests.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
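The HRIR idea underlying PseudoBinaural can be sketched in a few lines of NumPy: a mono signal is convolved with a left-ear and a right-ear impulse response to place it at a spatial position. The toy HRIRs below (a simple delay plus attenuation for the far ear) are stand-ins; real pipelines use measured HRIR datasets.

```python
import numpy as np

# Toy HRIR pair for a source on the listener's left: the left ear
# hears the sound earlier and louder than the right ear.
hrir_left = np.zeros(8);  hrir_left[0] = 1.0    # no delay, full gain
hrir_right = np.zeros(8); hrir_right[3] = 0.5   # ~3-sample delay, half gain

mono = np.random.default_rng(0).standard_normal(64)

# Binaural rendering: convolve the mono signal with each ear's HRIR.
left = np.convolve(mono, hrir_left)
right = np.convolve(mono, hrir_right)

# The right channel is a delayed, attenuated copy of the left.
assert np.allclose(right[3:3 + len(mono)], 0.5 * left[:len(mono)])
```

Summing position-rendered sources in this way is what lets such a pipeline synthesize pseudo binaural pairs from mono recordings without any binaural supervision.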
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.