Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention
- URL: http://arxiv.org/abs/2111.08046v1
- Date: Mon, 15 Nov 2021 19:07:39 GMT
- Title: Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention
- Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
- Abstract summary: Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality.
Recording binaural audio requires a specialized setup with a dummy human head that has microphones in the left and right ears.
Recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene.
We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio.
- Score: 19.41528806102547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Binaural audio gives the listener an immersive experience and can enhance
augmented and virtual reality. However, recording binaural audio requires a
specialized setup with a dummy human head that has microphones in the left and
right ears. Such a recording setup is difficult to build and set up, so mono
audio has become the preferred choice in common devices. To obtain the same
impact as binaural audio, recent efforts have been directed towards lifting
mono audio to binaural audio conditioned on the visual input from the scene.
Such approaches have not used an important cue for the task: the distance of
the different sound-producing objects from the microphones. In this work, we
argue that the depth map of the scene can act as a proxy for inducing distance
information about the different objects in the scene, for the task of audio
binauralization. We propose a novel encoder-decoder architecture with a
hierarchical attention mechanism to encode image, depth and audio features
jointly. We design the network on top of state-of-the-art transformer networks
for image and depth representation. We show empirically that the proposed
method comfortably outperforms state-of-the-art methods on two challenging
public datasets, FAIR-Play and MUSIC-Stereo. We also demonstrate with
qualitative results that the method is able to focus on the right information
required for the task. The project details are available at
https://krantiparida.github.io/projects/bmonobinaural.html
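
To make the high-level description concrete, below is a minimal PyTorch sketch of this style of architecture. It is not the authors' code: the module names, feature dimensions, the attention ordering (audio attends to image tokens, then to depth tokens), and the difference-signal mask decoder are all assumptions for illustration; the paper's actual backbones are pretrained transformers for image and depth representation.

```python
# A minimal sketch of the cross-modal fusion the abstract describes, NOT the
# authors' implementation. Module names, dimensions, and the two-stage
# "hierarchical" attention are assumptions; random tensors stand in for
# real transformer (e.g. ViT-style) backbone features.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Audio tokens attend to visual (image or depth) tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        fused, _ = self.attn(query=audio_tokens,
                             key=visual_tokens,
                             value=visual_tokens)
        return self.norm(audio_tokens + fused)  # residual + layer norm

class MonoToBinauralSketch(nn.Module):
    def __init__(self, dim: int = 256, n_freq: int = 257):
        super().__init__()
        # Stand-ins for the image/depth transformer backbones.
        self.image_proj = nn.Linear(768, dim)
        self.depth_proj = nn.Linear(768, dim)
        self.audio_enc = nn.Linear(2 * n_freq, dim)  # real+imag STFT frames
        # Assumed hierarchy: attend to image tokens first, then depth tokens.
        self.attend_image = CrossModalBlock(dim)
        self.attend_depth = CrossModalBlock(dim)
        # Decoder predicts a complex mask for the left-right *difference*
        # signal, a common mono-to-binaural formulation (e.g. 2.5D Visual Sound).
        self.decoder = nn.Linear(dim, 2 * n_freq)

    def forward(self, mono_spec, image_feats, depth_feats):
        # mono_spec: (B, T, 2*n_freq) -- STFT frames with real/imag stacked.
        a = self.audio_enc(mono_spec)
        a = self.attend_image(a, self.image_proj(image_feats))
        a = self.attend_depth(a, self.depth_proj(depth_feats))
        diff = self.decoder(a) * mono_spec        # masked difference spectrogram
        left, right = (mono_spec + diff) / 2, (mono_spec - diff) / 2
        return left, right

# Shape check with random tensors standing in for backbone outputs.
model = MonoToBinauralSketch()
spec = torch.randn(1, 64, 2 * 257)   # 64 STFT frames
img = torch.randn(1, 196, 768)       # 14x14 patch tokens
dep = torch.randn(1, 196, 768)
left, right = model(spec, img, dep)
print(left.shape, right.shape)       # torch.Size([1, 64, 514]) each
```

Predicting a mask for the left-right difference signal rather than the two channels directly is a common design choice in mono-to-binaural work, since the mono input already carries the channels' shared content.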
Related papers
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention [17.274928172342978]
We propose a novel encoder-decoder architecture, where we use a hierarchical attention mechanism to encode the image and depth features extracted from individual transformer backbones.
We show that adding depth features along with image features improves the performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-08-10T20:26:44Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference (a toy HRIR illustration follows this list).
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
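
As a side note on the PseudoBinaural entry above, here is a toy NumPy illustration of how an HRIR-style impulse-response pair spatializes mono audio. The hand-made delay/attenuation responses and the `spatialize` helper are purely hypothetical stand-ins; the actual paper combines spherical-harmonic decomposition with measured HRIRs rather than these crude impulses.

```python
# Toy illustration of the HRIR idea, NOT the PseudoBinaural pipeline:
# a single delayed, attenuated impulse stands in for a measured HRIR,
# just to show how per-ear convolution turns mono into two channels.
import numpy as np

def toy_hrir(delay_samples: int, gain: float, length: int = 64) -> np.ndarray:
    """A crude impulse response: one delayed, attenuated impulse."""
    h = np.zeros(length)
    h[delay_samples] = gain
    return h

def spatialize(mono: np.ndarray, az_right: float, sr: int = 16000) -> np.ndarray:
    """Render mono audio at azimuth az_right in [-1, 1] (-1=left, 1=right)."""
    max_itd = int(0.0007 * sr)              # ~0.7 ms max interaural time difference
    delay = int(max_itd * abs(az_right))
    near = toy_hrir(0, 1.0)                 # ear facing the source
    far = toy_hrir(delay, 0.6)              # far ear: later and quieter
    h_left, h_right = (near, far) if az_right < 0 else (far, near)
    left = np.convolve(mono, h_left)[: len(mono)]
    right = np.convolve(mono, h_right)[: len(mono)]
    return np.stack([left, right])

mono = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
binaural = spatialize(mono, az_right=0.8)   # source well to the right
print(binaural.shape)                       # (2, 16000)
```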
This list is automatically generated from the titles and abstracts of the papers on this site.