Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention
- URL: http://arxiv.org/abs/2108.04906v1
- Date: Tue, 10 Aug 2021 20:26:44 GMT
- Title: Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention
- Authors: Kranti Kumar Parida, Siddharth Srivastava, Neeraj Matiyali, Gaurav Sharma
- Abstract summary: We propose a novel encoder-decoder architecture, where we use a hierarchical attention mechanism to encode the image and depth features extracted from individual transformer backbones.
We show that adding depth features along with image features improves the performance both qualitatively and quantitatively.
- Score: 17.274928172342978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Binaural audio gives the listener the feeling of being at the
recording location and enhances the immersive experience when coupled with
AR/VR. The problem with recording binaural audio is that it requires a
specialized setup which, unlike the single microphone needed for traditional
mono audio, cannot be fitted into handheld devices. To overcome this
drawback, prior works have tried to uplift mono recorded audio to binaural
audio as a post-processing step conditioned on the visual input. However,
all prior approaches miss a crucial piece of information required for the
task: the distance of the different sound-producing objects from the
recording setup. In this work, we argue that the depth map of the scene can
act as a proxy for encoding the distance information of objects in the
scene, and we show that adding depth features along with image features
improves the performance both qualitatively and quantitatively. We propose
a novel encoder-decoder architecture in which a hierarchical attention
mechanism combines the image and depth features, each extracted from its
own transformer backbone, with the audio features at each layer of the
decoder.
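Since the abstract only outlines the architecture, here is a minimal PyTorch
sketch of the general idea: image and depth tokens come from separate
transformer backbones, a first attention level fuses them, and a second level
lets the audio features attend to the fused visual tokens at every decoder
layer. This is not the authors' implementation; all module names, dimensions,
token counts, and the residual/fusion details are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio tokens (queries) attend over visual tokens (keys/values)."""
    def __init__(self, audio_dim, visual_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=audio_dim, kdim=visual_dim, vdim=visual_dim,
            num_heads=n_heads, batch_first=True)

    def forward(self, audio_tokens, visual_tokens):
        fused, _ = self.attn(audio_tokens, visual_tokens, visual_tokens)
        return audio_tokens + fused  # residual connection (assumed)

class DepthInfusedDecoder(nn.Module):
    """Two-level hierarchy: (1) image tokens attend over depth tokens;
    (2) the fused visual tokens condition every audio decoder layer."""
    def __init__(self, audio_dim=256, visual_dim=768, n_layers=4):
        super().__init__()
        self.img_depth_attn = nn.MultiheadAttention(
            embed_dim=visual_dim, num_heads=4, batch_first=True)
        self.cross = nn.ModuleList(
            [CrossModalAttention(audio_dim, visual_dim)
             for _ in range(n_layers)])
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(audio_dim, audio_dim), nn.GELU())
             for _ in range(n_layers)])
        self.head = nn.Linear(audio_dim, audio_dim)  # placeholder output head

    def forward(self, audio_tokens, img_tokens, depth_tokens):
        # Hierarchy level 1: infuse depth cues into the image tokens.
        visual, _ = self.img_depth_attn(img_tokens, depth_tokens, depth_tokens)
        x = audio_tokens
        for cross, block in zip(self.cross, self.blocks):
            x = cross(x, visual)  # level 2: per-layer visual conditioning
            x = x + block(x)
        return self.head(x)

# Dummy shapes: 196 image/depth patch tokens (e.g. from ViT backbones) and
# 64 audio tokens taken from the mono spectrogram.
img, depth = torch.randn(1, 196, 768), torch.randn(1, 196, 768)
audio = torch.randn(1, 64, 256)
print(DepthInfusedDecoder()(audio, img, depth).shape)  # torch.Size([1, 64, 256])
```

In this line of work the decoder commonly predicts a difference signal such as
(L - R)/2 in the spectrogram domain, from which the two channels are recovered
as L = mono + diff and R = mono - diff; whether this paper uses exactly that
target is not stated in the abstract.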
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this research, we detect and locate the active speaker in the video.
We find that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
Accurately recognizing ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention [19.41528806102547]
Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality.
Recording binaural audio requires a specialized setup with a dummy human head that has microphones in the left and right ears.
Recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene.
We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio.
arXiv Detail & Related papers (2021-11-15T19:07:39Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse responses (HRIRs) to identify the relationship between spatial locations and the received audio (see the HRIR rendering sketch after this list).
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
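As referenced in the PseudoBinaural entry above, the core trick behind
generating binaural audio without binaural recordings is to convolve a mono
source with direction-dependent head-related impulse responses (HRIRs). The
sketch below is a rough, self-contained illustration, not the PseudoBinaural
pipeline: real systems use measured HRIR sets (and, in PseudoBinaural, a
spherical-harmonic representation), whereas the toy HRIRs here only mimic
interaural time and level differences, and all constants are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

SR = 16000               # sample rate (Hz)
HEAD_RADIUS = 0.0875     # metres, a rough average head radius
SPEED_OF_SOUND = 343.0   # m/s

def toy_hrir(azimuth_rad, ear, sr=SR, length=64):
    """Crude stand-in for a measured HRIR: the ear facing away from the
    source receives the signal slightly later and slightly attenuated."""
    s = np.sin(azimuth_rad)               # > 0 means source to the right
    facing = (ear == "right") == (s > 0)
    delay_s = 0.0 if facing else 2 * HEAD_RADIUS * abs(s) / SPEED_OF_SOUND
    gain = 1.0 if facing else 1.0 - 0.3 * abs(s)  # toy level difference
    h = np.zeros(length)
    h[int(round(delay_s * sr))] = gain
    return h

def render_binaural(mono, azimuth_rad):
    """Convolve a mono signal with per-ear HRIRs for a given direction."""
    return np.stack([fftconvolve(mono, toy_hrir(azimuth_rad, ear))
                     for ear in ("left", "right")])

# Example: a 440 Hz tone placed 60 degrees to the listener's right; the
# right channel leads the left by a few samples and is slightly louder.
t = np.arange(SR) / SR
stereo = render_binaural(np.sin(2 * np.pi * 440 * t), np.deg2rad(60))
print(stereo.shape)  # (2, 16063)
```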
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.