AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
- URL: http://arxiv.org/abs/2302.02088v3
- Date: Mon, 16 Oct 2023 15:11:01 GMT
- Title: AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
- Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
- Abstract summary: We study a new task -- real-world audio-visual scene synthesis -- and propose a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
- Score: 61.07542274267568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can machines recording an audio-visual scene produce realistic, matching
audio-visual experiences at novel positions and novel view directions? We
answer this question by studying a new task -- real-world audio-visual scene synthesis --
and a first-of-its-kind NeRF-based approach for multimodal learning.
Concretely, given a video recording of an audio-visual scene, the task is to
synthesize new videos with spatial audio along arbitrary novel camera
trajectories in that scene. We propose an acoustic-aware audio generation
module that integrates prior knowledge of audio propagation into NeRF, in which
we implicitly associate audio generation with the 3D geometry and material
properties of a visual environment. Furthermore, we present a coordinate
transformation module that expresses a view direction relative to the sound
source, enabling the model to learn sound source-centric acoustic fields. To
facilitate the study of this new task, we collect a high-quality Real-World
Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method
on this real-world dataset and the simulation-based SoundSpaces dataset.
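To make the two modules described above concrete, the sketch below shows one way to express a listener pose in sound source-centric coordinates and combine it with a distance-aware attenuation prior to predict per-frequency gains for a mono recording. It is a minimal illustration under simplifying assumptions (a single known source position, a magnitude-mask renderer, illustrative class and tensor names), not the authors' implementation, which additionally ties audio generation to the 3D geometry and material properties captured by the visual NeRF.

```python
import torch
import torch.nn as nn


class SourceCentricAudioField(nn.Module):
    """Illustrative sketch (not the authors' code): map a listener pose,
    expressed relative to a known sound-source position, to left/right
    per-frequency gains for a mono magnitude spectrum."""

    def __init__(self, n_freq_bins: int = 257, hidden: int = 128):
        super().__init__()
        # Inputs: source-centric direction (3) + log distance (1) + view direction (3).
        self.mlp = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq_bins),  # left and right magnitude masks
        )

    def forward(self, listener_pos, view_dir, source_pos, mono_mag):
        # Coordinate transformation: express the listener relative to the source,
        # so the network learns a sound source-centric acoustic field.
        offset = listener_pos - source_pos                        # (B, 3)
        dist = offset.norm(dim=-1, keepdim=True).clamp(min=1e-3)  # (B, 1)
        rel_dir = offset / dist                                   # unit direction, source-centric
        feats = torch.cat([rel_dir, dist.log(), view_dir], dim=-1)
        masks = torch.sigmoid(self.mlp(feats))                    # (B, 2 * F)
        left, right = masks.chunk(2, dim=-1)
        # Acoustic propagation prior: inverse-distance attenuation of the dry signal.
        gain = 1.0 / dist
        return gain * left * mono_mag, gain * right * mono_mag    # (B, F) each
```

A magnitude mask per channel is only one possible renderer; the key point illustrated is that all spatial inputs are expressed relative to the sound source before being fed to the acoustic field.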
Related papers
- SOAF: Scene Occlusion-aware Neural Acoustic Field [9.651041527067907]
We propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation.
Our approach derives a prior for the sound-energy field using distance-aware parametric sound-propagation modelling.
We extract features from the local acoustic field centred around the receiver, sampled on a Fibonacci sphere, to generate audio for novel views.
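The Fibonacci-sphere sampling mentioned above has a standard closed-form construction (the golden-angle spiral). The snippet below is a generic version of it; the point count, sampling radius, and field query are assumptions for illustration, not SOAF's actual interface.

```python
import numpy as np


def fibonacci_sphere(n_points: int = 256) -> np.ndarray:
    """Generic golden-angle construction of n_points near-uniform unit vectors.
    SOAF samples the local acoustic field around the receiver with such
    directions; the count and radius used here are illustrative."""
    i = np.arange(n_points)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))   # ~2.39996 rad
    z = 1.0 - 2.0 * (i + 0.5) / n_points          # uniform heights in (-1, 1)
    radius = np.sqrt(1.0 - z * z)                 # ring radius at each height
    theta = golden_angle * i
    return np.stack([radius * np.cos(theta),
                     radius * np.sin(theta),
                     z], axis=-1)                 # (n_points, 3), all unit norm


# Example: query points on a small sphere of radius r around receiver position p.
p, r = np.zeros(3), 0.5
sample_points = p + r * fibonacci_sphere(256)
```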
arXiv Detail & Related papers (2024-07-02T13:40:56Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
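For intuition, acoustic matching can be viewed as convolving the dry source audio with a room impulse response (RIR) consistent with the target image; the snippet below shows only that target operation. The cited work learns the re-synthesis directly from the image and waveform, and whether it uses an explicit RIR is not stated in the summary, so this is an illustration of the goal under that assumption, not their method.

```python
import numpy as np


def apply_room_acoustics(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Reference operation behind acoustic matching: convolve dry audio with a
    room impulse response. Producing an RIR-equivalent transformation from the
    target image is exactly what the learned model has to provide."""
    wet = np.convolve(dry_audio, rir)[: len(dry_audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # normalize to avoid clipping
```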
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
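A common formulation in this line of work predicts a mask for the left-minus-right difference signal from the mono spectrogram, conditioned on visual features, and then recovers both channels. The sketch below follows that generic recipe; the layer sizes, feature dimensions, and single-frame interface are assumptions, and it omits the paper's multi-task, geometry-disentangling design.

```python
import torch
import torch.nn as nn


class Mono2BinauralHead(nn.Module):
    """Generic mask-based mono-to-binaural head (illustrative, not the paper's
    exact model). A visual encoder (not shown) supplies a conditioning vector;
    the network predicts a complex mask for the left-minus-right difference
    spectrum, from which both channels are recovered."""

    def __init__(self, n_freq: int = 257, visual_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq + visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq),  # real and imaginary parts of the mask
        )

    def forward(self, mono_spec: torch.Tensor, visual_feat: torch.Tensor):
        # mono_spec: (B, F) complex STFT frame; visual_feat: (B, visual_dim)
        x = torch.cat([mono_spec.real, mono_spec.imag, visual_feat], dim=-1)
        m_re, m_im = self.net(x).chunk(2, dim=-1)
        diff = torch.complex(m_re, m_im) * mono_spec  # predicted left - right spectrum
        left = (mono_spec + diff) / 2
        right = (mono_spec - diff) / 2
        return left, right
```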
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship between its audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.