SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
- URL: http://arxiv.org/abs/2504.05576v1
- Date: Tue, 08 Apr 2025 00:22:16 GMT
- Title: SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
- Authors: Mingfei Chen, Israel D. Gebru, Ishwarya Ananthabhotla, Christian Richardt, Dejan Markovic, Jake Sandakly, Steven Krenn, Todd Keebler, Eli Shlizerman, Alexander Richard,
- Abstract summary: We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints.<n>Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint.
- Score: 51.311553815466446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint. The method learns the underlying acoustic transfer function that relates the signals acquired at the distributed microphones to the signal at the target viewpoint, using a limited number of known recordings. Unlike existing works, our method does not require constraints or prior knowledge of sound source details. Moreover, our method efficiently adapts to diverse room layouts, reference microphone configurations and unseen environments. To enable this, we introduce a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. We first leverage these embeddings to optimize the placement of reference microphones in any given scene. During synthesis, we leverage multiple embeddings extracted from reference locations to get adaptive weights for their contribution, conditioned on target viewpoint. We benchmark the task on both publicly available data and real-world settings. We demonstrate significant improvements over existing methods.
Related papers
- SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera [61.642416712939095]
SoundLoc3D treats the task as a set prediction problem, each element in the set corresponds to a potential sound source.
We demonstrate the efficiency and superiority of SoundLoc3D on large-scale simulated dataset.
arXiv Detail & Related papers (2024-12-22T05:04:17Z) - SOAF: Scene Occlusion-aware Neural Acoustic Field [9.651041527067907]
We propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation.<n>Our approach derives a global prior for the sound field using distance-aware parametric sound-propagation modeling.<n>We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate audio for novel views.
arXiv Detail & Related papers (2024-07-02T13:40:56Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
view acoustic synthesis aims to render audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene.<n>Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.<n>We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.<n>Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene
Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.