Differentiable Room Acoustic Rendering with Multi-View Vision Priors
- URL: http://arxiv.org/abs/2504.21847v2
- Date: Sat, 16 Aug 2025 21:26:06 GMT
- Title: Differentiable Room Acoustic Rendering with Multi-View Vision Priors
- Authors: Derong Jin, Ruohan Gao
- Abstract summary: We introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate.
- Score: 12.30408352143278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
Related papers
- Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound [5.591620304505415]
This work presents the first formal framework for Audio-Visual World Models (AVWM). It formulates multimodal environment simulation as a partially observable decision process with audio-visual observations, fine-grained actions, and task rewards. We propose an Audio-Visual Conditional Transformer with a novel modality expert architecture that balances visual and auditory learning.
arXiv Detail & Related papers (2025-11-30T13:11:56Z)
- Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering [5.528622598353137]
We introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR).
arXiv Detail & Related papers (2025-11-14T23:13:02Z)
- Resounding Acoustic Fields with Reciprocity [13.126858950459557]
We introduce Versa, a physics-inspired approach to facilitating acoustic field learning. Our method creates physically valid samples with dense virtual emitter positions by exchanging emitter and listener poses. Results show that Versa substantially improves the performance of acoustic field learning on both simulated and real-world datasets.
arXiv Detail & Related papers (2025-10-23T14:30:09Z)
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models [13.63552417613795]
We propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations.
arXiv Detail & Related papers (2025-06-06T21:06:35Z)
- Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audiovisual learning (SCAV), which contrasts examples based on their non-aggregated representation space. Experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV. We also show that models trained with SCAV exhibit a significant degree of flexibility regarding the metric employed for retrieval.
arXiv Detail & Related papers (2024-07-08T09:45:20Z)
- SOAF: Scene Occlusion-aware Neural Acoustic Field [9.651041527067907]
We propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a global prior for the sound field using distance-aware parametric sound-propagation modeling. We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate audio for novel views.
arXiv Detail & Related papers (2024-07-02T13:40:56Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio. We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment. Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Hearing Anything Anywhere [26.415266601469767]
We introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene.
This allows us to synthesize novel auditory experiences through the space with any source audio.
We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations.
arXiv Detail & Related papers (2024-06-11T17:56:14Z)
- Multi-Level Neural Scene Graphs for Dynamic Urban Environments [64.26401304233843]
We present a novel, decomposable radiance field approach for dynamic urban environments.
We propose a multi-level neural scene graph representation that scales to thousands of images from dozens of sequences with hundreds of fast-moving objects.
arXiv Detail & Related papers (2024-03-29T21:52:01Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z)
- Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy [26.975596225131824]
We propose a joint training framework, using the acoustic features and raw images directly as inputs for the AVSC task.
Specifically, we retrieve the bottom layers of pre-trained image models as visual encoder, and jointly optimize the scene classifier and 1D-CNN based acoustic encoder during training.
arXiv Detail & Related papers (2022-04-25T03:37:02Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)