Related papers: Hearing Anywhere in Any Environment

URL: http://arxiv.org/abs/2504.10746v2
Date: Wed, 04 Jun 2025 19:59:42 GMT
Title: Hearing Anywhere in Any Environment
Authors: Xiulong Liu, Anurag Kumar, Paul Calamia, Sebastia V. Amengual, Calvin Murdock, Ishwarya Ananthabhotla, Philip Robinson, Eli Shlizerman, Vamsi Krishna Ithapu, Ruohan Gao
Abstract summary: We present xRIR, a framework for cross-room Room Impulse Response (RIR) prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
Score: 33.566252963174556
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.

Related papers

Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
ROGR: Relightable 3D Objects using Generative Relighting [71.35020300131261]
We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views. We train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object's appearance under any input environmental lighting. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, and showcase our approach on real-world object captures.
arXiv Detail & Related papers (2025-10-03T16:35:22Z)
Remote Sensing-Oriented World Model [14.021235530589246]
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. Existing approaches are predominantly evaluated in synthetic environments or constrained scene settings. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing.
arXiv Detail & Related papers (2025-09-22T14:02:39Z)
Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation [17.013738637228553]
We present Mesh-infused Neural Acoustic Field (MiNAF), which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the neural network in generating more accurate RIR predictions.
arXiv Detail & Related papers (2025-09-18T17:57:07Z)
AV-Surf: Surface-Enhanced Geometry-Aware Novel-View Acoustic Synthesis [4.751910547396398]
Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). We propose a surface-enhanced geometry-aware approach for NVAS to improve spatial acoustic modeling. We introduce a dual cross-attention-based transformer integrating geometrical constraints into frequency query to understand the surroundings of the emitter.
arXiv Detail & Related papers (2025-03-17T04:22:53Z)
Hearing Anything Anywhere [26.415266601469767]
We introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene. This allows us to synthesize novel auditory experiences through the space with any source audio. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations.
arXiv Detail & Related papers (2024-06-11T17:56:14Z)
ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling [57.1025908604556]
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment. We introduce ActiveRIR, a reinforcement learning policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions.
arXiv Detail & Related papers (2024-04-24T21:30:01Z)
RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications [55.24463002889]
We focus on depth data synthesis and develop a range-aware RGB-D data simulation pipeline (RaSim). In particular, high-fidelity depth data is generated by imitating the imaging principle of real-world sensors. RaSim can be directly applied to real-world scenarios without any finetuning and excels at downstream RGB-D perception tasks.
arXiv Detail & Related papers (2024-04-05T08:52:32Z)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities. RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation. We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
DARF: Depth-Aware Generalizable Neural Radiance Field [51.29437249009986]
We propose the Depth-Aware Generalizable Neural Radiance Field (DARF) with a Depth-Aware Dynamic Sampling (DADS) strategy. Our framework infers the unseen scenes on both pixel level and geometry level with only a few input images. Compared with state-of-the-art generalizable NeRF methods, DARF reduces samples by 50%, while improving rendering quality and depth estimation.
arXiv Detail & Related papers (2022-12-05T14:00:59Z)
Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener. We explore how to infer RIRs based on a sparse set of images and echoes observed in the space. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z)
IllumiNet: Transferring Illumination from Planar Surfaces to Virtual Objects in Augmented Reality [38.83696624634213]
This paper presents a learning-based illumination estimation method for virtual objects in real environments. Given a single RGB image, our method directly infers the relit virtual object by transferring the illumination features extracted from planar surfaces in the scene to the desired geometries.
arXiv Detail & Related papers (2020-07-12T13:11:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.