Few-Shot Audio-Visual Learning of Environment Acoustics
- URL: http://arxiv.org/abs/2206.04006v1
- Date: Wed, 8 Jun 2022 16:38:24 GMT
- Title: Few-Shot Audio-Visual Learning of Environment Acoustics
- Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
- Abstract summary: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
- Score: 89.16560042178523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Room impulse response (RIR) functions capture how the surrounding physical
environment transforms the sounds heard by a listener, with implications for
various applications in AR, VR, and robotics. Whereas traditional methods to
estimate RIRs assume dense geometry and/or sound measurements throughout the
environment, we explore how to infer RIRs based on a sparse set of images and
echoes observed in the space. Towards that goal, we introduce a
transformer-based method that uses self-attention to build a rich acoustic
context, then predicts RIRs of arbitrary query source-receiver locations
through cross-attention. Additionally, we design a novel training objective
that improves the match in the acoustic signature between the RIR predictions
and the targets. In experiments using a state-of-the-art audio-visual simulator
for 3D environments, we demonstrate that our method successfully generates
arbitrary RIRs, outperforming state-of-the-art methods and--in a major
departure from traditional methods--generalizing to novel environments in a
few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
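The pipeline described above has two stages that can be sketched compactly: self-attention over embeddings of the sparse images and echoes to build the acoustic context, and cross-attention from a query source-receiver pose to decode its RIR. The following PyTorch sketch is a minimal illustration of that structure; the module sizes, the 6-D pose query, and the flattened RIR head are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FewShotRIRPredictor(nn.Module):
    """Schematic of the described two-stage pipeline. Sizes and the 6-D
    (source xyz, receiver xyz) query encoding are illustrative only."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, rir_dim=4096):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, n_layers)
        self.query_embed = nn.Linear(6, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rir_head = nn.Linear(d_model, rir_dim)

    def forward(self, context_tokens, query_pose):
        # context_tokens: (B, N, d_model) embeddings of the sparse images,
        # echoes, and their poses; query_pose: (B, 6) source+receiver location.
        ctx = self.context_encoder(context_tokens)      # self-attention context
        q = self.query_embed(query_pose).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.cross_attn(q, ctx, ctx)         # query attends to context
        return self.rir_head(fused.squeeze(1))          # (B, rir_dim) RIR estimate
```

The abstract describes the training objective only as improving the match in acoustic signature between predicted and target RIRs. One plausible stand-in, shown purely for illustration, is to compare Schroeder-style energy decay curves:

```python
def energy_decay_loss(pred_rir, target_rir, eps=1e-8):
    """Compare Schroeder-style energy decay curves of two batches of
    time-domain RIRs, shape (B, T). An illustrative stand-in for the
    paper's acoustic-signature objective, not its actual definition."""
    def edc_db(h):
        energy = torch.flip(torch.cumsum(torch.flip(h ** 2, [-1]), -1), [-1])
        return 10.0 * torch.log10(energy + eps)
    return torch.mean(torch.abs(edc_db(pred_rir) - edc_db(target_rir)))
```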
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Hearing Anything Anywhere [26.415266601469767]
We introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene.
This allows us to synthesize novel auditory experiences through the space with any source audio.
We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations.
arXiv Detail & Related papers (2024-06-11T17:56:14Z)
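Given an RIR rendered by a framework like DiffRIR for some listener position, synthesizing a novel auditory experience from any source audio reduces to convolving the dry signal with that RIR. Below is a generic sketch of this standard auralization step (not DiffRIR's differentiable renderer), assuming both signals share a sampling rate:

```python
import numpy as np
from scipy.signal import fftconvolve

def auralize(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate how dry (anechoic) audio would sound at the receiver
    position the RIR was rendered for: reverberation is the convolution
    of the source signal with the room impulse response."""
    wet = fftconvolve(dry_audio, rir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # peak-normalize to avoid clipping
```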
- ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling [57.1025908604556]
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment.
We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment.
We introduce ActiveRIR, a reinforcement learning policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions.
arXiv Detail & Related papers (2024-04-24T21:30:01Z)
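The task structure behind ActiveRIR, an agent spending a limited sampling budget to drive down the error of the environment acoustic model, can be shown with a toy loop. Everything below (the environment, the reward, the random policy) is a deliberately simplified stand-in, not the ActiveRIR policy or its audio-visual observations:

```python
import random

class ToySamplingEnv:
    """Toy stand-in for active acoustic sampling: model error shrinks when
    the agent samples a position it has not yet visited. Purely
    illustrative; not the ActiveRIR environment or reward."""

    def __init__(self, n_positions=25, budget=5):
        self.unvisited = set(range(n_positions))
        self.budget = budget
        self.error = 1.0

    def step(self, position):
        gain = 0.15 if position in self.unvisited else 0.0
        self.unvisited.discard(position)
        self.error -= gain
        self.budget -= 1
        return self.error, gain, self.budget == 0  # observation, reward, done

env = ToySamplingEnv()
done = False
while not done:  # random policy; ActiveRIR instead learns where to sample
    _, reward, done = env.step(random.randrange(25))
print(f"remaining acoustic-model error: {env.error:.2f}")
```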
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
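The setup described for AV-RIR, estimating an RIR from a reverberant speech signal plus visual cues of the environment, suggests a two-branch encoder feeding a shared decoder. The sketch below is a schematic of that general family with placeholder layers and sizes; it is not the AV-RIR architecture or its multi-task heads:

```python
import torch
import torch.nn as nn

class AudioVisualRIREstimator(nn.Module):
    """Schematic two-branch estimator: encode reverberant speech and a
    scene view, fuse, and decode a time-domain RIR. All layer choices
    and sizes are placeholders, not AV-RIR's."""

    def __init__(self, d=256, rir_len=4096):
        super().__init__()
        self.speech_enc = nn.Sequential(
            nn.Conv1d(1, d, 64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.visual_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, d), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, rir_len))

    def forward(self, speech, image):
        # speech: (B, 1, T) reverberant waveform; image: (B, 3, 64, 64) view.
        z = torch.cat([self.speech_enc(speech), self.visual_enc(image)], dim=-1)
        return self.decoder(z)  # (B, rir_len) time-domain RIR estimate
```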
- RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios [36.50731790624643]
We introduce RIR-SF, a novel spatial feature based on the room impulse response (RIR).
RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance.
We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings.
arXiv Detail & Related papers (2023-10-31T20:42:08Z)
- Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene.
Driven by the unique properties of RIR, we design a temporal correlation module and multi-scale energy decay criterion.
Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z)
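NACF belongs to the family of field-based methods that parameterize an audio scene as a learned function from emitter and listener coordinates to an RIR. A generic coordinate-MLP sketch of that family follows; it deliberately omits NACF's temporal correlation module and multi-scale energy decay criterion, which are the paper's contributions:

```python
import torch
import torch.nn as nn

class AcousticField(nn.Module):
    """Generic neural acoustic field: an MLP mapping (emitter, listener)
    coordinates to an RIR magnitude spectrogram. Illustrates the
    field-based family NACF extends, not NACF itself."""

    def __init__(self, freq_bins=257, time_frames=64, hidden=256):
        super().__init__()
        self.out_shape = (freq_bins, time_frames)
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, freq_bins * time_frames))

    def forward(self, emitter_xyz, listener_xyz):
        x = torch.cat([emitter_xyz, listener_xyz], dim=-1)  # (B, 6)
        return self.mlp(x).view(-1, *self.out_shape)        # (B, F, T)
```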
- Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
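The hybrid synthetic RIRs referenced above pair wave-based simulation, which is accurate but practical only at low frequencies, with geometric ray tracing at higher frequencies. Below is a minimal illustration of combining the two with a crossover filter; the filter order and the 500 Hz crossover are assumptions, not the paper's values:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def hybrid_rir(wave_rir: np.ndarray, geometric_rir: np.ndarray,
               fs: int = 16000, crossover_hz: float = 500.0) -> np.ndarray:
    """Keep the wave-solver RIR below the crossover frequency and the
    ray-traced RIR above it. Assumes both RIRs are time-aligned, share
    the sampling rate fs, and have equal length."""
    lo = butter(4, crossover_hz, btype="lowpass", fs=fs, output="sos")
    hi = butter(4, crossover_hz, btype="highpass", fs=fs, output="sos")
    return sosfilt(lo, wave_rir) + sosfilt(hi, geometric_rir)
```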
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.