Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound
Event Localization and Detection in Realistic Rooms
- URL: http://arxiv.org/abs/2401.12238v1
- Date: Fri, 19 Jan 2024 19:01:13 GMT
- Title: Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound
Event Localization and Detection in Realistic Rooms
- Authors: Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian
McFee, Juan P. Bello
- Abstract summary: Sound event localization and detection (SELD) is an important task in machine listening.
We present SpatialScaper, a library for SELD data simulation and augmentation.
- Score: 4.266697413924045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound event localization and detection (SELD) is an important task in machine
listening. Major advancements rely on simulated data with sound events in
specific rooms and strong spatio-temporal labels. SELD data is simulated by
convolving spatially localized room impulse responses (RIRs) with sound
waveforms to place sound events in a soundscape. However, RIRs require manual
collection in specific rooms. We present SpatialScaper, a library for SELD data
simulation and augmentation. Compared to existing tools, SpatialScaper emulates
virtual rooms via parameters such as size and wall absorption. This allows for
parameterized placement (including movement) of foreground and background sound
sources. SpatialScaper also includes data augmentation pipelines that can be
applied to existing SELD data. As a case study, we use SpatialScaper to add
rooms to the DCASE SELD data. Training a model with our data led to progressive
performance improvements as a direct function of acoustic diversity. These results
show that SpatialScaper is valuable for training robust SELD models.
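To make the simulation step concrete: at its core, placing a sound event in a soundscape amounts to convolving the event waveform with a multichannel RIR and mixing the result in at the desired onset. The sketch below shows that operation in plain NumPy/SciPy; it is illustrative only and does not use SpatialScaper's actual API (all function and variable names here are assumptions).

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_event(event, rir):
    """Convolve a mono event (n_samples,) with a multichannel RIR
    (n_channels, rir_len) to obtain a spatialized multichannel event."""
    return np.stack([fftconvolve(event, rir[ch]) for ch in range(rir.shape[0])])

def add_to_soundscape(scape, event, onset):
    """Mix a spatialized event into a multichannel soundscape at a given
    onset (in samples), trimming if it overruns the end."""
    end = min(onset + event.shape[1], scape.shape[1])
    scape[:, onset:end] += event[:, : end - onset]

fs = 24000
soundscape = np.zeros((4, 10 * fs))       # 10 s, 4-channel soundscape
event = np.random.randn(fs)               # stand-in for a 1 s sound event
rir = np.random.randn(4, fs // 2) * 0.01  # stand-in for a measured/simulated RIR
add_to_soundscape(soundscape, spatialize_event(event, rir), onset=2 * fs)
```

A common way to simulate the moving sources the abstract mentions is to cross-fade between RIRs at successive positions along the source trajectory.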
Related papers
- EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling [69.96729022219117]
When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes.
Recent advances in event camera hardware show good potential for application to visual sound recovery.
We propose a novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal information from the event stream.
arXiv Detail & Related papers (2025-04-03T08:51:17Z)
- HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset [0.6568378556428859]
This contribution introduces a dataset of 7th-order Ambisonic Room Impulse Responses (HOA-RIRs) created using the Image Source Method.
By employing higher-order Ambisonics, our dataset enables precise spatial audio reproduction.
The presented 64-microphone configuration allows us to capture RIRs directly in the Spherical Harmonics domain.
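For context on the underlying technique: the Image Source Method models each wall reflection as a mirrored virtual source. A minimal shoebox example using the pyroomacoustics library (a separate tool, not HARP's actual pipeline) looks like this:

```python
import pyroomacoustics as pra

# Shoebox room simulated with the Image Source Method (ISM).
room = pra.ShoeBox(
    [6.0, 5.0, 3.0],              # room dimensions in meters
    fs=48000,
    materials=pra.Material(0.3),  # frequency-flat wall energy absorption
    max_order=17,                 # maximum reflection order for the ISM
)
room.add_source([2.0, 3.0, 1.5])      # sound source position
room.add_microphone([4.0, 2.0, 1.2])  # single omnidirectional microphone
room.compute_rir()
rir = room.rir[0][0]  # impulse response from source 0 to microphone 0
```

An Ambisonic RIR additionally encodes each arrival's direction into spherical-harmonic channels, which is what HARP's 64-microphone configuration captures directly.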
arXiv Detail & Related papers (2024-11-21T15:16:48Z)
- SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios [19.24195341920164]
We introduce SonicSim, a synthetic toolkit to generate data for moving sound sources.
It supports multi-level adjustments, including scene-level, microphone-level, and source-level.
To validate the differences between synthetic data and real-world data, we randomly selected 5 hours of raw data without reverberation.
The results indicate that the synthetic data generated by SonicSim can effectively generalize to real-world scenarios.
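The "multi-level adjustments" mentioned above map naturally onto a nested configuration. The dataclasses below are purely hypothetical (they do not reflect SonicSim's actual API) and only illustrate what scene-, microphone-, and source-level parameters might look like:

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:           # source level: what plays and how it moves
    audio_path: str
    trajectory: list          # (x, y, z) waypoints for a moving source
    speed_mps: float = 1.0

@dataclass
class MicConfig:              # microphone level: where it sits, what kind
    position: tuple           # (x, y, z)
    array_type: str = "mono"  # e.g. "mono", "binaural", "ambisonic"

@dataclass
class SceneConfig:            # scene level: geometry and acoustics
    mesh_path: str
    rt60_s: float = 0.5       # target reverberation time in seconds
    mics: list = field(default_factory=list)
    sources: list = field(default_factory=list)
```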
arXiv Detail & Related papers (2024-10-02T12:33:59Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction [51.71299452862839]
We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation.
We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that only take a specific frequency subband as input.
Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects.
arXiv Detail & Related papers (2024-05-05T06:01:31Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- BAT: Learning to Reason about Spatial Sounds with Large Language Models [45.757161909533714]
We present BAT, which combines the sound perception ability of a spatial scene analysis model with the natural language reasoning capabilities of a large language model (LLM).
Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning.
arXiv Detail & Related papers (2024-02-02T17:34:53Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
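Schematically, that training objective can be sketched as follows: corrupt ground-truth (onset, offset) boundaries with Gaussian noise at a random diffusion step and train a small network to recover the clean values. This is a generic denoising-diffusion recipe with assumed shapes and names, not DiffSED's actual implementation:

```python
import torch
import torch.nn as nn

T = 100                                # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)  # noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(              # stand-in for the model's decoder
    nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2)
)

def training_step(boundaries):
    """boundaries: (batch, 2) normalized (onset, offset) pairs in [0, 1]."""
    t = torch.randint(0, T, (boundaries.shape[0],))
    a = alpha_bar[t].unsqueeze(1)
    # Forward process: diffuse clean boundaries toward noise.
    noisy = a.sqrt() * boundaries + (1 - a).sqrt() * torch.randn_like(boundaries)
    # Reverse-process target: predict the clean boundaries from the noisy ones.
    pred = denoiser(torch.cat([noisy, t.float().unsqueeze(1) / T], dim=1))
    return nn.functional.mse_loss(pred, boundaries)

loss = training_step(torch.rand(8, 2).sort(dim=1).values)
loss.backward()
```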
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Realistic Noise Synthesis with Diffusion Models [68.48859665320828]
Deep image denoising models often rely on large amounts of training data for high-quality performance.
We propose a novel method that synthesizes realistic noise using diffusion models, namely Realistic Noise Synthesize Diffusor (RNSD).
RNSD can incorporate guided multiscale content, so that more realistic noise with spatial correlations can be generated at multiple frequencies.
arXiv Detail & Related papers (2023-05-23T12:56:01Z)
- SofaMyRoom: a fast and multiplatform "shoebox" room simulator for binaural room impulse response dataset generation [2.6763498831034043]
This paper introduces a shoebox room simulator able to systematically generate synthetic datasets of binaural room impulse responses (BRIRs) given an arbitrary set of head-related transfer functions (HRTFs).
The evaluation of machine hearing algorithms frequently requires BRIR datasets in order to simulate the acoustics of any environment.
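Conceptually, a BRIR can be assembled by rendering each room reflection through the left/right HRIR for its direction of arrival and summing per ear. The sketch below illustrates that idea only; it is not SofaMyRoom's implementation, and all names are assumptions:

```python
import numpy as np

def assemble_brir(reflections, hrirs, fs, length_s=1.0):
    """reflections: (delay_s, gain, direction_index) tuples from a room model.
    hrirs: (n_directions, 2, hrir_len) array of left/right HRIRs."""
    n = int(length_s * fs)
    brir = np.zeros((2, n + hrirs.shape[2]))
    for delay_s, gain, d in reflections:
        start = int(delay_s * fs)
        brir[:, start : start + hrirs.shape[2]] += gain * hrirs[d]
    return brir[:, :n]  # left/right binaural room impulse response

fs = 48000
hrirs = np.random.randn(360, 2, 256) * 0.05  # stand-in for a measured HRTF set
reflections = [(0.010, 1.0, 0), (0.018, 0.6, 90), (0.025, 0.4, 270)]
brir = assemble_brir(reflections, hrirs, fs)
```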
arXiv Detail & Related papers (2021-06-24T13:07:51Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)