PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
- URL: http://arxiv.org/abs/2408.07050v1
- Date: Tue, 13 Aug 2024 17:37:40 GMT
- Title: PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
- Authors: Subash Khanal, Eric Xing, Srikumar Sastry, Aayush Dhakal, Zhexiao Xiong, Adeel Ahmad, Nathan Jacobs
- Abstract summary: A soundscape is defined by the acoustic environment a person perceives at a location.
We propose a framework for mapping soundscapes across the Earth.
We represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
- Score: 7.076417856575795
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code are available at https://github.com/mvrl/PSM.
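The abstract describes a probabilistic joint embedding across multi-scale satellite imagery, audio, and text. The sketch below is a minimal illustration of that general idea, assuming PyTorch, Gaussian (mean, log-variance) heads per modality, reparameterized sampling, and pairwise InfoNCE objectives; the backbones, embedding size, and temperature are placeholder assumptions, not the authors' implementation.

```python
# Minimal sketch (not PSM's actual code): each modality encoder's features are
# projected to a Gaussian (mu, log_var) in a shared space, embeddings are
# sampled with the reparameterization trick, and pairwise InfoNCE losses align
# the three modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Projects backbone features to a Gaussian embedding (mu, log_var)."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.log_var = nn.Linear(in_dim, embed_dim)

    def forward(self, feats: torch.Tensor):
        mu, log_var = self.mu(feats), self.log_var(feats)
        # Reparameterization: sample z ~ N(mu, sigma^2) while staying differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return F.normalize(z, dim=-1), mu, log_var

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: align satellite-image, audio, and text embeddings pairwise
# (backbone features stubbed with random tensors).
img_feats, aud_feats, txt_feats = (torch.randn(8, 512) for _ in range(3))
heads = {m: ProbabilisticHead(512) for m in ("image", "audio", "text")}
z_img, *_ = heads["image"](img_feats)
z_aud, *_ = heads["audio"](aud_feats)
z_txt, *_ = heads["text"](txt_feats)
loss = info_nce(z_img, z_aud) + info_nce(z_img, z_txt) + info_nce(z_aud, z_txt)
```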
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z) - Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping [8.545983117985434]
We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location.
We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location.
Our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450.
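The Recall@100 figures quoted above come from standard cross-modal retrieval evaluation. Below is a minimal sketch of image-to-audio Recall@K, assuming L2-normalized embeddings and that the paired audio for query i sits at gallery index i; the embedding size and gallery size are placeholders.

```python
# Sketch of cross-modal Recall@K: the fraction of image queries whose paired
# audio clip ranks within the top-k most similar gallery items.
import torch

def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 100) -> float:
    sims = query_emb @ gallery_emb.t()                   # cosine similarity if inputs are L2-normalized
    topk = sims.topk(k, dim=1).indices                   # indices of the k most similar gallery items
    targets = torch.arange(query_emb.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Example with random, L2-normalized embeddings for 1000 image-audio pairs.
imgs = torch.nn.functional.normalize(torch.randn(1000, 256), dim=-1)
auds = torch.nn.functional.normalize(torch.randn(1000, 256), dim=-1)
print(recall_at_k(imgs, auds, k=100))
```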
arXiv Detail & Related papers (2023-09-19T14:49:50Z) - Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
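A hedged sketch of what such a coordinate transformation might look like: re-expressing the listener's position and view direction relative to the sound source so a model can learn a source-centric acoustic field. The exact parameterization used by AV-NeRF may differ; the feature layout here is an assumption.

```python
# Illustrative source-centric coordinate transform (not AV-NeRF's exact module):
# concatenate source-to-listener distance, the unit direction from source to
# listener, and the listener's view direction into one conditioning feature.
import torch

def source_centric_coords(listener_pos, listener_dir, source_pos):
    offset = listener_pos - source_pos                   # vector from source to listener
    dist = offset.norm(dim=-1, keepdim=True)
    direction = offset / dist.clamp(min=1e-8)            # unit direction; guarded near the source
    return torch.cat([dist, direction, listener_dir], dim=-1)

feats = source_centric_coords(torch.tensor([[2.0, 0.0, 1.0]]),
                              torch.tensor([[0.0, 1.0, 0.0]]),
                              torch.tensor([[0.0, 0.0, 1.0]]))
```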
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Audio Latent Space Cartography [0.0]
We explore the generation of visualisations of audio latent spaces using an audio-to-image generation pipeline.
We believe this can help with the interpretability of audio latent spaces.
arXiv Detail & Related papers (2022-12-05T21:51:33Z) - SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z) - Learning Neural Acoustic Fields [110.22937202449025]
We introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene.
By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs.
We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations.
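A minimal sketch in the spirit of NAFs, assuming a plain MLP that maps an (emitter position, listener position, time-frequency query) tuple to a log-magnitude of the room impulse response; the real model's input encoding and architecture are more involved, so treat this as an illustration only.

```python
# Illustrative implicit acoustic field: an MLP queried at continuous
# emitter/listener positions and a (time, frequency) coordinate.
import torch
import torch.nn as nn

class AcousticFieldMLP(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        # Inputs: 3D emitter pos + 3D listener pos + (time, frequency) query = 8 dims.
        self.net = nn.Sequential(
            nn.Linear(8, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),    # predicted log-magnitude of the impulse response at the query
        )

    def forward(self, emitter, listener, tf_query):
        return self.net(torch.cat([emitter, listener, tf_query], dim=-1))

field = AcousticFieldMLP()
pred = field(torch.rand(4, 3), torch.rand(4, 3), torch.rand(4, 2))
```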
arXiv Detail & Related papers (2022-04-04T17:59:37Z) - Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to develop perception of touch sounds without access to any ground-truth data.
We show how ideas from classical signal processing can be leveraged to collect large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
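As an illustration of the kind of classical signal processing such a data-collection pipeline might lean on, the sketch below performs short-time energy thresholding to pull candidate sound segments out of a long recording. This is not the paper's actual method; the frame length, hop, and threshold are arbitrary assumptions.

```python
# Illustrative energy-based segmentation: return sample spans whose
# short-time frame energy exceeds a fixed threshold.
import numpy as np

def energy_segments(signal: np.ndarray, sr: int, frame_ms: float = 25.0,
                    hop_ms: float = 10.0, threshold: float = 0.01):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    spans, start = [], None
    for i in range(0, len(signal) - frame + 1, hop):
        energetic = float(np.mean(signal[i:i + frame] ** 2)) > threshold
        if energetic and start is None:
            start = i                         # segment begins
        elif not energetic and start is not None:
            spans.append((start, i + frame))  # segment ends
            start = None
    if start is not None:
        spans.append((start, len(signal)))
    return spans

# Example: a 1 s burst of noise embedded in 3 s of silence at 16 kHz.
sr = 16000
sig = np.zeros(3 * sr)
sig[sr:2 * sr] = 0.5 * np.random.randn(sr)
print(energy_segments(sig, sr))
```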