Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
- URL: http://arxiv.org/abs/2505.13777v1
- Date: Mon, 19 May 2025 23:36:04 GMT
- Title: Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
- Authors: Subash Khanal, Srikumar Sastry, Aayush Dhakal, Adeel Ahmad, Nathan Jacobs
- Abstract summary: We present Sat2Sound, a framework to predict the distribution of sounds at any location on Earth. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite imagery and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound's ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
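The abstract's central modeling ideas, a shared codebook of soundscape concepts with each sample represented as a weighted average of those concepts, plus contrastive alignment across modalities, can be illustrated with a short sketch. The code below is a minimal illustration under assumed design choices, not the released Sat2Sound implementation; the module name `CodebookPooling`, the codebook size, the embedding dimension, and the use of an InfoNCE-style loss are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) representing every sample, regardless of modality, as an attention-weighted
#     average over a shared codebook of learned "soundscape concepts", and
# (2) a symmetric InfoNCE contrastive loss between two paired modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookPooling(nn.Module):
    """Map a modality embedding onto a weighted average of shared concept codes."""

    def __init__(self, num_concepts: int = 64, dim: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) embedding from any modality-specific encoder.
        # Scaled dot-product attention weights over the shared concepts.
        weights = F.softmax(x @ self.codebook.t() / self.codebook.shape[1] ** 0.5, dim=-1)
        # Each sample becomes a convex combination of concept codes.
        return weights @ self.codebook


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between index-aligned embeddings of two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    pool = CodebookPooling()
    # Random tensors stand in for satellite-image and audio encoder outputs (batch of 8).
    image_emb = pool(torch.randn(8, 512))
    audio_emb = pool(torch.randn(8, 512))
    print(info_nce(image_emb, audio_emb).item())
```

In the paper's setting there are four modalities (audio, audio captions, satellite images, and satellite image captions); presumably each has its own encoder feeding a shared pooling step like the one above, with contrastive losses summed over modality pairs, though those details are not spelled out in the abstract. A generic Recall@K evaluation sketch, matching the retrieval metric reported here and in several of the related papers below, appears after the related-papers list.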
Related papers
- SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
- SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction [5.989764659998189]
SoundSculpt is a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information and semantic embeddings.
arXiv Detail & Related papers (2025-05-30T22:15:10Z)
- SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding [51.311553815466446]
We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint.
arXiv Detail & Related papers (2025-04-08T00:22:16Z)
- PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping [7.076417856575795]
A soundscape is defined by the acoustic environment a person perceives at a location.
We propose a framework for mapping soundscapes across the Earth.
We represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
arXiv Detail & Related papers (2024-08-13T17:37:40Z)
- Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping [8.545983117985434]
We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location.
We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location.
Our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450.
arXiv Detail & Related papers (2023-09-19T14:49:50Z)
- Generating Realistic Images from In-the-wild Sounds [2.531998650341267]
We propose a novel approach to generate images from in-the-wild sounds.
First, we convert sound into text using audio captioning.
Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound.
arXiv Detail & Related papers (2023-09-05T17:36:40Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
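Both the Sat2Sound abstract and several entries above report cross-modal retrieval performance, for example image-to-audio Recall@100 in the tri-modal embeddings paper. The sketch below shows one standard way to compute Recall@K from paired embeddings; it is a generic illustration assuming cosine-similarity retrieval with index-aligned pairs, not code from any of the listed papers.

```python
# Generic Recall@K for cross-modal retrieval (e.g. image -> audio), assuming the
# i-th query is paired with the i-th gallery item. Not taken from any listed paper.
import torch
import torch.nn.functional as F


def recall_at_k(queries: torch.Tensor, gallery: torch.Tensor, k: int = 100) -> float:
    """Fraction of queries whose true match ranks in the top-k by cosine similarity."""
    q = F.normalize(queries, dim=-1)
    g = F.normalize(gallery, dim=-1)
    sims = q @ g.t()                                   # (num_queries, num_gallery)
    topk = sims.topk(k=min(k, g.shape[0]), dim=-1).indices
    targets = torch.arange(q.shape[0], device=q.device).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1)               # true match appears in top-k
    return hits.float().mean().item()


if __name__ == "__main__":
    # Random embeddings stand in for satellite-image and audio features.
    torch.manual_seed(0)
    img, aud = torch.randn(1000, 512), torch.randn(1000, 512)
    print(f"image-to-audio Recall@100: {recall_at_k(img, aud, k=100):.3f}")
```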