Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
- URL: http://arxiv.org/abs/2309.10667v1
- Date: Tue, 19 Sep 2023 14:49:50 GMT
- Title: Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
- Authors: Subash Khanal, Srikumar Sastry, Aayush Dhakal, Nathan Jacobs
- Abstract summary: We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location.
We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location.
Our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450.
- Score: 8.545983117985434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We focus on the task of soundscape mapping, which involves predicting the
most probable sounds that could be perceived at a particular geographic
location. We utilise recent state-of-the-art models to encode geotagged audio,
a textual description of the audio, and an overhead image of its capture
location using contrastive pre-training. The end result is a shared embedding
space for the three modalities, which enables the construction of soundscape
maps for any geographic region from textual or audio queries. Using the
SoundingEarth dataset, we find that our approach significantly outperforms the
existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to
0.450. Our code is available at https://github.com/mvrl/geoclap.
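To make the tri-modal setup concrete, below is a minimal, hypothetical PyTorch sketch (not the released geoclap code). It assumes CLIP-style symmetric InfoNCE losses applied to each pair of L2-normalized modality embeddings, and shows how a zero-shot soundscape map reduces to cosine similarity between a text (or audio) query embedding and precomputed overhead-image embeddings over a geographic grid. All function names, the temperature value, and the exact loss formulation are illustrative assumptions.

```python
# Hypothetical sketch (not the released geoclap code) of a tri-modal
# contrastive objective and a zero-shot soundscape query.
import torch
import torch.nn.functional as F


def clip_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings.

    a, b: (batch, dim); row i of `a` is paired with row i of `b`.
    """
    logits = a @ b.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def trimodal_loss(audio_emb, text_emb, image_emb, temperature: float = 0.07) -> torch.Tensor:
    """Sum of the three pairwise contrastive losses, tying geotagged audio,
    its textual description, and the overhead image of the capture location
    into one shared embedding space."""
    audio_emb, text_emb, image_emb = (F.normalize(e, dim=-1) for e in (audio_emb, text_emb, image_emb))
    return (clip_loss(audio_emb, text_emb, temperature)
            + clip_loss(audio_emb, image_emb, temperature)
            + clip_loss(text_emb, image_emb, temperature))


@torch.no_grad()
def soundscape_scores(query_emb: torch.Tensor, grid_image_embs: torch.Tensor) -> torch.Tensor:
    """Zero-shot soundscape map: cosine similarity between one query embedding
    (from text or audio) and precomputed overhead-image embeddings covering a
    geographic grid. Returns one score per grid cell."""
    query_emb = F.normalize(query_emb, dim=-1)
    grid_image_embs = F.normalize(grid_image_embs, dim=-1)
    return grid_image_embs @ query_emb


if __name__ == "__main__":
    # Random features stand in for the outputs of the audio/text/image encoders.
    B, D = 32, 512
    loss = trimodal_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    scores = soundscape_scores(torch.randn(D), torch.randn(10_000, D))
    print(loss.item(), scores.shape)  # scalar loss; (10000,) scores over the grid
```

In practice the batch embeddings would come from the pre-trained audio, text, and overhead-image encoders mentioned in the abstract, and the per-cell scores can be rendered as a heat map over the region to produce the soundscape map.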
Related papers
- PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping [7.076417856575795]
A soundscape is defined by the acoustic environment a person perceives at a location.
We propose a framework for mapping soundscapes across the Earth.
We represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
arXiv Detail & Related papers (2024-08-13T17:37:40Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that schedules the learning procedure of each model component to associate the audio and visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task, real-world audio-visual scene synthesis, and propose a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z) - Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations [65.37621891132729]
We build the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation.
We present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space.
Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff.
arXiv Detail & Related papers (2023-01-04T18:47:32Z) - Localizing Visual Sounds the Easy Way [26.828874753756523]
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.
We propose EZ-VSL, an approach that does not rely on constructing positive and/or negative regions during training.
Our framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source.
arXiv Detail & Related papers (2022-03-17T13:52:58Z) - Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose a novel dynamic audio-visual navigation benchmark that requires catching a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z) - VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio-visual dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.