Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
- URL: http://arxiv.org/abs/2506.03388v1
- Date: Tue, 03 Jun 2025 20:56:37 GMT
- Title: Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
- Authors: Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang
- Abstract summary: We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery. We find that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony--Geophony--Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
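The abstract's core operation is a cross-modal similarity comparison between an audio embedding and competing visual embeddings. A minimal sketch of that comparison, assuming pre-extracted, fixed-length vectors and cosine similarity as the metric (random vectors stand in here for the AST and CLIP/RemoteCLIP encoder outputs, which are not reproduced from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-ins for an AST audio embedding and CLIP-style visual embeddings;
# in practice these would be the encoders' (often L2-normalized) outputs.
audio_emb = rng.standard_normal(512)
street_emb = rng.standard_normal(512)
aerial_emb = rng.standard_normal(512)

# Compare the sound recording against each visual representation of the
# same geo-referenced location and keep the better-aligned modality.
sims = {
    "street": cosine_similarity(audio_emb, street_emb),
    "aerial": cosine_similarity(audio_emb, aerial_emb),
}
best = max(sims, key=sims.get)
print(f"best-aligned visual modality: {best}")
```

Aggregating such per-location scores across cities is what supports the paper's street-view-versus-segmentation comparison; the per-pair computation itself is this simple.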
Related papers
- SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction [5.989764659998189]
SoundSculpt is a neural network designed to extract target sound fields from ambisonic recordings. It employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information and semantic embeddings.
arXiv Detail & Related papers (2025-05-30T22:15:10Z) - SounDiT: Geo-Contextual Soundscape-to-Landscape Generation [28.099729084181092]
We present a novel problem: Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation. GeoS2L aims to synthesize geographically realistic landscape images from environmental soundscapes.
arXiv Detail & Related papers (2025-05-19T05:47:13Z) - PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping [7.076417856575795]
A soundscape is defined by the acoustic environment a person perceives at a location.
We propose a framework for mapping soundscapes across the Earth.
We represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
arXiv Detail & Related papers (2024-08-13T17:37:40Z) - ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling [57.1025908604556]
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment.
We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment.
We introduce ActiveRIR, a reinforcement learning policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions.
arXiv Detail & Related papers (2024-04-24T21:30:01Z) - Multi-Level Neural Scene Graphs for Dynamic Urban Environments [64.26401304233843]
We present a novel, decomposable radiance field approach for dynamic urban environments.
We propose a multi-level neural scene graph representation that scales to thousands of images from dozens of sequences with hundreds of fast-moving objects.
arXiv Detail & Related papers (2024-03-29T21:52:01Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's ability to generalize across multi-city environments.
HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - Mitigating Urban-Rural Disparities in Contrastive Representation Learning with Satellite Imagery [19.93324644519412]
We consider the risk of urban-rural disparities in identification of land-cover features.
We propose fair dense representation with contrastive learning (FairDCL) as a method for de-biasing the multi-level latent space of convolution neural network models.
The obtained image representation mitigates downstream urban-rural prediction disparities and outperforms state-of-the-art baselines on real-world satellite images.
arXiv Detail & Related papers (2022-11-16T04:59:46Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding [8.396746290518102]
Urban2Vec is an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest data.
We show that Urban2Vec achieves performance better than baseline models and comparable to fully-supervised methods in downstream prediction tasks.
arXiv Detail & Related papers (2020-01-29T21:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.