Improving Acoustic Scene Classification with City Features
- URL: http://arxiv.org/abs/2503.16862v2
- Date: Fri, 13 Jun 2025 02:00:40 GMT
- Title: Improving Acoustic Scene Classification with City Features
- Authors: Yiqiang Cai, Yizhou Tan, Shengchen Li, Xi Shao, Mark D. Plumbley
- Abstract summary: City2Scene is a novel framework that leverages city features to improve acoustic scene classification. By distilling city-specific knowledge, City2Scene effectively improves accuracy across a variety of lightweight CNN backbones.
- Score: 14.60560396933802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. However, the potential acoustic differences introduced by city-specific environmental and cultural factors are overlooked. In this paper, we hypothesize that city-specific acoustic features are beneficial for the ASC task rather than being treated as noise or bias. To this end, we propose City2Scene, a novel framework that leverages city features to improve ASC. Unlike conventional approaches that may discard or suppress city information, City2Scene transfers city-specific knowledge from pre-trained city classification models to the scene classification model using knowledge distillation. We evaluate City2Scene on three datasets of DCASE Challenge Task 1, which include both scene and city labels. Experimental results demonstrate that city features provide valuable information for classifying scenes. By distilling city-specific knowledge, City2Scene effectively improves accuracy across a variety of lightweight CNN backbones, achieving performance competitive with the top-ranked solutions of the DCASE Challenge in recent years.
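Below is a minimal sketch of the distillation setup the abstract describes, assuming a standard knowledge-distillation loss: a frozen, pre-trained city classifier acts as teacher, and the lightweight scene classifier (student) carries an auxiliary city head that mimics the teacher's softened city predictions while the main head is trained on scene labels. The `SceneStudent` class, the temperature `T`, and the weight `lam` are illustrative assumptions, not City2Scene's published code.

```python
# Hedged sketch of city-to-scene knowledge distillation (PyTorch).
# Assumptions: the backbone outputs pooled features of size feat_dim, and the
# frozen teacher maps a spectrogram batch directly to city logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneStudent(nn.Module):
    """Lightweight CNN backbone with a scene head and an auxiliary city head."""
    def __init__(self, backbone: nn.Module, feat_dim: int, n_scenes: int, n_cities: int):
        super().__init__()
        self.backbone = backbone
        self.scene_head = nn.Linear(feat_dim, n_scenes)
        self.city_head = nn.Linear(feat_dim, n_cities)  # receives distilled city knowledge

    def forward(self, x):
        feats = self.backbone(x)
        return self.scene_head(feats), self.city_head(feats)

def city2scene_step(student, teacher, spec, scene_labels, T=2.0, lam=0.5):
    """One training step: scene cross-entropy plus city knowledge distillation."""
    scene_logits, student_city_logits = student(spec)
    with torch.no_grad():  # teacher is pre-trained and frozen
        teacher_city_logits = teacher(spec)
    ce = F.cross_entropy(scene_logits, scene_labels)
    kd = F.kl_div(
        F.log_softmax(student_city_logits / T, dim=-1),
        F.softmax(teacher_city_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # standard temperature rescaling of the KD term
    return ce + lam * kd
```

In this reading, the scene loss supplies the primary training signal, while the distillation term nudges the student's features toward the city structure the teacher has already learned, rather than suppressing it.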
Related papers
- Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes [0.9208007322096533]
This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence.
It is a modular workflow for scoring street-level urban scenes using open-access data and vision-language models.
It operates without task-specific training or proprietary software dependencies.
arXiv Detail & Related papers (2025-04-23T09:08:06Z) - EMPLACE: Self-Supervised Urban Scene Change Detection [6.250018240133604]
Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision.
We introduce AC-1M, the largest USCD dataset to date with over 1.1M images, together with EMPLACE, a self-supervised method to train a Vision Transformer.
In a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices.
arXiv Detail & Related papers (2025-03-22T10:20:43Z) - Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos [101.48581851337703]
We present BTimer, the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes.
Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ('bullet') timestamp by aggregating information from all the context frames.
Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets.
arXiv Detail & Related papers (2024-12-04T18:15:06Z) - COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation [1.5745692520785073]
We introduce a novel graph-based masked autoencoder (GMAE) for city-scale urban layout generation.
The method encodes attributed buildings, city blocks, communities and cities into a unified graph structure.
Our approach achieves good realism, semantic consistency, and correctness across the heterogeneous urban styles in 330 US cities.
arXiv Detail & Related papers (2024-07-16T00:49:53Z) - Towards better visualizations of urban sound environments: insights from interviews [1.2599533416395765]
We analyze the need for representations of sound sources by identifying the urban stakeholders for whom such representations are assumed to be of importance.
Three distinct uses of sound source representations emerged in this study: noise-related complaints for industrial stakeholders and specialized citizens, soundscape quality assessment for citizens, and guidance for urban planners.
Findings reveal diverse perspectives on the use of visualizations, which should rely on indicators adapted to the target audience and enable data accessibility.
arXiv Detail & Related papers (2024-06-11T07:39:48Z) - CityCraft: A Real Crafter for 3D City Generation [25.7885801163556]
CityCraft is an innovative framework designed to enhance both the diversity and quality of urban scene generation.
Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts.
Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction.
arXiv Detail & Related papers (2024-06-07T14:49:00Z) - Urban Scene Diffusion through Semantic Occupancy Map [49.20779809250597]
UrbanDiffusion is a 3D diffusion model conditioned on a Bird's-Eye View (BEV) map.
Our model learns the data distribution of scene-level structures within a latent space.
After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes.
arXiv Detail & Related papers (2024-03-18T11:54:35Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, and SAR data) for the study of the cross-city semantic segmentation task.
Going beyond a single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the model's ability to generalize across multi-city environments.
HighDAN retains the spatial topology of the studied urban scene well through parallel high-to-low resolution fusion.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - Robust, General, and Low Complexity Acoustic Scene Classification Systems and An Effective Visualization for Presenting a Sound Scene Context [53.80051967863102]
We present a comprehensive analysis of Acoustic Scene Classification (ASC).
We propose an inception-based and low footprint ASC model, referred to as the ASC baseline.
Next, we improve the ASC baseline by proposing a novel deep neural network architecture.
arXiv Detail & Related papers (2022-10-16T19:07:21Z) - Urban Rhapsody: Large-scale exploration of urban soundscapes [12.997538969557649]
Noise is one of the primary quality-of-life issues in urban environments.
Low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions.
The amount of data they produce and the complexity of these data pose significant analytical challenges.
We propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics.
arXiv Detail & Related papers (2022-05-25T22:02:36Z) - Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z) - Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [119.72951028190586]
Crowd localization is a new computer vision task that evolved from crowd counting.
In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes.
We propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes.
arXiv Detail & Related papers (2021-08-02T01:27:53Z) - Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition [61.54648991466747]
We explore an audiovisual aerial scene recognition task using both images and sounds as input.
We show the benefit of exploiting audio information for aerial scene recognition.
arXiv Detail & Related papers (2020-05-18T04:14:16Z) - Indexical Cities: Articulating Personal Models of Urban Preference with Geotagged Data [0.0]
This research characterizes personal preference in urban spaces and predicts a spectrum of unknown likeable places for a specific observer.
Unlike most urban perception studies, our intention is not to provide an objective measure of urban quality, but rather to portray personal views of the city, or Cities of Cities.
arXiv Detail & Related papers (2020-01-23T11:00:19Z)