Fugu-MT Paper Translation (Abstract): Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

Paper overview: Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

arxiv url: http://arxiv.org/abs/2603.10623v1
Date: Wed, 11 Mar 2026 10:34:49 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.898221
Title: Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context
Authors: Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren
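Abstract summary: GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones. The results show that incorporating GSC improves AT performance.
Reference score (attention score computed independently by this site): 16.979013371188074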
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio-only and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating that geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows no significant difference in model performance between Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, the Geo-ATBench benchmark, and the reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. The dataset, code, and models are available on the project homepage (https://github.com/WuYanru2002/Geo-ATBench).
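
The abstract names feature-level and decision-level fusion without describing how they are computed. The sketch below is only a generic illustration of those two ideas for multi-label tagging, not the GeoFusion-AT implementation: the function names, the equal weighting, the 0.5 threshold, and the three-label example are all assumptions made for illustration. Representation-level fusion, which would combine intermediate embeddings inside a network, is left out.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_level_fusion(audio_features: Sequence[float],
                         gsc_features: Sequence[float]) -> List[float]:
    """Early fusion (assumed form): concatenate both inputs for one joint model."""
    return list(audio_features) + list(gsc_features)


def decision_level_fusion(audio_logits: Sequence[float],
                          gsc_logits: Sequence[float],
                          audio_weight: float = 0.5) -> List[float]:
    """Late fusion (assumed form): weighted average of per-label probabilities."""
    if len(audio_logits) != len(gsc_logits):
        raise ValueError("both models must score the same label set")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [
        audio_weight * sigmoid(a) + (1.0 - audio_weight) * sigmoid(g)
        for a, g in zip(audio_logits, gsc_logits)
    ]


def tag(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Multi-label decision: threshold every label independently."""
    return [int(p >= threshold) for p in probabilities]


if __name__ == "__main__":
    # Hypothetical scores for three labels. The audio model is unsure about
    # the first two labels; the context model favors the first one.
    audio_logits = [0.1, 0.1, -3.0]
    gsc_logits = [2.0, -2.5, -3.0]

    audio_only = tag([sigmoid(a) for a in audio_logits])
    fused = tag(decision_level_fusion(audio_logits, gsc_logits))
    print("audio only:", audio_only)  # [1, 1, 0]
    print("fused:     ", fused)       # [1, 0, 0]
```

In this toy example the audio-only scores tag both of the first two labels, while the fused scores keep only the one that the context prior supports, which is the kind of disambiguation the abstract attributes to GSC.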


Related papers