Fugu-MT paper translation (abstract): RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

Paper abstract: RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

arxiv url: http://arxiv.org/abs/2603.09809v1
Date: Tue, 10 Mar 2026 15:37:55 GMT
Status: translation complete
System update date: 2026-03-11 15:25:24.430596
Title: RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding
Title (reference translation): RA-SSU: Aiming at Fine-Grained Audio-Visual Learning through Region-Aware Sound Source Understanding
Authors: Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun
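Abstract summary: Region-Aware Sound Source Understanding (RA-SSU) aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, the authors construct two corresponding datasets, fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene). The f-Lifescene dataset contains 6,156 samples across 61 types.
Reference score (independently computed attention score): 43.03351086257886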
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-Visual Learning (AVL) is one fundamental task of multi-modality learning and embodied intelligence, displaying the vital role in scene understanding and interaction. However, previous researchers mostly focus on exploring downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). Considering providing more specific scene perception details, we newly define a fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we innovatively construct two corresponding datasets, i.e. fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both the sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, Mask Collaboration Module (MCM) and Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the elaboration of the sound source description. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the availability of the datasets, and demonstrate the superiority of the SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.
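Abstract (reference translation): Audio-Visual Learning (AVL) is one of the fundamental tasks of multi-modal learning and intelligence, and plays an important role in scene understanding and interaction. However, previous researchers have mainly focused on exploring downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we newly define a fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios. The f-Lifescene dataset contains 6,156 samples across 61 types. Furthermore, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules, the Mask Collaboration Module (MCM) and the Mixture of Hierarchical-prompted Experts (MoHE), which respectively improve the accuracy and enrich the detail of the sound source description. Extensive experiments are conducted on the two proposed datasets to verify the feasibility of the task, evaluate the availability of the datasets, and demonstrate the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.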

Related papers