Fugu-MT Paper Translation (Abstract): SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Paper overview: SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

arxiv url: http://arxiv.org/abs/2603.09853v1
Date: Tue, 10 Mar 2026 16:15:12 GMT
Status: Translation complete
System update date: 2026-03-11 15:25:24.447177
Title: SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Title (reference translation): SCENEBench: An Audio Understanding Benchmark Targeting Assistive and Industrial Use Cases
Authors: Laya Iyer, Angelina Wang, Sanmi Koyejo
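Abstract summary: SCENEBench targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. The purpose of this benchmark suite is to assess not only what words are said but also how they are said and the non-speech components of the audio. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy.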
Reference score (independently computed attention score): 27.340743922132067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said - rather, how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
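Abstract (reference translation): Advances in large language models (LLMs) have enabled significant capabilities in audio processing, producing the state-of-the-art models now known as Large Audio Language Models (LALMs). However, little work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), a benchmark suite that targets broad audio comprehension across four categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs in accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess not only the words that are spoken but also how they are spoken and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate the benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.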
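The abstract mentions that samples are synthetically constructed, for example by overlaying two natural audio samples. A minimal sketch of what such an overlay could look like is given here, assuming mono signals stored as lists of floats in [-1, 1] at the same sample rate. The function names, the `snr_db` parameter, the looping of the background, and the clipping guard are illustrative assumptions and are not taken from the paper.

```python
import math


def rms(samples):
    """Root-mean-square level of a mono signal given as a list of floats."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def overlay(foreground, background, snr_db):
    """Mix `background` under `foreground` at a target SNR in decibels.

    Assumes both inputs are non-silent and share a sample rate. The
    background is looped or truncated to the foreground's length.
    """
    n = len(foreground)
    looped = [background[i % len(background)] for i in range(n)]
    # Gain chosen so that rms(foreground) / rms(gain * looped) = 10^(snr_db / 20).
    gain = rms(foreground) / (rms(looped) * 10 ** (snr_db / 20))
    mixed = [f + gain * b for f, b in zip(foreground, looped)]
    # Rescale only if the sum would clip outside [-1, 1].
    peak = max(abs(m) for m in mixed)
    return [m / peak for m in mixed] if peak > 1.0 else mixed


# Hypothetical stand-ins for two natural recordings: a 220 Hz tone as the
# "speech" and a shorter 1 kHz tone as the "background", both at 16 kHz.
speech = [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [0.1 * math.sin(2 * math.pi * 1000 * t / 16000) for t in range(8000)]
mixed = overlay(speech, noise, snr_db=10)
```

Setting the level by RMS ratio rather than by peak keeps the stated SNR meaningful for signals with different crest factors; a real pipeline would read and write audio files instead of using synthetic tones.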


Related papers