Fugu-MT Paper Translation (Abstract): SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

Paper overview: SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

arxiv url: http://arxiv.org/abs/2601.04029v1
Date: Wed, 07 Jan 2026 15:45:41 GMT
Status: Translation complete
Last updated in system: 2026-01-09 02:15:23.671457
Title: SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency
Title (reference translation): SpeakerSleuth: Evaluating Large Audio-Language Models for Judging Multi-turn Speaker Consistency
Authors: Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo
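Abstract summary: We propose SpeakerSleuth, a benchmark that evaluates whether LALMs can reliably judge speaker consistency in multi-turn dialogues. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech. Models struggle to reliably detect acoustic inconsistencies.
Reference score (independently computed attention score): 12.420484491347073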
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.
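Abstract (reference translation): Large Audio-Language Models (LALMs) used as judges have emerged as a prominent approach for evaluating speech generation quality, but their ability to assess speaker consistency across multi-turn conversations has not yet been explored. This paper proposes SpeakerSleuth, a benchmark that evaluates, through three tasks reflecting real-world requirements, whether LALMs can reliably judge speaker consistency in multi-turn dialogues. We construct 1,818 human-verified evaluation instances with controlled acoustic difficulty across four diverse datasets spanning synthetic and real speech. Evaluating nine widely used LALMs shows that the models struggle to reliably detect acoustic inconsistencies. For example, given audio samples of the same speaker's turns, some models overpredict inconsistency while others are overly lenient. Models also struggle to identify exactly which turns are problematic. When other interlocutors' turns are provided together, performance degrades dramatically because models prioritize textual coherence over acoustic cues and fail to detect even an obvious gender switch for a speaker. By contrast, models perform substantially better at choosing, among several acoustic variants, the audio that best matches the speaker, which demonstrates inherent acoustic discrimination ability. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing a fundamental modality imbalance that must be addressed to build reliable audio-language judges.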

Related papers