Fugu-MT Paper Translation (Summary): Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Paper summary: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

arxiv url: http://arxiv.org/abs/2603.14889v1
Date: Mon, 16 Mar 2026 06:39:30 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:36.108525
Title: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Title (reference translation): Modeling and Benchmarking Spoken Dialogue Rewards with Consideration of Modality and Colloquialness
Authors: Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao
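Abstract summary: We introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision. Experiments show that SDiaReward achieves state-of-the-art pairwise preference accuracy.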
Reference score (independently computed attention measure): 45.06366615980232
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
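Abstract (reference translation): The rapid evolution of end-to-end spoken dialogue systems requires going beyond mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two important gaps: the modality gap, which involves prosody and emotion, and the colloquialness gap, which distinguishes written scripts from natural speech. To address these challenges, we present SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments show that SDiaReward achieves state-of-the-art pairwise preference accuracy and significantly outperforms general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.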

Related papers