Fugu-MT Paper Translation (Abstract): No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Paper overview: No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

arxiv url: http://arxiv.org/abs/2509.18531v1
Date: Tue, 23 Sep 2025 01:51:38 GMT
Status: Translation complete
System update date: 2025-09-24 20:41:27.641089
Title: No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS
Title (reference translation): No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS
Authors: Seungyoun Shin, Dongha Ahn, Jiwoo Kim, Sungwook Jeon
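Abstract summary: Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). Because there is no verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates but collapses prosody into monotone, unnatural speech. The authors address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round.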
Reference score (independently calculated attention score): 1.9492333719038202
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
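The abstract describes the DPO step only in words, so the following is a minimal sketch of the standard per-pair DPO loss rather than the authors' implementation. It assumes each preference pair comes with sequence log-probabilities under the trained policy and under a frozen reference model, which in an iterative scheme would plausibly be the model from the previous round. The function names, argument names, and the `beta` value are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the log-probability of a whole output sequence.
    beta (assumed value) controls how strongly the policy is kept close
    to the reference model.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy shifted toward the human-preferred sample: the loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and rises if it drifts the other way, which is how the reference term acts as a regularizer.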
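Abstract (reference translation): Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, because there is no verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech, and adding speaker similarity further destabilizes training and degrades CER. The method addresses this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, the method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.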
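The abstract uses character error rate (CER) as its transcription-oriented signal without defining it. As a rough illustration, CER is commonly computed as the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below assumes that common definition; the function names are hypothetical, and the paper's exact normalization and ASR setup are not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            substitution_cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + substitution_cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("abcd", "abxd"))
```

A low CER only says the words were rendered intelligibly. It carries no information about intonation or rhythm, which is consistent with the abstract's point that optimizing it alone can leave prosody flat.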

Related papers