Fugu-MT Paper Translation (Summary): PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Paper summary: PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

arxiv url: http://arxiv.org/abs/2604.09111v3
Date: Tue, 14 Apr 2026 01:51:32 GMT
Status: Translation complete
Last updated in system: 2026-04-15 14:01:13.233887
Title: PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
Title (reference translation): PS-TTS: Text-to-Speech Synthesis for Achieving Natural Automated Dubbing
Authors: Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim
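Abstract summary: This paper proposes a synchronization method for AD processes that paraphrases the translated text. First, isochrony is achieved by paraphrasing the translated text with a language model. Second, we introduce PS, which uses dynamic time warping (DTW) with local costs given by vowel distances measured from training data, so that the target text is composed of vowels pronounced similarly to the source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to better preserve meaning.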
Reference score (independently computed attention score): 2.374660957323975
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.
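Abstract (reference translation): In recent years, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD), which converts the source speech of a video into target speech in a different language. However, natural AD still faces synchronization challenges such as duration and lip synchronization (lip-sync). This work therefore proposes a synchronization method for AD processes that paraphrases the translated text. First, isochrony is achieved by paraphrasing the translated text with a language model so that the target speech duration matches that of the source speech. Second, we introduce PS, which uses dynamic time warping (DTW) with local costs given by vowel distances measured from training data, so that the target text is composed of vowels pronounced similarly to the source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to better preserve meaning. The proposed methods are incorporated into the text-to-speech systems PS-TTS and PS-Comet TTS. Performance evaluation on Korean and English lip-reading datasets and a voice-actor dubbing dataset shows that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. The experiments are extended to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that it achieves more accurate lip-sync with semantic preservation than PS alone.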
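
The phonetic synchronization step described in the abstract (DTW with vowel distances as local costs) can be illustrated with a minimal sketch. This is not the authors' implementation: the vowel inventory, the distance values in `VOWEL_DIST`, and the helper names `vowel_distance`, `dtw_cost`, and `pick_best` are all hypothetical. The paper measures its vowel distances from training data, whereas the numbers below are invented for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical symmetric vowel distances (invented values, NOT from the paper,
# which measures its distances from training data).
VOWEL_DIST: Dict[Tuple[str, str], float] = {
    ("a", "e"): 0.4, ("a", "i"): 0.8, ("a", "o"): 0.5, ("a", "u"): 0.9,
    ("e", "i"): 0.3, ("e", "o"): 0.6, ("e", "u"): 0.7,
    ("i", "o"): 0.9, ("i", "u"): 0.6,
    ("o", "u"): 0.3,
}


def vowel_distance(u: str, v: str) -> float:
    """Local cost between two vowels; unknown pairs default to 1.0."""
    if u == v:
        return 0.0
    return VOWEL_DIST.get((u, v), VOWEL_DIST.get((v, u), 1.0))


def dtw_cost(source: List[str], target: List[str]) -> float:
    """Classic DTW: minimum accumulated vowel distance over all alignments."""
    n, m = len(source), len(target)
    inf = float("inf")
    # acc[i][j] = best cost of aligning source[:i] with target[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = vowel_distance(source[i - 1], target[j - 1])
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def pick_best(source: List[str], candidates: List[List[str]]) -> List[str]:
    """Choose the candidate vowel sequence closest to the source under DTW."""
    return min(candidates, key=lambda cand: dtw_cost(source, cand))
```

Under these assumptions, `pick_best(["a", "i", "o"], [["u", "e", "a"], ["a", "e", "o"]])` selects the second candidate, because its vowels align with the source at a lower accumulated cost. Extracting vowel sequences from text and generating paraphrase candidates are outside the scope of this sketch.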


Related papers