Fugu-MT Paper Translation (Abstract): AdaSpeech: Adaptive Text to Speech for Custom Voice

Paper summary: AdaSpeech: Adaptive Text to Speech for Custom Voice

arxiv url: http://arxiv.org/abs/2103.00993v1
Date: Mon, 1 Mar 2021 13:28:59 GMT
Status: Translation complete
Last updated in system: 2021-03-03 17:22:03.921181
Title: AdaSpeech: Adaptive Text to Speech for Custom Voice
Title (reference translation): AdaSpeech: Adaptive Text to Speech for Custom Voice
Authors: Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu
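Abstract summary: We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. Experimental results show that AdaSpeech achieves better adaptation quality than baseline methods, with only about 5K speaker-specific parameters per speaker.
Reference score (independently computed attention metric): 104.69219752194863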
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.
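Abstract (reference translation): Custom voice, a specific text-to-speech (TTS) service on commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker from only a small amount of speech data. It poses two challenges for TTS adaptation: 1) to support diverse customers, the adaptation model must handle diverse acoustic conditions that can differ greatly from the source speech data, and 2) to support a large number of customers, the adaptation parameters for each target speaker must be small enough to reduce memory usage while maintaining high voice quality. In this paper, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. AdaSpeech uses several techniques to address these two challenges: 1) To handle different acoustic conditions, two acoustic encoders extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; at inference, the utterance-level vector is extracted from a reference speech and an acoustic predictor predicts the phoneme-level vectors. 2) To better trade off adaptation parameters against voice quality, conditional layer normalization is introduced into the mel-spectrogram decoder of AdaSpeech, and this part is fine-tuned together with the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (whose acoustic conditions differ from LibriTTS) with little adaptation data, e.g., 20 sentences, about 1 minute of speech. Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods with only about 5K speaker-specific parameters per speaker, demonstrating its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.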
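
The abstract names conditional layer normalization but does not say how it is computed. The following is a minimal sketch, not the authors' implementation. It assumes that the scale and bias of a layer normalization are each predicted from the speaker embedding by a small linear layer. The class name, the dimensions (hidden size 8, speaker embedding size 4), the initialization, and the helper functions are all hypothetical and chosen only for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one hidden vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, x):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ConditionalLayerNorm:
    """Layer norm whose scale and bias depend on a speaker embedding.

    Assumption (not stated in the abstract): scale and bias each come
    from one linear layer applied to the speaker embedding.
    """

    def __init__(self, hidden_dim, speaker_dim, rng):
        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_scale = init(hidden_dim, speaker_dim)
        self.b_scale = [1.0] * hidden_dim  # scale starts near identity
        self.w_bias = init(hidden_dim, speaker_dim)
        self.b_bias = [0.0] * hidden_dim   # bias starts near zero

    def speaker_params(self, speaker_embedding):
        """Predict the per-speaker scale and bias vectors."""
        scale = linear(self.w_scale, self.b_scale, speaker_embedding)
        bias = linear(self.w_bias, self.b_bias, speaker_embedding)
        return scale, bias

    def __call__(self, hidden, speaker_embedding):
        scale, bias = self.speaker_params(speaker_embedding)
        normed = layer_norm(hidden)
        return [s * n + b for s, n, b in zip(scale, normed, bias)]


# Hypothetical usage with toy dimensions.
rng = random.Random(0)
cln = ConditionalLayerNorm(hidden_dim=8, speaker_dim=4, rng=rng)
hidden = [float(i) for i in range(8)]
neutral_speaker = [0.0, 0.0, 0.0, 0.0]
other_speaker = [0.5, -0.2, 0.1, 0.3]
out_neutral = cln(hidden, neutral_speaker)
out_other = cln(hidden, other_speaker)
```

In this sketch the same hidden vector yields different outputs for different speaker embeddings, and only the small speaker-dependent part would need to be stored per speaker. That is the general idea behind keeping the per-speaker parameter count low, although the real layer sizes and the exact set of fine-tuned parameters are not given in the abstract.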

