Fugu-MT paper translation (abstract): PromptTTS 2: Describing and Generating Voices with Text Prompt

Paper summary: PromptTTS 2: Describing and Generating Voices with Text Prompt

arxiv url: http://arxiv.org/abs/2309.02285v1
Date: Tue, 5 Sep 2023 14:45:27 GMT
Status: translation complete
Last updated in system: 2023-09-06 14:13:27.446080
Title: PromptTTS 2: Describing and Generating Voices with Text Prompt
Title (reference translation): PromptTTS 2: Describing and Generating Voices with Text Prompt
Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
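Abstract summary: TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, in which not all details of voice variability can be described in a text prompt, and 2) the limited availability of text prompt datasets. This paper proposes PromptTTS 2, which addresses these challenges with a variation network that provides voice variability information not captured by the text prompt. The prompt generation pipeline produces text prompts by using a speech understanding model to recognize voice attributes from speech and a large language model to formulate the text prompt.
Reference score (independently computed attention score): 102.93668747303975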
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Speech conveys more information than just text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and a large data labeling cost are required to write text prompts for speech. In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides variability information of voice not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech with a speech understanding model that recognizes voice attributes (e.g., gender, speed) from speech and a large language model that formulates the text prompt based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online at https://speechresearch.github.io/prompttts2.
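Abstract (reference translation): Speech conveys more information than text alone, since the same word can be uttered in various voices to convey different information. Compared with traditional text-to-speech (TTS) methods that rely on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly, because speech prompts can be hard to find or may not exist at all. TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, in which not all details of voice variability can be described in the text prompt; and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and a large data labeling cost. This paper introduces PromptTTS 2, which addresses these challenges with a variation network that provides voice variability information not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts, from the text prompt representation, the representation extracted from the reference speech (which contains full information about the voice). The prompt generation pipeline generates text prompts for speech by using a speech understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate the text prompt from the recognition results. In experiments on a large-scale (44K hours) speech dataset, PromptTTS 2 generates voices more consistent with text prompts than previous work and supports sampling of diverse voice variability. In addition, the prompt generation pipeline produces high-quality prompts and eliminates the large labeling cost. The demo page of PromptTTS 2 is available online.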
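
The two-stage prompt generation pipeline described in the abstract (recognize voice attributes, then formulate a text prompt from them) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the input feature names, and the numeric thresholds are hypothetical, the rule-based recognizer only stands in for the paper's speech understanding model, and the string template only stands in for the large language model.

```python
from typing import Callable, Dict

Attributes = Dict[str, str]


def recognize_attributes(features: Dict[str, float]) -> Attributes:
    """Hypothetical stand-in for the speech understanding model.

    The thresholds are arbitrary placeholders, not values from the paper.
    """
    pitch = "high" if features["mean_pitch_hz"] >= 165.0 else "low"
    speed = "fast" if features["words_per_second"] >= 3.0 else "slow"
    return {"pitch": pitch, "speed": speed}


def template_formulator(attributes: Attributes) -> str:
    """Hypothetical stand-in for the LLM that writes the text prompt."""
    return (
        f"A {attributes['pitch']}-pitched voice "
        f"speaking at a {attributes['speed']} pace."
    )


def generate_text_prompt(
    features: Dict[str, float],
    formulate: Callable[[Attributes], str] = template_formulator,
) -> str:
    # Stage 1: recognize voice attributes from (features of) the speech.
    attributes = recognize_attributes(features)
    # Stage 2: formulate a text prompt from the recognition results.
    return formulate(attributes)


if __name__ == "__main__":
    print(generate_text_prompt({"mean_pitch_hz": 210.0, "words_per_second": 3.5}))
```

Passing the formulator as a parameter keeps the sketch open to swapping the template for a real language model call without changing the pipeline structure.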

Related papers