Fugu-MT Paper Translation (Summary): Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Paper summary: Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

arxiv url: http://arxiv.org/abs/2307.15484v3
Date: Mon, 18 Dec 2023 12:48:01 GMT
Status: Translation complete
Last updated in system: 2023-12-20 23:31:47.108025
Title: Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
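Title (reference translation): Minimally-supervised speech synthesis using a conditional diffusion model and a language model: a comparison of semantic coding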
Authors: Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
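Abstract summary: We propose Diff-LM-Speech, Tetra-Diff-Speech, and Tri-Diff-Speech. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Experimental results show that the proposed methods outperform baseline methods.
Reference score (independently computed attention level): 57.42429912884543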
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules that design a duration diffusion model to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules that verify the non-necessity of existing semantic encoding models and achieve the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.
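Abstract (reference translation): Recently, there has been growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by decoupling TTS using two types of discrete speech representations and two sequence-to-sequence tasks. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive methods, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, we propose three progressive methods. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules, which verifies the non-necessity of existing semantic encoding models. Experimental results show that the proposed methods outperform baseline methods. Audio samples are provided on a website.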

Related papers

The Design Space of Tri-Modal Masked Diffusion Models [28.1724656131266]
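We propose the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. Our work is the largest systematic open study of multimodal discrete diffusion models conducted to date.
Reference translation of paper (metadata) (2026-02-25T01:02:11Z)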
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation [24.85655658070008]
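Diffusion Transformer Autoregressive Modeling (DiTAR) is a patch-based autoregressive framework that combines a language model with a diffusion transformer. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
Reference translation of paper (metadata) (2025-02-06T10:09:49Z)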
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
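Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. A non-autoregressive framework improves controllability, and a duration diffusion model enables diversified prosodic expression.
Reference translation of paper (metadata) (2023-09-27T09:27:03Z)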
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
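Diffusion models can generate high-quality data through a probabilistic approach. However, they suffer from slow generation because many time steps are required. This paper proposes a speech synthesis model that uses two discriminators: a diffusion discriminator that learns the distribution of the reverse process, and a spectrogram discriminator that learns the distribution of the generated data.
Reference translation of paper (metadata) (2023-08-03T07:22:04Z)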
Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
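State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesize high-fidelity, natural speech. We observe characteristic audio distortions on expressive speech datasets. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, particularly on expressive datasets.
Reference translation of paper (metadata) (2023-06-02T11:03:26Z)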
DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
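This paper proposes DiffVoice, a novel text-to-speech model based on latent diffusion. Subjective evaluations on the LJSpeech and LibriTTS datasets show that the method outperforms the best publicly available systems in naturalness.
Reference translation of paper (metadata) (2023-04-23T21:05:33Z)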
SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
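This work uses diffusion models for sequence-to-sequence text generation. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. Experimental results show strong sequence-to-sequence generation performance in terms of text quality and inference time.
Reference translation of paper (metadata) (2022-12-20T15:16:24Z)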
DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [81.84866217721361]
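DiffusionBERT is a new generative masked language model based on discrete diffusion models. This paper proposes a new noise schedule for the forward diffusion process that controls the degree of noise added at each step. In experiments on unconditional text generation, DiffusionBERT improves significantly over existing text diffusion models.
Reference translation of paper (metadata) (2022-11-28T03:25:49Z)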
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
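TranSpeech is a speech-to-speech translation model with bilateral perturbation. We build a non-autoregressive S2ST method that applies iterative masking and predicts unit choices. TranSpeech significantly improves inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
Reference translation of paper (metadata) (2022-05-25T06:34:14Z)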
Speech Summarization using Restricted Self-Attention [79.89680891246827]
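We propose a single model optimized for speech summarization. The proposed model directly summarizes speech from the How-2 corpus.
Reference translation of paper (metadata) (2021-10-12T18:21:23Z)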

The related papers list is generated automatically from the titles and abstracts of papers on this site.

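This is information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).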