Fugu-MT Paper Translation (Abstract): SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Paper abstract: SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

arxiv url: http://arxiv.org/abs/2310.09653v1
Date: Sat, 14 Oct 2023 19:51:17 GMT
Status: Translation complete
Last updated in system: 2023-10-17 19:01:30.083817
Title: SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
Title (reference translation): SelfVC: Voice Conversion with Iterative Refinement Using Self Transformations
Authors: Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
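Abstract summary: SelfVC is a training strategy for improving a voice conversion model with self-synthesized examples. SelfVC is applicable to a range of tasks, such as zero-shot voice conversion, cross-lingual voice conversion, and controllable speech synthesis.
Reference score (independently computed attention score): 44.827922493748176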
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on explicitly disentangling speech representations to separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss by discarding finer nuances of the original signal. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. In this training approach, the current state of the synthesis model is used to generate voice-converted variations of an utterance, which serve as inputs for the reconstruction task, ensuring a continuous and purposeful refinement of the model. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. SelfVC is trained without any text and is applicable to a range of tasks such as zero-shot voice conversion, cross-lingual voice conversion, and controllable speech synthesis with pitch and pace modifications. SelfVC achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
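Abstract (reference translation): We propose SelfVC, a training strategy that iteratively improves a voice conversion model using self-synthesized examples. Previous work on voice conversion has focused on explicitly disentangling speech representations so that speaker characteristics and linguistic content are encoded separately. However, disentangling speech representations with task-specific loss terms to capture such attributes can cause information loss by discarding the finer nuances of the original signal. In this work, rather than explicitly disentangling attributes with loss terms, we present a framework for training a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations in order to train predictive submodules in the synthesis model. Next, we propose a training strategy that iteratively improves the synthesis model for voice conversion by creating a challenging training objective from self-synthesized examples. In this approach, the current state of the synthesis model generates voice-converted variations of an utterance, which serve as inputs to the reconstruction task, so that the model is refined continuously and purposefully. We show that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech compared with a baseline voice conversion model trained only on heuristically perturbed inputs. SelfVC is trained without any text and is applicable to a range of tasks, such as zero-shot voice conversion, cross-lingual voice conversion, and controllable speech synthesis with pitch and pace modifications. SelfVC achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating the naturalness, speaker similarity, and intelligibility of synthesized audio.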
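
The self-transformation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dictionary-based "utterances", the stand-in functions `toy_synthesize`, `content_of`, and `speaker_of`, and the choice to pair the converted content with the original speaker are all assumptions made for illustration. It only shows how the current synthesizer might be used to build inputs for the reconstruction task.

```python
import random

def toy_synthesize(content, speaker):
    # Hypothetical stand-in for the synthesis model: pairs content with a speaker.
    return {"content": list(content), "speaker": speaker}

def content_of(utt):
    # Hypothetical stand-in for an SSL-based content representation.
    return list(utt["content"])

def speaker_of(utt):
    # Hypothetical stand-in for a speaker verification embedding.
    return utt["speaker"]

def self_transformed_examples(batch, synthesize, rng):
    """Build (input_content, speaker, target) triples for reconstruction."""
    examples = []
    for utt in batch:
        # Pick another utterance in the batch to supply a different voice.
        donor = rng.choice([u for u in batch if u is not utt])
        # Voice-converted variation produced by the current synthesizer.
        converted = synthesize(content_of(utt), speaker_of(donor))
        # Assumption: the model must reconstruct the original utterance from
        # the converted variation's content and the original speaker.
        examples.append((content_of(converted), speaker_of(utt), utt))
    return examples

batch = [
    {"content": ["h", "i"], "speaker": "A"},
    {"content": ["y", "o"], "speaker": "B"},
]
examples = self_transformed_examples(batch, toy_synthesize, random.Random(0))
```

In a real training loop, each triple would feed a reconstruction loss between the synthesizer's output and the target, followed by a parameter update. Those steps are omitted here because the abstract does not specify them.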

Related papers

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
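TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We show that the model achieves excellent separation performance both with and without transcript conditioning. We also measure automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the usefulness of our model.
Paper reference translation (metadata) (2023-08-21T01:52:01Z)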
Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion [35.23123094710891]
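We propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. Experiments show that the method achieves similarity and naturalness comparable to a supervised method.
Paper reference translation (metadata) (2023-05-16T04:52:29Z)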
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
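The novel point of the proposed method is the direct use of an SSL model to obtain embedding vectors from speech representations trained on a large amount of data. The disentangled embeddings improve reproduction performance for unseen speakers and enable rhythm transfer conditioned on different speech.
Paper reference translation (metadata) (2023-04-24T10:15:58Z)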
ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
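We propose a zero-shot voice conversion method that uses speech representations trained with self-supervised learning. We develop a multi-task model that decomposes an utterance into features such as linguistic content, speaker characteristics, and speaking style. We then develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from these representations.
Paper reference translation (metadata) (2023-02-16T08:10:41Z)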
A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
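Existing systems ignore the correlation between prosody and linguistic content, which degrades the naturalness of converted speech. We propose a cascaded modular system that leverages self-supervised discrete speech units as the language representation. Experiments show that the system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
Paper reference translation (metadata) (2022-11-12T00:54:09Z)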
Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
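We investigate zero-shot voice conversion from the new perspective of self-supervised disentangled speech representation learning. Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder. On the TIMIT and VCTK datasets, the method performs well on both speaker verification (SV) on the speaker and content embeddings and subjective evaluation, namely voice naturalness and similarity, and remains robust even with noisy source/target utterances.
Paper reference translation (metadata) (2022-03-30T23:03:19Z)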
Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
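This work shows for the first time that a synthesis-based approach can handle this problem well. Specifically, we propose a speech separation/enhancement model based on the recognition of discrete symbols. By using a synthesis model that takes discrete symbols as input, each target speech can be re-synthesized after the discrete symbol sequence is predicted.
Paper reference translation (metadata) (2021-12-17T08:35:40Z)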
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
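One-shot voice conversion can be achieved effectively through speech representation disentanglement. We use vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effectively disentangled speech representations.
Paper reference translation (metadata) (2021-06-18T13:50:38Z)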
Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
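Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. We propose to transfer knowledge from other speech processing tasks for which large-scale corpora are readily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
Paper reference translation (metadata) (2020-08-07T11:02:07Z)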

The related papers list is generated automatically from the titles and abstracts of papers on this site.

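This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.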