Fugu-MT Paper Translation (Abstract): MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances

Paper abstract: MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances

arxiv url: http://arxiv.org/abs/2509.17143v1
Date: Sun, 21 Sep 2025 16:14:51 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.131513
Title: MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
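Title (reference translation): MaskVCT: A Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Improved Controllability via Multiple Guidances
Authors: Junhyeok Lee, Helin Wang, Yaohan Guan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak
Abstract summary: MaskVCT is a zero-shot voice conversion model that supports multi-factor control. The model can use continuous or quantized linguistic features to improve intelligibility and speaker similarity.
Reference score (attention score computed by this site): 30.21283213138901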
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
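Abstract (reference translation): This paper introduces MaskVCT, a zero-shot voice conversion (VC) model that enables multi-factor control through multiple classifier-free guidances (CFGs). Whereas previous VC models relied on a fixed conditioning scheme, MaskVCT integrates various conditions into a single model. To further strengthen robustness and control, the model can leverage continuous or quantized linguistic features to improve intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices let users seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments show that MaskVCT achieves the best target-speaker and accent similarity while obtaining competitive word and character error rates compared with existing baselines. Audio samples are available at https://maskvct.github.io/.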
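
The abstract attributes MaskVCT's controllability to multiple classifier-free guidances, but it does not give the formula. As a rough illustration only, the sketch below shows the generic way several CFG terms are often combined on token logits: start from the prediction with all conditions dropped and add a weighted offset for each condition. The function name `combine_guidances`, the condition names, and the weights are hypothetical, and this is not claimed to be the authors' exact formulation.

```python
import numpy as np


def combine_guidances(uncond_logits, cond_logits, weights):
    """Generic multi-condition classifier-free guidance on logits (illustrative).

    uncond_logits: array-like, prediction with every condition dropped.
    cond_logits:   dict mapping a condition name to the prediction made
                   with only that condition enabled (same shape).
    weights:       dict mapping a condition name to its guidance scale.
                   Names and scales are hypothetical, not from the paper.
    """
    base = np.asarray(uncond_logits, dtype=float)
    guided = base.copy()
    for name, logits in cond_logits.items():
        scale = weights.get(name, 0.0)  # a missing weight disables that guidance
        guided += scale * (np.asarray(logits, dtype=float) - base)
    return guided


# Toy example with a two-token vocabulary and two assumed conditions.
uncond = [0.0, 0.0]
cond = {"speaker": [1.0, 0.0], "linguistic": [0.0, 2.0]}
guided = combine_guidances(uncond, cond, {"speaker": 2.0, "linguistic": 0.5})
```

Raising one weight pushes the output toward that condition's prediction, which is the sense in which separate guidance scales let a user trade off speaker identity, linguistic content, and prosody.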

Related papers

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
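Discl-VC is a zero-shot voice conversion framework. It disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning.
Paper reference translation (metadata) (2025-05-30T07:04:23Z)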
Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching [7.151257248661491]
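CTEFM-VC is a framework that integrates content-aware timbre ensemble modeling with conditional flow matching. CTEFM-VC consistently achieves the best performance on all metrics assessing speaker similarity, speech naturalness, and intelligibility.
Paper reference translation (metadata) (2024-11-04T12:23:17Z)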
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control [50.27383290553548]
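ControlSpeech is a text-to-speech (TTS) system that fully clones a speaker's voice and allows arbitrary control and adjustment of speaking style. ControlSpeech shows comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
Paper reference translation (metadata) (2024-06-03T11:15:16Z)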
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
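This paper proposes a multi-speaker video-to-speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC). A Lip2Ind network replaces the VC content encoder to form a multi-speaker VTS system, converting silent video into acoustic units to reconstruct accurate spoken content.
Paper reference translation (metadata) (2022-02-18T08:58:45Z)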
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
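Voice conversion (VC) is a technique that aims to change speaker identity by transforming the non-linguistic information of a source utterance. We propose Voicy, a new VC framework particularly suited to noisy speech. Inspired by the autoencoder framework, the method consists of four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
Paper reference translation (metadata) (2021-06-16T15:47:06Z)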
FastVC: Fast Voice Conversion with non-parallel data [13.12834490248018]
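This paper introduces FastVC, an end-to-end model for fast voice conversion (VC). FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all. Despite its simple structure, the proposed model outperforms the baselines of the cross-lingual task of the VC Challenge 2020 in terms of naturalness.
Paper reference translation (metadata) (2020-10-08T18:05:30Z)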
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
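We improve autoencoder-based voice conversion to handle content, F0, and speaker identity simultaneously. We can control the F0 contour, generate speech whose F0 is consistent with the target speaker, and significantly improve quality and similarity.
Paper reference translation (metadata) (2020-04-15T22:00:06Z)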

The related papers list is generated automatically from the titles and abstracts of papers on this site.
