Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition, and demonstrated that simultaneously disentangling the content embedding and the speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to reduce the phonetic-structure information during the learning process, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose the conditional DSVAE, a new model that introduces a content bias as a condition on the prior and reshapes the content embedding sampled from the posterior distribution. In our experiments on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve much better phoneme classification accuracy, stabilized vocalization, and better zero-shot VC performance compared with the competitive DSVAE baseline.
Index Terms: Voice Conversion, DSVAE, Representation Learning, Generative Model, Zero-shot style transfer
1. Introduction
Voice Conversion (VC) is a technique that converts the non-linguistic information of a given utterance to a target style (e.g., speaker identity, emotion, accent or rhythm), while preserving the linguistic content information.
VC has become a very active research topic in speech processing, with potential applications in privacy protection, speaker de-identification, audio editing, and singing voice conversion/generation [1–3]. Current VC systems embrace the technological advancements from statistical modeling to deep learning and have made a major shift in how the pipeline is developed [1]. For example, conventional VC approaches with parallel training data utilize a conversion module to map source acoustic features to target acoustic features; the source-target pair has to be aligned before the mapping [4]. For VC with non-parallel data, such direct feature mapping is difficult.
Instead, recent studies explicitly learn the speaking style and content representations and train a neural network as a decoder to reconstruct the acoustic features, under the assumption that the decoder also generalizes well when the content and speaker style are swapped during conversion. The encoder decomposes the speaking style and the content information into latent embeddings, and the decoder generates a voice sample by combining both pieces of disentangled information. Nevertheless, these models require supervision such as positive pairs of utterances (i.e., two utterances from the same speaker), and the systems still have to rely on pretrained speaker models. This category of methods usually assumes that the speakers of the source-target VC pair are known in advance, which limits the application of such models in the real world. At the same time, a number of regularization terms have to be applied during training, which casts doubt on how well such systems generalize to zero-shot non-parallel VC scenarios.
Our previous study proposed a novel disentangled sequential variational autoencoder (DSVAE) [15] as a backbone framework for zero-shot non-parallel VC.
We designed two branches in the encoder of DSVAE to hold the time-varying and the time-invariant components, where a balanced content and speaking style information flow is achieved with the VAE training [16].
We demonstrated that the vanilla VAE [16, 17] loss can be extended to force strong disentanglement between speaker and content components, which is essential for the success of challenging zero-shot non-parallel VC.
We find that the randomly initialized prior distribution in the content branch of the baseline DSVAE is not optimal for preserving the phonetic/content structure information. The randomness of the content embedding zc has a negative impact on phoneme classification and VC.
To cope with this issue, we propose conditional DSVAE (C-DSVAE), an improved framework that corrects the randomness in the content prior distribution with content bias.
Alternative content biases derived from unsupervised learning, supervised learning and self-supervised learning are explored in this study. The VC experiments on the VCTK dataset demonstrate clearly stabilized vocalization and significantly improved performance with the new content embeddings.
Phoneme classification with zc also justifies the effectiveness of the proposed model in an objective way.
2. The DSVAE Baseline
2.1. Related Work
DSVAE [17] was proposed as a sequential generative model that disentangles the time-invariant information from the time-variant information in the latent space.
Recently, we extended the DSVAE by balancing the information flow between speaker and content representations, achieving state-of-the-art performance for zero-shot non-parallel VC [15].
The shared encoder Eshare takes X as input and outputs a latent representation, with the speaker encoder ES and the content encoder EC modeling the posterior distributions qθ(zs|X) and qθ(zc|X), respectively.
zs and zc are then sampled from qθ(zs|X) and qθ(zc|X).
In the next stage, the concatenation of zs and zc is passed into the decoder D to reconstruct the mel-spectrogram X̂, i.e., X̂ = D(zs, zc).
Both the prior distribution pθ(z) and the posterior distribution qθ(z|X) are designed to follow the independence criterion, which is similar to [15, 17–19].
Specifically, they can be factorized as in Eq. (1) and Eq. (2).
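For reference, the factorizations implied by this independence criterion can be sketched as follows; this is a reconstruction consistent with the notation above and may differ cosmetically from the original Eq. (1) and Eq. (2):

```latex
% Factorized prior and posterior (a sketch consistent with the text above).
p_\theta(z) = p(z_s)\, p_\theta(z_c)
            = p(z_s) \prod_{t=1}^{T} p_\theta(z_{c,t} \mid z_{c,<t})

q_\theta(z \mid X) = q_\theta(z_s \mid X)\, q_\theta(z_c \mid X)
                   = q_\theta(z_s \mid X) \prod_{t=1}^{T} q_\theta(z_{c,t} \mid X)
```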
Note that we use qθ(zc,t|X) to model the content posterior since the content encoder consists of BiLSTM modules, which is slightly different from the streaming posterior qθ(zc,t|X<t) described in [17, 18], where unidirectional LSTMs or RNNs are adopted.
Given X1 as the source utterance and X2 as the target utterance for VC inference, the transferred sample is simply D(zs2, zc1), where zs2 and zc1 are sampled from qθ(zs|X2) and qθ(zc|X1). We use a vocoder to convert the mel-spectrogram to the waveform.
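A minimal sketch of this inference procedure is shown below; `model.encode_speaker`, `model.encode_content`, `model.decode` and `vocoder` are hypothetical handles for the trained modules described above, not the authors' actual API.

```python
# Minimal zero-shot VC inference sketch: combine the speaker embedding of the
# target utterance with the content embedding of the source utterance, then vocode.
import torch

@torch.no_grad()
def convert(model, vocoder, mel_src: torch.Tensor, mel_tgt: torch.Tensor) -> torch.Tensor:
    """Convert the content of `mel_src` into the voice of `mel_tgt`."""
    z_c1 = model.encode_content(mel_src)   # z_c1 ~ q(z_c | X1), time-varying branch
    z_s2 = model.encode_speaker(mel_tgt)   # z_s2 ~ q(z_s | X2), time-invariant branch
    mel_hat = model.decode(z_s2, z_c1)     # X_hat = D(z_s2, z_c1)
    return vocoder(mel_hat)                # mel-spectrogram -> waveform (HiFi-GAN in Sec. 2.4)
```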
2.4. Implementation Details
Table 1 provides detailed descriptions of each module of the DSVAE baseline.
For the shared encoder and the decoder, instance normalization [20] is applied along both the time and frequency axes.
For the speaker encoder ES, the content encoder EC and the content prior model pθ(zc), two dense layers are used to model the mean and standard deviation of q(zs|X), q(zc,t|X) and p(zc,t|zc,<t), respectively. For the prior models, p(zs) is the standard normal distribution and pθ(zc) is modeled by an autoregressive LSTM: at each time step t, the model generates p(zc,t|zc,<t), from which zc,t is sampled and taken as the input for the next time step.
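As an illustration, the autoregressive content prior can be implemented roughly as follows. This is a sketch assuming a diagonal Gaussian at each step, with illustrative layer sizes rather than the paper's exact configuration.

```python
# Sketch of an autoregressive LSTM content prior p(z_{c,t} | z_{c,<t}); module and
# variable names are illustrative, not the authors' code.
import torch
import torch.nn as nn

class ContentPrior(nn.Module):
    def __init__(self, z_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTMCell(z_dim, hidden)
        self.mu = nn.Linear(hidden, z_dim)        # dense layer for the mean
        self.logvar = nn.Linear(hidden, z_dim)    # dense layer for the (log) variance
        self.z_dim = z_dim

    def forward(self, batch: int, T: int):
        """Roll the prior out for T steps, feeding each sample back as the next input."""
        h = torch.zeros(batch, self.lstm.hidden_size)
        c = torch.zeros(batch, self.lstm.hidden_size)
        z_t = torch.zeros(batch, self.z_dim)      # initial input at t = 0
        mus, logvars, zs = [], [], []
        for _ in range(T):
            h, c = self.lstm(z_t, (h, c))
            mu, logvar = self.mu(h), self.logvar(h)
            z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
            mus.append(mu)
            logvars.append(logvar)
            zs.append(z_t)
        return torch.stack(zs, 1), torch.stack(mus, 1), torch.stack(logvars, 1)
```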
Note that pθ(zc) is independent of the input data X. The decoder consists of a prenet and a postnet, as introduced in [10].
We use HiFi-GAN V1 [21] instead of WaveNet [22] as the vocoder, since HiFi-GAN yields better speech quality with a much faster inference speed.
One problem with the vanilla DSVAEs [15, 17–19] is that the prior distribution is randomly initialized and thus does not impose any constraint to regularize the posterior distribution. Since the phonetic structure is explicitly modeled by qθ(zc|X), according to Eq. (5), one of the objectives is to minimize the KL divergence between qθ(zc|X) and pθ(zc).
In that sense, the learned phonetic structure qθ(zc|X) for all utterances will also follow the prior distribution, which does not reflect the real phonetic structure of the utterance.
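Concretely, with the factorizations above, the content KL term referred to here takes the form below (a sketch; the exact form of Eq. (5) may differ in notation), which shows why the posterior is pulled toward a data-independent prior:

```latex
% Content KL term between the posterior and the unconditional autoregressive prior.
D_{\mathrm{KL}}\!\big(q_\theta(z_c \mid X)\,\big\|\,p_\theta(z_c)\big)
  = \sum_{t=1}^{T} \mathbb{E}_{q_\theta(z_{c,<t} \mid X)}
    \Big[ D_{\mathrm{KL}}\!\big(q_\theta(z_{c,t} \mid X)\,\big\|\,p_\theta(z_{c,t} \mid z_{c,<t})\big) \Big]
```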
This phenomenon can be observed in Fig. 2(a) and Fig. 2(c), which show the t-SNE [24] visualizations of zc, comparing the content embeddings learned by the pretrained DSVAE [15] with the raw mel-spectrograms of the same utterances. The DSVAE representations are not phonetically discriminative in comparison with the mel-spectrograms, and they essentially follow the random prior distribution. The aforementioned problem is detrimental to disentanglement and will generate discontinuous speech with unstable vocalizations.
Our solution is that, instead of modeling pθ(zc), we model the conditional content prior distribution pθ(zc|Y(X)) such that the prior distribution is meaningful in carrying the content information.
The expectation is that, by incorporating the content bias into the prior distribution pθ(zc), the posterior distribution qθ(zc|X) will retain the phonetic structure of X.
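Under this conditioning, the same KL term is simply computed against the biased prior instead (again a sketch rather than the exact objective; the remaining reconstruction and speaker terms are unchanged):

```latex
% Conditional content KL term used by C-DSVAE (sketch).
D_{\mathrm{KL}}\!\big(q_\theta(z_c \mid X)\,\big\|\,p_\theta(z_c \mid Y(X))\big)
  = \sum_{t=1}^{T} \mathbb{E}_{q_\theta(z_{c,<t} \mid X)}
    \Big[ D_{\mathrm{KL}}\!\big(q_\theta(z_{c,t} \mid X)\,\big\|\,p_\theta(z_{c,t} \mid z_{c,<t}, Y(X))\big) \Big]
```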
3.2. Proposed C-DSVAE
Based on the aforementioned discussion, we introduce four conditional DSVAE candidates, C-DSVAE(Align), C-DSVAE(BEST-RQ), C-DSVAE(Mel) and C-DSVAE(WavLM), based on different content bias sources.
C-DSVAE(Align). In order for zc, or qθ(zc|X), to keep the phonetic structure of the speech data X, the content bias Y(X) is expected to carry fine-grained phonetic information. One natural choice is to let Y(X) be the forced alignment of X. To do so, we employ the Kaldi toolkit [25] to train a monophone model with 42 phonemes to obtain the forced alignment. In the next step, one-hot vectors are derived from these frame-level labels and concatenated with the original inputs of pθ(zc) at each time step, so that the new content prior becomes pθ(zc|Y(X)).
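A small sketch of this conditioning step is given below, assuming frame-level integer phoneme ids from the forced alignment; the exact point of concatenation is one reasonable implementation choice rather than the paper's verbatim code.

```python
# Turn frame-level forced-alignment labels into a one-hot content bias Y(X) and
# concatenate it with the prior's autoregressive input at each time step.
import torch
import torch.nn.functional as F

NUM_PHONES = 42  # monophone inventory used for the forced alignment

def content_bias(alignment: torch.Tensor) -> torch.Tensor:
    """alignment: (batch, T) integer phoneme ids -> (batch, T, 42) one-hot bias."""
    return F.one_hot(alignment, num_classes=NUM_PHONES).float()

def prior_input(z_prev: torch.Tensor, bias_t: torch.Tensor) -> torch.Tensor:
    """Concatenate the previous content sample with the bias for step t, so the
    autoregressive LSTM models p(z_{c,t} | z_{c,<t}, Y(X))."""
    return torch.cat([z_prev, bias_t], dim=-1)
```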
To handle this problem, we apply k-means to the pretrained features.
Specifically, we use the pretrained WavLM features for k-means clustering [29].
The advantage of WavLM is that the aforementioned bias from mel-spectrograms is alleviated via iterative clustering and the masked prediction training process.
The other point is that WavLM acts as a teacher model so that the phonetic structure knowledge can be transferred from a larger corpus, which potentially improves the robustness and generalization capacity.
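A possible offline labeling pipeline is sketched below, assuming WavLM frame features have already been extracted; the number of clusters is an illustrative choice, not the paper's setting.

```python
# Derive discrete content-bias labels by clustering pretrained WavLM frame
# features with k-means (offline), then assign per-frame pseudo-labels.
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans(features: np.ndarray, n_clusters: int = 128, seed: int = 0) -> KMeans:
    """features: (num_frames, feat_dim) WavLM features pooled over the corpus."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)

def pseudo_labels(km: KMeans, utt_features: np.ndarray) -> np.ndarray:
    """Assign each frame of one utterance to its nearest cluster (the content bias)."""
    return km.predict(utt_features)
```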
Following [15], we use the same training configuration for all experiments: the Adam optimizer [32] is used with an initial learning rate of 5e-4.
We randomly select segments of 100 frames (1.6s) from the whole utterances for training.
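A minimal sketch of this segment selection is shown below; the repetition padding for utterances shorter than the crop length is an assumption, not a detail stated in the paper.

```python
# Randomly crop a 100-frame (~1.6 s) training segment from a mel-spectrogram.
import torch

def random_crop(mel: torch.Tensor, seg_len: int = 100) -> torch.Tensor:
    """mel: (T, n_mels). Returns a random contiguous segment of `seg_len` frames."""
    T = mel.size(0)
    if T <= seg_len:                           # pad short utterances by repetition (assumption)
        reps = (seg_len + T - 1) // T
        return mel.repeat(reps, 1)[:seg_len]
    start = torch.randint(0, T - seg_len + 1, (1,)).item()
    return mel[start:start + seg_len]
```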
4.2. Experimental Results
C-DSVAE(BEST-RQ). Given continuous representations as input, VQ-VAE [27] derives the corresponding quantized vectors as well as discrete indices by looking up a closed-set codebook. C-DSVAE(Mel), C-DSVAE(Align) and C-DSVAE(WavLM) deliver the desired content distributions, which successfully result in phonetically discriminative embeddings.
The phonetic structure of raw speech is retained and better disentanglement is expected.
We also perform phoneme classification to evaluate content embeddings in an objective way.
The phoneme classifier is mentioned in Sec. 2.4.
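For concreteness, a frame-level probe of this kind could look like the sketch below; the linear architecture is an assumption, and the encoder is kept frozen so the probe only measures how much phonetic information zc retains.

```python
# Frame-level phoneme probe over frozen content embeddings z_c (illustrative).
import torch
import torch.nn as nn

class PhonemeProbe(nn.Module):
    def __init__(self, z_dim: int = 64, num_phones: int = 42):
        super().__init__()
        self.linear = nn.Linear(z_dim, num_phones)

    def forward(self, z_c: torch.Tensor) -> torch.Tensor:
        """z_c: (batch, T, z_dim) frozen content embeddings -> per-frame phoneme logits."""
        return self.linear(z_c)

# Training uses frame-level cross-entropy against forced-alignment phoneme labels,
# with the DSVAE/C-DSVAE encoder kept frozen.
```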
A consistent conclusion can be drawn that DSVAE and C-DSVAE(BEST-RQ) give lower accuracy.
The reason C-DSVAE(Mel), C-DSVAE(Align) and C-DSVAE(WavLM) outperform the mel-spectrogram is that the latter contains coarse-grained phonetic structure, which can be improved via offline clustering.
C-DSVAE(Align) is better than C-DSVAE(Mel) since alignment is obtained with a supervised alignment model.
C-DSVAE(WavLM) gives the best result because masked language modeling and iterative clustering tend to capture a better phonetic structure, and the knowledge can also be transferred from a larger corpus. Except for C-DSVAE(Mel), our proposed C-DSVAEs outperform the DSVAE baseline by a large margin in terms of naturalness and similarity under both seen-to-seen and unseen-to-unseen scenarios, and the MOS results are consistent with the phoneme experiments introduced in Sec. 4.2.1. The only exception is C-DSVAE(Mel), which achieves worse naturalness than C-DSVAE(BEST-RQ) and worse similarity than the DSVAE baseline; a potential reason is that the speaker embeddings learned in C-DSVAE(Mel) are not as discriminative as those in either the DSVAE baseline or the other C-DSVAEs.
The trend is similar to the phoneme classification and VC MOS test, which indicates that stable content embeddings with more phonetic structure information boost the VC performance in both subjective and objective evaluations.
Table 4: Test accuracy for transferred voice verification across different models.
5. Conclusion
This paper proposes C-DSVAE, a novel voice conversion system that introduces a content bias into the prior modeling to force the content embeddings to retain the phonetic structure of the raw speech. The VC experiments on the VCTK dataset demonstrate clearly stabilized vocalization and significantly improved performance with the new content embeddings. With these contributions and progress, our C-DSVAE achieves state-of-the-art voice conversion performance.
6. References
[1] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[2] F. Bahmaninezhad, C. Zhang, and J. H. Hansen, "Convolutional neural network based speaker de-identification," in ISCA Odyssey, 2018.
[3] L. Zhang, C. Yu, H. Lu, C. Weng, C. Zhang, Y. Wu, X. Xie, Z. Li, and D. Yu, "Durian-sc: Duration informed attention network based singing voice conversion system," arXiv preprint arXiv:2008.03009, 2020.
[4] D. J. Berndt and J. Clifford, "Using dynamic time warping to find patterns in time series," in KDD Workshop, vol. 10, no. 16, Seattle, WA, USA, 1994, pp. 359–370.
[5] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[6] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[7] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, "Voice conversion across arbitrary speakers based on a single target-speaker utterance," Proc. Interspeech 2018, pp. 496–500, 2018.
[8] H. Guo, H. Lu, N. Hu, C. Zhang, S. Yang, L. Xie, D. Su, and D. Yu, "Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training," arXiv preprint arXiv:2012.01837, 2020.
[9] M. Zhang, Y. Zhou, L. Zhao, and H. Li, "Transfer learning from speech synthesis to voice conversion with non-parallel training data," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1290–1302, 2021.
[10] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "Autovc: Zero-shot voice style transfer with only autoencoder loss," in International Conference on Machine Learning. PMLR, 2019, pp. 5210–5219.
[11] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," arXiv preprint arXiv:1904.05742, 2019.
[12] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 266–273.
[13] T. Kaneko and H. Kameoka, "Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks," in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
[14] Y. A. Li, A. Zare, and N. Mesgarani, "Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion," in Interspeech, 2021.
[15] J. Lian, C. Zhang, and D. Yu, "Robust disentangled variational speech representation learning for zero-shot voice conversion," in IEEE ICASSP. IEEE, 2022.
[16] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," 2014.
[17] Y. Li and S. Mandt, "Disentangled sequential autoencoder," arXiv preprint.
[18] Y. Zhu, M. R. Min, A. Kadav, and H. P. Graf, "S3vae: Self-supervised sequential vae for representation disentanglement and data generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6538–6547.
[19] J. Bai, W. Wang, and C. P. Gomes, "Contrastively disentangled sequential variational autoencoder," Advances in Neural Information Processing Systems, vol.
[20] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[21] J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[22] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[23] C. Veaux, J. Yamagishi, and K. MacDonald, "Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit," 2017.
[24] L. Van der Maaten and G. Hinton, "Visualizing data using t-sne," Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[26] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "Hubert: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[27] A. Van Den Oord, O. Vinyals et al., "Neural discrete representation learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, "Self-supervised learning with random-projection quantizer for speech recognition," arXiv preprint arXiv:2202.01855, 2022.
[29] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "Wavlm: Large-scale self-supervised pre-training for full stack speech processing," arXiv preprint arXiv:2110.13900, 2021.
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An asr corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[31] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," Stanford, Tech. Rep., 2006.
[32] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
[33] B. Desplanques, J. Thienpondt, and K. Demuynck, "Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification," arXiv preprint arXiv:2005.07143, 2020.
[34] C. Zhang, J. Shi, C. Weng, M. Yu, and D. Yu, "Towards end-to-end speaker diarization with generalized neural speaker clustering," in IEEE ICASSP. IEEE, 2022.