A Comparative Study of Self-supervised Speech Representation Based Voice
Conversion
- URL: http://arxiv.org/abs/2207.04356v1
- Date: Sun, 10 Jul 2022 01:02:22 GMT
- Title: A Comparative Study of Self-supervised Speech Representation Based Voice
Conversion
- Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda
- Abstract summary: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC)
We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision.
We also studied the effect of a post-discretization process with k-means and showed how it improves performance in the A2A setting.
- Score: 47.250866153881645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a large-scale comparative study of self-supervised speech
representation (S3R)-based voice conversion (VC). In the context of
recognition-synthesis VC, S3Rs are attractive owing to their potential to
replace expensive supervised representations such as phonetic posteriorgrams
(PPGs), which are commonly adopted by state-of-the-art VC systems. Using
S3PRL-VC, an open-source VC software we previously developed, we provide a
series of in-depth objective and subjective analyses under three VC settings:
intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the voice
conversion challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in
various aspects, including model type, multilinguality, and supervision. We
also studied the effect of a post-discretization process with k-means
clustering and showed how it improves performance in the A2A setting. Finally,
a comparison with state-of-the-art VC systems demonstrates the competitiveness
of S3R-based VC and sheds light on possible directions for improvement.
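The post-discretization step described in the abstract replaces continuous S3R feature frames with discrete k-means cluster indices (or their centroids). The following is a minimal NumPy-only sketch of that idea; the feature dimension, cluster count, and iteration count are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def kmeans_discretize(features, n_clusters=100, n_iter=20, seed=0):
    """Quantize frame-level features to k-means cluster indices."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned frames;
        # keep the old centroid if a cluster received no frames.
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return labels, centroids

# Stand-in for frame-level S3R features: (n_frames, feature_dim).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))
labels, centroids = kmeans_discretize(features, n_clusters=16)
# Each frame is now a discrete unit index; centroids give the
# quantized feature used in place of the continuous representation.
print(labels.shape, centroids.shape)
```

In practice one would fit k-means on features extracted by the upstream S3R model (e.g. via S3PRL) rather than on random data; the point here is only the discretization mechanism.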
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvement in both models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Conditional Deep Hierarchical Variational Autoencoder for Voice
Conversion [5.538544897623972]
Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speeches and speaker labels for training.
This paper investigates the benefits and impacts of increasing model expressiveness on VAE-VC.
arXiv Detail & Related papers (2021-12-06T05:54:11Z) - S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised
Speech Representations [124.2620985250939]
This paper introduces S3PRL-VC, an open-source voice conversion framework based on the S3PRL toolkit.
In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC 2020.
We show that S3R-based VC is comparable with the VCC 2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art results among S3R-based A2A VC systems.
arXiv Detail & Related papers (2021-10-12T19:01:52Z) - Assem-VC: Realistic Voice Conversion by Assembling Modern Speech
Synthesis Techniques [3.3946853660795893]
We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA fine-tuning in VC, which significantly improves the quality and the speaker similarity of the outputs.
arXiv Detail & Related papers (2021-04-02T08:18:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.