A Comparative Study of Self-supervised Speech Representation Based Voice
Conversion
- URL: http://arxiv.org/abs/2207.04356v1
- Date: Sun, 10 Jul 2022 01:02:22 GMT
- Title: A Comparative Study of Self-supervised Speech Representation Based Voice
Conversion
- Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda
- Abstract summary: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC)
We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision.
We also studied the effect of a post-discretization process with k-means and showed how it improves performance in the A2A setting.
- Score: 47.250866153881645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a large-scale comparative study of self-supervised speech
representation (S3R)-based voice conversion (VC). In the context of
recognition-synthesis VC, S3Rs are attractive owing to their potential to
replace expensive supervised representations such as phonetic posteriorgrams
(PPGs), which are commonly adopted by state-of-the-art VC systems. Using
S3PRL-VC, an open-source VC software we previously developed, we provide a
series of in-depth objective and subjective analyses under three VC settings:
intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the voice
conversion challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in
various aspects, including model type, multilinguality, and supervision. We
also studied the effect of a post-discretization process with k-means
clustering and showed how it improves performance in the A2A setting. Finally,
a comparison with state-of-the-art VC systems demonstrates the competitiveness
of S3R-based VC and sheds light on possible directions for improvement.
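The post-discretization step described in the abstract replaces continuous S3R feature frames with discrete k-means cluster indices (or their centroids). The following is a minimal NumPy-only sketch of that idea; the feature dimension, cluster count, and iteration count are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def kmeans_discretize(features, n_clusters=100, n_iter=20, seed=0):
    """Quantize frame-level features to k-means cluster indices."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned frames;
        # keep the old centroid if a cluster received no frames.
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return labels, centroids

# Stand-in for frame-level S3R features: (n_frames, feature_dim).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))
labels, centroids = kmeans_discretize(features, n_clusters=16)
# Each frame is now a discrete unit index; centroids give the
# quantized feature used in place of the continuous representation.
print(labels.shape, centroids.shape)
```

In practice one would fit k-means on features extracted by the upstream S3R model (e.g. via S3PRL) rather than on random data; the point here is only the discretization mechanism.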
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvement in both models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Conditional Deep Hierarchical Variational Autoencoder for Voice
Conversion [5.538544897623972]
Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speeches and speaker labels for training.
This paper investigates the benefits and impacts of increasing model expressiveness on VAE-VC.
arXiv Detail & Related papers (2021-12-06T05:54:11Z) - S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised
Speech Representations [124.2620985250939]
This paper introduces S3PRL-VC, an open-source voice conversion framework based on the S3PRL toolkit.
In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC 2020.
We show that S3R-based VC is comparable with the VCC 2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art results among S3R-based A2A VC systems.
arXiv Detail & Related papers (2021-10-12T19:01:52Z) - Assem-VC: Realistic Voice Conversion by Assembling Modern Speech
Synthesis Techniques [3.3946853660795893]
We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA fine-tuning in VC, which significantly improves the quality and the speaker similarity of the outputs.
arXiv Detail & Related papers (2021-04-02T08:18:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.