S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised
Speech Representations
- URL: http://arxiv.org/abs/2110.06280v1
- Date: Tue, 12 Oct 2021 19:01:52 GMT
- Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji
Watanabe, Tomoki Toda
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC)
framework based on the S3PRL toolkit. In the context of recognition-synthesis
VC, self-supervised speech representation (S3R) is valuable in its potential to
replace the expensive supervised representation adopted by state-of-the-art VC
systems. Moreover, we claim that VC is a good probing task for S3R analysis. In
this work, we provide a series of in-depth analyses by benchmarking on the two
tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as
an any-to-any (A2A) setting. We also provide comparisons between not only
different S3Rs but also top systems in VCC2020 with supervised representations.
Systematic objective and subjective evaluations were conducted, and we show that
S3R is comparable with the VCC2020 top systems in the A2O setting in terms of
similarity, and achieves state-of-the-art performance in S3R-based A2A VC. We believe the
extensive analysis, as well as the toolkit itself, contribute to not only the
S3R community but also the VC community. The codebase is now open-sourced.
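The recognition-synthesis pipeline in the abstract can be pictured as two stages: a frozen self-supervised upstream that extracts frame-level features from source speech, and a small trainable downstream decoder that maps those features to the target speaker's acoustic features (the "one" in any-to-one). The sketch below is purely illustrative; the function names, feature dimensions, and the random-projection stand-in for the upstream are assumptions, not the actual S3PRL-VC API.

```python
import numpy as np

# Hypothetical sketch of recognition-synthesis VC with an S3R upstream.
# All names and shapes are illustrative, not the real S3PRL-VC interface.

rng = np.random.default_rng(0)

def upstream_s3r(waveform, frame=160, dim=768):
    """Stand-in for a frozen S3R model: turns raw samples into one
    feature vector per frame via a fixed random projection."""
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    proj = rng.standard_normal((frame, dim)) / np.sqrt(frame)
    return frames @ proj  # (n_frames, dim)

def downstream_decoder(features, w):
    """Trainable linear decoder from S3R features to 80-bin mel frames
    of a single target speaker (any-to-one)."""
    return features @ w  # (n_frames, 80)

source_wave = rng.standard_normal(16000)    # 1 s of "speech" at 16 kHz
feats = upstream_s3r(source_wave)           # recognition step
w = rng.standard_normal((768, 80)) * 0.01   # decoder weights
mel = downstream_decoder(feats, w)          # synthesis step
print(feats.shape, mel.shape)               # (100, 768) (100, 80)
```

In practice the mel output would be passed to a neural vocoder, and in the A2A setting the decoder would additionally condition on a target-speaker embedding; both are omitted here for brevity.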
Related papers
- Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition [43.081758770899235]
We present RASR2, a research-oriented generic S2S decoder implemented in C++.
It offers a strong flexibility/compatibility for various S2S models, language models, label units/topologies and neural network architectures.
It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings.
arXiv Detail & Related papers (2023-05-28T17:48:48Z)
- Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z)
- Self-supervised Learning by View Synthesis [62.27092994474443]
We present view-synthesis autoencoders (VSA), a self-supervised learning framework designed for vision transformers.
In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose.
arXiv Detail & Related papers (2023-04-22T06:12:13Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Non-Parallel Voice Conversion for ASR Augmentation [23.95732033698818]
Voice conversion can be used as a data augmentation technique to improve ASR performance.
Despite including many speakers, speaker diversity may remain a limitation to ASR quality.
arXiv Detail & Related papers (2022-09-15T00:40:35Z)
- A Comparative Study of Self-supervised Speech Representation Based Voice Conversion [47.250866153881645]
We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC).
We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision.
We also studied the effect of a post-discretization process with k-means and showed how it improves performance in the A2A setting.
arXiv Detail & Related papers (2022-07-10T01:02:22Z)
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [3.3946853660795893]
We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA fine-tuning for VC, which significantly improves the quality and speaker similarity of the outputs.
arXiv Detail & Related papers (2021-04-02T08:18:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.