S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised
Speech Representations
- URL: http://arxiv.org/abs/2110.06280v1
- Date: Tue, 12 Oct 2021 19:01:52 GMT
- Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji
Watanabe, Tomoki Toda
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC)
framework based on the S3PRL toolkit. In the context of recognition-synthesis
VC, self-supervised speech representation (S3R) is valuable in its potential to
replace the expensive supervised representation adopted by state-of-the-art VC
systems. Moreover, we claim that VC is a good probing task for S3R analysis. In
this work, we provide a series of in-depth analyses by benchmarking on the two
tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as
an any-to-any (A2A) setting. We also provide comparisons between not only
different S3Rs but also top systems in VCC2020 with supervised representations.
Systematic objective and subjective evaluations were conducted, and we show that
S3R is comparable with the VCC2020 top systems in the A2O setting in terms of
similarity, and achieves state-of-the-art performance in S3R-based A2A VC. We believe the
extensive analysis, as well as the toolkit itself, contribute to not only the
S3R community but also the VC community. The codebase is now open-sourced.
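The recognition-synthesis pipeline in the abstract can be pictured as two stages: a frozen self-supervised upstream that extracts frame-level features from source speech, and a small trainable downstream decoder that maps those features to the target speaker's acoustic features (the "one" in any-to-one). The sketch below is purely illustrative; the function names, feature dimensions, and the random-projection stand-in for the upstream are assumptions, not the actual S3PRL-VC API.

```python
import numpy as np

# Hypothetical sketch of recognition-synthesis VC with an S3R upstream.
# All names and shapes are illustrative, not the real S3PRL-VC interface.

rng = np.random.default_rng(0)

def upstream_s3r(waveform, frame=160, dim=768):
    """Stand-in for a frozen S3R model: turns raw samples into one
    feature vector per frame via a fixed random projection."""
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    proj = rng.standard_normal((frame, dim)) / np.sqrt(frame)
    return frames @ proj  # (n_frames, dim)

def downstream_decoder(features, w):
    """Trainable linear decoder from S3R features to 80-bin mel frames
    of a single target speaker (any-to-one)."""
    return features @ w  # (n_frames, 80)

source_wave = rng.standard_normal(16000)    # 1 s of "speech" at 16 kHz
feats = upstream_s3r(source_wave)           # recognition step
w = rng.standard_normal((768, 80)) * 0.01   # decoder weights
mel = downstream_decoder(feats, w)          # synthesis step
print(feats.shape, mel.shape)               # (100, 768) (100, 80)
```

In practice the mel output would be passed to a neural vocoder, and in the A2A setting the decoder would additionally condition on a target-speaker embedding; both are omitted here for brevity.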
Related papers
- Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition [43.081758770899235]
We present RASR2, a research-oriented generic S2S decoder implemented in C++.
It offers a strong flexibility/compatibility for various S2S models, language models, label units/topologies and neural network architectures.
It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings.
arXiv Detail & Related papers (2023-05-28T17:48:48Z)
- Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z)
- Self-supervised Learning by View Synthesis [62.27092994474443]
We present view-synthesis autoencoders (VSA), a self-supervised learning framework designed for vision transformers.
In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose.
arXiv Detail & Related papers (2023-04-22T06:12:13Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Non-Parallel Voice Conversion for ASR Augmentation [23.95732033698818]
Voice conversion can be used as a data augmentation technique to improve ASR performance.
Despite including many speakers, speaker diversity may remain a limitation to ASR quality.
arXiv Detail & Related papers (2022-09-15T00:40:35Z)
- A Comparative Study of Self-supervised Speech Representation Based Voice Conversion [47.250866153881645]
We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC).
We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision.
We also studied the effect of a post-discretization process with k-means and showed how it improves performance in the A2A setting.
arXiv Detail & Related papers (2022-07-10T01:02:22Z)
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [3.3946853660795893]
We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA fine-tuning for VC, which significantly improves the quality and speaker similarity of the outputs.
arXiv Detail & Related papers (2021-04-02T08:18:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.