A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023
- URL: http://arxiv.org/abs/2310.05203v1
- Date: Sun, 8 Oct 2023 15:30:44 GMT
- Title: A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023
- Authors: Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
- Abstract summary: This paper presents our systems for the singing voice conversion challenge (SVCC) 2023.
For both in-domain and cross-domain English singing voice conversion tasks, we adopt a recognition-synthesis approach with self-supervised learning-based representation.
Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC.
- Score: 40.48355334150661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents our systems (denoted as T13) for the singing voice
conversion challenge (SVCC) 2023. For both in-domain and cross-domain English
singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a
recognition-synthesis approach with self-supervised learning-based
representation. To achieve data-efficient SVC with a limited amount of target
singer/speaker's data (150 to 160 utterances for SVCC 2023), we first train a
diffusion-based any-to-any voice conversion model on a publicly available
large-scale corpus of 750 hours of speech and singing data. Then, we fine-tune
the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening
tests conducted by SVCC 2023 show that our T13 system achieves competitive
naturalness and speaker similarity for the harder cross-domain SVC (Task 2),
which implies the generalization ability of our proposed method. Our objective
evaluation results show that using large datasets is particularly beneficial
for cross-domain SVC.
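The pipeline described above is concrete enough to sketch. Below is a minimal, hypothetical outline of one recognition-synthesis SVC training step: a frozen SSL encoder supplies content features, and a stand-in "diffusion" decoder reconstructs mel frames conditioned on those features and a speaker embedding. The module names, shapes, and the simplified one-step denoising objective are all assumptions for illustration, not the T13 implementation.

```python
# Minimal sketch of a recognition-synthesis SVC training step (hypothetical;
# not the T13 code). A pretrained SSL encoder provides content features; a
# stand-in "diffusion" decoder is trained to reconstruct mel frames from a
# noised input, conditioned on content and a speaker embedding. The full
# diffusion objective is simplified to a single denoising step here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, content_dim=768, spk_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + content_dim + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, noisy_mel, content, spk_emb):
        # Broadcast the utterance-level speaker embedding over time.
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.net(torch.cat([noisy_mel, content, spk], dim=-1))

def train_step(decoder, ssl_encoder, wav, mel, spk_emb, opt):
    # Assumes ssl_encoder returns (B, T, content_dim) aligned with mel frames.
    with torch.no_grad():
        content = ssl_encoder(wav)           # frozen SSL representation
    noisy = mel + torch.randn_like(mel)      # crude stand-in for diffusion noising
    pred = decoder(noisy, content, spk_emb)
    loss = F.l1_loss(pred, mel)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Pretraining runs this step over the 750-hour speech+singing corpus with many
# speakers (any-to-any); fine-tuning reuses it on a single target's 150 to 160
# utterances, typically at a lower learning rate.
```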
Related papers
- Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 [8.940008511570207]
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices.
The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task.
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER); a brief sketch of the EER metric follows this entry.
arXiv Detail & Related papers (2024-09-03T21:28:45Z)
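Since the entry above is ranked by its pooled EER, here is a generic sketch of how a pooled EER can be computed: scores from all conditions are pooled into a single threshold sweep, and the EER is read off where the false-rejection and false-acceptance rates cross. This is the textbook metric, not the challenge's official scorer.

```python
# Generic pooled EER computation (not the CtrSVDD official scorer).
import numpy as np

def pooled_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 = bona fide, 0 = deepfake; scores: higher = more bona fide."""
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # FRR: bona fide rejected (score below threshold); FAR: fakes accepted.
    frr = np.cumsum(labels) / n_pos
    far = 1.0 - np.cumsum(1 - labels) / n_neg
    idx = np.argmin(np.abs(frr - far))  # point where the two rates cross
    return float((frr[idx] + far[idx]) / 2)

# Example: scores from several synthesis systems, pooled and scored together.
scores = np.array([0.9, 0.8, 0.75, 0.3, 0.2, 0.6])
labels = np.array([1, 1, 1, 0, 0, 0])
print(pooled_eer(scores, labels))  # 0.0 for perfectly separable scores
```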
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance (a sketch of the SSIM loss follows this entry).
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
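As referenced in the SPA-SVC entry above, an SSIM loss compares predicted and target spectrograms by luminance, contrast, and structure rather than pointwise error. A per-utterance (non-windowed) variant is sketched below; the paper's exact formulation and scaling are not shown here, so treat the shapes and constants as assumptions.

```python
# Hedged sketch of an SSIM loss between predicted and target mel spectrograms.
import torch

def ssim_loss(x: torch.Tensor, y: torch.Tensor, c1=0.01**2, c2=0.03**2):
    """x, y: (B, T, n_mels) spectrograms scaled to roughly [0, 1].
    Returns 1 - SSIM so that lower is better (a loss)."""
    mu_x, mu_y = x.mean(dim=(1, 2)), y.mean(dim=(1, 2))
    var_x, var_y = x.var(dim=(1, 2)), y.var(dim=(1, 2))
    cov = ((x - mu_x[:, None, None]) * (y - mu_y[:, None, None])).mean(dim=(1, 2))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    )
    return (1 - ssim).mean()
```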
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders (a rough sketch follows this entry).
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
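The MLCA-AVSR entry above fuses modalities at multiple encoder depths via cross-attention. A rough sketch under assumed shapes (all names hypothetical, not the paper's code):

```python
# Rough sketch of multi-layer cross-attention fusion for AVSR.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Each modality attends to the other at a given encoder depth."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # Audio queries attend to visual keys/values, and vice versa;
        # residual connections keep each modality's own stream intact.
        a, _ = self.a2v(audio, visual, visual)
        v, _ = self.v2a(visual, audio, audio)
        return audio + a, visual + v

# Fusing at several encoder levels rather than only once at the end:
layers = nn.ModuleList(CrossModalFusion() for _ in range(3))
audio = torch.randn(2, 120, 256)   # (batch, audio frames, dim)
visual = torch.randn(2, 30, 256)   # (batch, video frames, dim)
for fuse in layers:
    # ... per-modality encoder blocks would go here ...
    audio, visual = fuse(audio, visual)
```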
- The GUA-Speech System Description for CNVSRC Challenge 2023 [8.5257557043542]
This study describes our system for the Task 1 Single-speaker Visual Speech Recognition (VSR) fixed track in the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
We use intermediate connectionist temporal classification (Inter CTC) residual modules to relax the conditional independence assumption of CTC in our model (a sketch follows this entry).
We also use a bi-transformer decoder to enable the model to capture both past and future contextual information.
arXiv Detail & Related papers (2023-12-12T13:35:33Z)
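The Inter CTC residual modules named in the GUA-Speech entry above attach auxiliary CTC heads to intermediate encoder layers and feed the intermediate posteriors back into the encoder stream, which is how CTC's conditional independence assumption gets relaxed. A hedged sketch follows; the layer count, dimensions, and shared projection head are assumptions, not the GUA-Speech configuration.

```python
# Hedged sketch of intermediate CTC (Inter CTC) with residual feedback.
import torch
import torch.nn as nn

class InterCTCEncoder(nn.Module):
    def __init__(self, dim=256, vocab=100, n_layers=6, inter_at=(3,)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.ctc_head = nn.Linear(dim, vocab)   # shared CTC projection
        self.back_proj = nn.Linear(vocab, dim)  # residual feedback path
        self.inter_at = set(inter_at)

    def forward(self, x):                        # x: (B, T, dim)
        inter_logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.inter_at:
                logits = self.ctc_head(x)
                inter_logits.append(logits)
                # Feed intermediate posteriors back into the encoder stream.
                x = x + self.back_proj(logits.softmax(dim=-1))
        return self.ctc_head(x), inter_logits

# During training (input/target lengths omitted for brevity):
#   final_logits, inters = model(feats)
#   loss = ctc(final_logits, ...) + w * sum(ctc(l, ...) for l in inters)
# where w is a hyperparameter (0.3 is common in the Inter CTC literature).
```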
- The Singing Voice Conversion Challenge 2023 [35.270322663776646]
This year, we shifted our focus to singing voice conversion (SVC).
A new database was constructed for two tasks, namely in-domain and cross-domain SVC.
We observed that for both tasks, although human-level naturalness was achieved by the top system, no team was able to obtain a similarity score as high as that of the target speakers.
arXiv Detail & Related papers (2023-06-26T05:04:58Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice [34.553128768223615]
We show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers (a simple illustrative sketch follows this entry).
We demonstrate this on a human-curated dataset collected and calibrated in clinical settings.
arXiv Detail & Related papers (2022-04-11T00:19:14Z)
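To illustrate the claim in the entry above, a baseline of this kind needs only standard acoustic features and an off-the-shelf linear classifier. The feature recipe (utterance-level MFCC statistics) and the logistic-regression choice below are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch: standard features + a simple binary classifier,
# instead of a data-intensive neural network. Not the paper's exact recipe.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mfcc_features(path: str, sr: int = 16000) -> np.ndarray:
    """One fixed-length vector per recording: MFCC means and stds."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # (20, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train(paths, labels):
    # paths/labels would come from a clinically curated dataset, as in the paper.
    X = np.stack([mfcc_features(p) for p in paths])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```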
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)