The Singing Voice Conversion Challenge 2023
- URL: http://arxiv.org/abs/2306.14422v2
- Date: Thu, 6 Jul 2023 08:17:31 GMT
- Title: The Singing Voice Conversion Challenge 2023
- Authors: Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi,
Tomoki Toda
- Abstract summary: This year we shifted our focus to singing voice conversion (SVC).
A new database was constructed for two tasks, namely in-domain and cross-domain SVC.
We observed that for both tasks, although human-level naturalness was achieved by the top system, no team was able to obtain a similarity score as high as the target speakers.
- Score: 35.270322663776646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the latest iteration of the voice conversion challenge (VCC)
series, a biennial scientific event aiming to compare and understand different
voice conversion (VC) systems based on a common dataset. This year we shifted
our focus to singing voice conversion (SVC), thus named the challenge the
Singing Voice Conversion Challenge (SVCC). A new database was constructed for
two tasks, namely in-domain and cross-domain SVC. The challenge was run for two
months, and in total we received 26 submissions, including 2 baselines. Through
a large-scale crowd-sourced listening test, we observed that for both tasks,
although human-level naturalness was achieved by the top system, no team was
able to obtain a similarity score as high as the target speakers. Also, as
expected, cross-domain SVC is harder than in-domain SVC, especially in the
similarity aspect. We also investigated whether existing objective measurements
were able to predict perceptual performance, and found that only a few of them
reached a significant correlation.
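As a rough illustration of the correlation analysis mentioned above, the following sketch computes Spearman's rank correlation between an objective metric and crowd-sourced MOS values at the system level. The scores are made up, and the analysis choices (Spearman's rho, alpha = 0.05) are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical illustration of testing whether an objective metric
# predicts perceptual scores; not code from the SVCC 2023 paper.
import numpy as np
from scipy.stats import spearmanr

# One entry per submitted system: a made-up objective score (e.g., a
# speaker-embedding cosine similarity) and the mean opinion score (MOS)
# that system obtained in the listening test.
objective_scores = np.array([0.62, 0.71, 0.58, 0.80, 0.66, 0.74])
mos_scores = np.array([3.1, 3.8, 2.9, 4.2, 3.3, 3.9])

rho, p_value = spearmanr(objective_scores, mos_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")

# Treat the metric as predictive only if the system-level correlation
# is significant at a chosen level, e.g., alpha = 0.05.
if p_value < 0.05:
    print("Significant system-level correlation with MOS")
else:
    print("No significant correlation")
```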
Related papers
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion [12.454955437047573]
We propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC).
We introduce a cycle pitch-shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance (a minimal SSIM sketch follows this entry).
Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance.
arXiv Detail & Related papers (2024-06-09T08:34:01Z)
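The SSIM loss mentioned in the SPA-SVC entry above can be sketched as follows on mel-spectrograms. This is a minimal illustration: the window size, stability constants, and input scaling are assumptions, not values from the paper.

```python
# Minimal SSIM loss on mel-spectrograms; a sketch, not SPA-SVC's code.
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """x, y: (batch, 1, n_mels, frames) spectrograms scaled to [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return 1 - ssim.mean()  # 0 when the two spectrograms match exactly

pred = torch.rand(4, 1, 80, 200)    # converted mel-spectrogram (dummy)
target = torch.rand(4, 1, 80, 200)  # ground-truth mel-spectrogram (dummy)
print(ssim_loss(pred, target))
```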
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders (a fusion sketch follows this entry).
Our proposed approach surpasses the first-place system, establishing a new state-of-the-art cpCER of 29.13% on the evaluation dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
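A hedged sketch of the multi-layer cross-attention fusion idea in the MLCA-AVSR entry above: at every encoder level, audio frames attend to visual frames and the attended features are added back. The dimensions, layer count, and residual fusion rule are assumptions, not the exact MLCA-AVSR architecture.

```python
# Sketch of multi-layer cross-attention fusion between audio and visual
# encoder streams; an illustration, not the authors' implementation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_layers=3):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(num_layers))
        self.visual_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(num_layers))
        # One cross-attention block per level: audio queries, video keys/values.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        for a_layer, v_layer, xattn in zip(
                self.audio_layers, self.visual_layers, self.cross_attn):
            audio = a_layer(audio)
            video = v_layer(video)
            fused, _ = xattn(query=audio, key=video, value=video)
            audio = audio + fused  # inject visual evidence at every level
        return audio

model = CrossModalFusion()
out = model(torch.randn(2, 100, 256), torch.randn(2, 25, 256))  # (2, 100, 256)
```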
- A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 [40.48355334150661]
This paper presents our systems for the singing voice conversion challenge (SVCC) 2023.
For both in-domain and cross-domain English singing voice conversion tasks, we adopt a recognition-synthesis approach with self-supervised learning-based representations (a pipeline sketch follows this entry).
Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC task.
arXiv Detail & Related papers (2023-10-08T15:30:44Z)
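A high-level sketch of the recognition-synthesis pipeline described in the T13 entry above. Every component name here (ssl_encoder, f0_extractor, synthesizer, vocoder) is a hypothetical placeholder for whatever models a concrete system plugs in; this shows the shape of the approach, not the authors' code.

```python
# Recognition-synthesis SVC, sketched as a function over pluggable parts.
def convert(source_wav, target_speaker_emb,
            ssl_encoder, f0_extractor, synthesizer, vocoder):
    # 1. "Recognition": extract largely speaker-independent content
    #    features from the source singing with a self-supervised model.
    content = ssl_encoder(source_wav)          # (T, d_content)

    # 2. Keep the melody: extract the F0 contour from the source
    #    (optionally shifted toward the target speaker's range).
    f0 = f0_extractor(source_wav)              # (T,)

    # 3. "Synthesis": predict acoustic features conditioned on content,
    #    F0, and an embedding of the target speaker.
    mel = synthesizer(content, f0, target_speaker_emb)

    # 4. Vocode the acoustic features back to a waveform.
    return vocoder(mel)
```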
- Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a one-shot SVC model that performs any-to-any conversion robustly, even on distorted singing voices (a distortion-augmentation sketch follows this entry).
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z)
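One plausible route to the robustness on distorted inputs claimed in the entry above is training-time distortion augmentation. The sketch below applies additive noise and hard clipping; the specific distortions and parameter ranges are assumptions, not the paper's recipe.

```python
# Simple waveform distortion augmentation for robust SVC training;
# an illustrative assumption, not the paper's method.
import torch

def distort(wav, noise_snr_db=20.0, clip_level=0.5):
    # Additive Gaussian noise at a target signal-to-noise ratio.
    signal_power = wav.pow(2).mean()
    noise_power = signal_power / (10 ** (noise_snr_db / 10))
    noisy = wav + noise_power.sqrt() * torch.randn_like(wav)
    # Hard clipping to mimic recording-chain saturation.
    return noisy.clamp(-clip_level, clip_level)

wav = torch.randn(16000)  # 1 s of dummy audio at 16 kHz
augmented = distort(wav)  # feed this to the model during training
```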
- The 2021 NIST Speaker Recognition Evaluation [1.5282767384702267]
The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996.
This paper presents an overview of SRE21, including the tasks, performance metric, data, evaluation protocol, results, and system performance analyses (a metric sketch follows this entry).
arXiv Detail & Related papers (2022-04-21T16:18:52Z)
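For context on the kind of performance metric reported in evaluations like SRE21, below is a minimal equal error rate (EER) computation over verification trial scores. NIST's official scoring centers on a detection cost function, so this is a simplified illustrative stand-in, not the SRE21 scoring tool.

```python
# Minimal EER from speaker-verification trial scores; illustrative only.
import numpy as np

def equal_error_rate(scores, labels):
    """scores: higher = more likely same speaker; labels: 1 target, 0 nontarget."""
    order = np.argsort(scores)[::-1]       # sweep threshold from high to low
    labels = np.asarray(labels)[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    false_alarms = np.cumsum(1 - labels) / n_nontarget  # rises as we accept more
    misses = 1 - np.cumsum(labels) / n_target           # falls as we accept more
    idx = np.argmin(np.abs(misses - false_alarms))      # crossing point ~ EER
    return (misses[idx] + false_alarms[idx]) / 2

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0])
print(f"EER ~ {equal_error_rate(scores, labels):.3f}")
```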
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- NTIRE 2021 Multi-modal Aerial View Object Classification Challenge [88.89190054948325]
We introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR.
This challenge is composed of two different tracks using electro-optical (EO) and synthetic aperture radar (SAR) imagery.
We discuss the top methods submitted for this competition and evaluate their results on our blind test set.
arXiv Detail & Related papers (2021-07-02T16:55:08Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance (a fusion sketch follows this entry).
This approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
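A hedged sketch of the fragment extraction-and-fusion idea in the FragmentVC entry above: source content features (e.g., from Wav2Vec 2.0) query target-speaker frames through cross-attention, and the attended mixture is decoded into a converted spectrogram. The projections, dimensions, and single-layer decoder are assumptions, not FragmentVC's exact architecture.

```python
# Cross-attention fragment fusion in the spirit of FragmentVC; a sketch.
import torch
import torch.nn as nn

class FragmentFuser(nn.Module):
    def __init__(self, d_content=768, d_model=256, n_mels=80):
        super().__init__()
        self.content_proj = nn.Linear(d_content, d_model)
        self.target_proj = nn.Linear(n_mels, d_model)
        # Source frames query the target utterance for matching fragments.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)
        self.decoder = nn.Linear(d_model, n_mels)

    def forward(self, src_content, tgt_mel):
        # src_content: (B, Ts, d_content) Wav2Vec-style source features
        # tgt_mel:     (B, Tt, n_mels) target-speaker spectrogram frames
        q = self.content_proj(src_content)
        kv = self.target_proj(tgt_mel)
        fused, weights = self.attn(q, kv, kv)  # pick and blend fragments
        return self.decoder(fused), weights    # converted mel + alignment

model = FragmentFuser()
mel, attn = model(torch.randn(1, 120, 768), torch.randn(1, 300, 80))
```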
- Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition [10.74796391075403]
We present a variant of AV Align where the recurrent Long Short-term Memory (LSTM) block is replaced by the more recently proposed Transformer block.
We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model.
arXiv Detail & Related papers (2020-05-19T09:06:39Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.