Singer Identity Representation Learning using Self-Supervised Techniques
- URL: http://arxiv.org/abs/2401.05064v1
- Date: Wed, 10 Jan 2024 10:41:38 GMT
- Title: Singer Identity Representation Learning using Self-Supervised Techniques
- Authors: Bernardo Torres, Stefan Lattner and Gaël Richard
- Abstract summary: We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Significant strides have been made in creating voice identity representations
using speech data. However, the same level of progress has not been achieved
for singing voices. To bridge this gap, we suggest a framework for training
singer identity encoders to extract representations suitable for various
singing-related tasks, such as singing voice similarity and synthesis. We
explore different self-supervised learning techniques on a large collection of
isolated vocal tracks and apply data augmentations during training to ensure
that the representations are invariant to pitch and content variations. We
evaluate the quality of the resulting representations on singer similarity and
identification tasks across multiple datasets, with a particular emphasis on
out-of-domain generalization. Our proposed framework produces high-quality
embeddings that outperform both speaker verification and wav2vec 2.0
pre-trained baselines on singing voice while operating at 44.1 kHz. We release
our code and trained models to facilitate further research on singing voice and
related areas.
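To make the training recipe concrete, here is a minimal contrastive-learning sketch under stated assumptions: the SimCLR-style NT-Xent loss, torchaudio pitch shifting, and the fixed ±2 semitone shifts are illustrative choices, not the paper's exact configuration. The paper compares several self-supervised objectives, so the released code is the authoritative reference.

```python
# Minimal sketch of contrastive singer-identity training (illustrative only).
# Assumes: 44.1 kHz isolated vocal clips, a SimCLR-style NT-Xent loss, and
# pitch-shift augmentation to encourage pitch invariance. The encoder and
# hyperparameters are placeholders, not the paper's actual setup.
import torch
import torch.nn.functional as F
import torchaudio

SAMPLE_RATE = 44_100

def augment(wave: torch.Tensor, semitones: int) -> torch.Tensor:
    """Pitch-shift a waveform so positives differ in pitch, not identity."""
    return torchaudio.functional.pitch_shift(wave, SAMPLE_RATE, n_steps=semitones)

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss over a batch of paired embeddings."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, D)
    sim = z @ z.t() / tau                         # pairwise cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    targets = torch.arange(n, device=z.device).roll(n // 2)
    return F.cross_entropy(sim, targets)

def train_step(encoder, batch, optimizer):
    # Two pitch-shifted views of each vocal clip form a positive pair, so the
    # encoder is pushed to keep singer identity while discarding pitch.
    v1 = augment(batch, semitones=+2)
    v2 = augment(batch, semitones=-2)
    loss = nt_xent(encoder(v1), encoder(v2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```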
Related papers
- GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks [52.30565320125514]
GTSinger is a large, global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores.
We collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset.
We conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion.
arXiv Detail & Related papers (2024-09-20T18:18:14Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythmic naturalness of the synthesized voice (a sketch of one such regulator follows this entry).
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
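The summary does not specify the regulator's form. As a point of reference, a common differentiable approach is Gaussian upsampling (in the style of Non-Attentive Tacotron), sketched below; this is an assumption about the mechanism, not this paper's implementation.

```python
# Hedged sketch of a differentiable duration regulator via Gaussian upsampling.
# Gradients flow through the soft durations d, unlike hard repeat-based
# length regulation. Assumed mechanism, not the paper's actual module.
import torch

def gaussian_upsample(h: torch.Tensor, d: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Expand token features h (B, N, D) to frames using soft durations d (B, N)."""
    centers = torch.cumsum(d, dim=1) - 0.5 * d           # (B, N) token midpoints
    n_frames = int(d.sum(dim=1).max().item())            # frames to generate
    t = torch.arange(n_frames, device=h.device)          # frame indices
    # Squared distance of every frame to every token center: (B, T, N)
    dist = (t[None, :, None] - centers[:, None, :]) ** 2
    w = torch.softmax(-dist / (2.0 * sigma**2), dim=2)   # soft alignment weights
    return w @ h                                         # (B, T, D) frame features
```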
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Audiovisual Singing Voice Separation [25.862550744570324]
The video model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework (a fusion sketch follows this entry).
We create two audiovisual singing performance datasets for training and evaluation.
The proposed method outperforms audio-based methods in terms of separation quality on most test recordings.
arXiv Detail & Related papers (2021-07-01T06:04:53Z)
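A minimal sketch of the kind of mid-level fusion described above, assuming lip-region embeddings are time-aligned to the audio frames and concatenated at the separator's bottleneck; module names and dimensions are illustrative, not the paper's exact architecture.

```python
# Illustrative audiovisual fusion for vocal separation: concatenate a video
# (mouth-movement) embedding with the audio bottleneck features, then project
# back to the separator's feature width. Dimensions are assumptions.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, audio_dim: int = 512, video_dim: int = 128):
        super().__init__()
        self.project = nn.Linear(audio_dim + video_dim, audio_dim)

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, T, audio_dim); video_feat: (B, T, video_dim),
        # upsampled beforehand so the video frame rate matches the audio frames.
        fused = torch.cat([audio_feat, video_feat], dim=-1)
        return self.project(fused)
```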
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features for an unseen target singer (a conditioning sketch follows this entry).
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
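A hedged sketch of that conditioning pattern: the decoder consumes content latents concatenated with a target-singer embedding and a frame-level F0 contour. The GRU decoder and layer sizes are assumptions for illustration, not the VAW-GAN paper's exact model.

```python
# Illustrative decoder conditioning for voice conversion: concatenate content
# latents with a singer embedding (broadcast over time) and per-frame F0.
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, latent_dim: int = 64, singer_dim: int = 32, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + singer_dim + 1, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, z: torch.Tensor, singer_emb: torch.Tensor, f0: torch.Tensor) -> torch.Tensor:
        # z: (B, T, latent_dim) content latents from the encoder
        # singer_emb: (B, singer_dim) identity vector, broadcast over time
        # f0: (B, T) fundamental-frequency contour, one value per frame
        T = z.size(1)
        cond = singer_emb.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([z, cond, f0.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # (B, T, n_mels) spectral features
```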
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Addressing the confounds of accompaniments in singer identification [29.949390919663596]
We employ Open-Unmix, an open-source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music.
We then investigate two ways to train a singer identification model: learning from the separated vocals only, or from an augmented set of data (a sketch of the two regimes follows this entry).
arXiv Detail & Related papers (2020-02-17T07:49:21Z)
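A hedged sketch of how those two training regimes could be assembled. The directory layout, file naming, and the composition of the augmented set (original mixtures plus separated vocals) are illustrative guesses, not the paper's exact protocol.

```python
# Hypothetical assembly of the two singer-ID training regimes described above:
# (a) separated vocals only, (b) an augmented set that also keeps the
# original mixtures. Paths and set composition are illustrative.
from pathlib import Path

def build_training_set(data_root: str, augmented: bool) -> list[Path]:
    root = Path(data_root)
    vocals = sorted(root.glob("separated_vocals/*.wav"))   # separator output
    if not augmented:
        return vocals                                      # regime (a)
    mixtures = sorted(root.glob("original_mixtures/*.wav"))
    return vocals + mixtures                               # regime (b)
```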
- Speech-to-Singing Conversion in an Encoder-Decoder Framework [38.111942306157545]
We take a learning-based approach to the problem of converting spoken lines into sung ones.
We learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker.
arXiv Detail & Related papers (2020-02-16T15:33:41Z)