Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic
Singing Voice Understanding Tasks: Three Case Studies
- URL: http://arxiv.org/abs/2306.12714v2
- Date: Tue, 5 Sep 2023 06:20:11 GMT
- Title: Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic
Singing Voice Understanding Tasks: Three Case Studies
- Authors: Yuya Yamamoto
- Abstract summary: Self-supervised learning models (SSL models) have been trained using large amounts of unlabeled data in the field of speech processing and music classification.
We report the results of experiments comparing SSL models on three different tasks (i.e., singer identification, singing voice transcription, and singing technique classification) as an initial exploration and discuss these findings.
- Score: 1.2691047660244337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic singing voice understanding tasks, such as singer identification,
singing voice transcription, and singing technique classification, benefit from
data-driven approaches that utilize deep learning techniques. These approaches
work well even under the rich diversity of vocal and noisy samples owing to
their representation ability. However, the limited availability of labeled data
remains a significant obstacle to achieving satisfactory performance. In recent
years, self-supervised learning models (SSL models) have been trained using
large amounts of unlabeled data in the field of speech processing and music
classification. By fine-tuning these models on the target tasks, performance
comparable to conventional supervised learning can be achieved with limited
training data. Therefore, in this paper, we investigate the effectiveness of
SSL models for various singing voice understanding tasks. As an initial
exploration, we report the results of experiments comparing SSL models on three
different tasks (i.e., singer identification, singing voice transcription, and
singing technique classification) and discuss these findings.
Experimental results show that each SSL model achieves performance comparable
to, and sometimes better than, state-of-the-art methods on each task. We
also conducted a layer-wise analysis to further understand the behavior of the
SSL models.
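A minimal sketch of this general recipe, assuming the Hugging Face transformers library and a HuBERT checkpoint: a frozen SSL frontend, a learnable softmax-weighted sum over its hidden layers, and a linear classification head, in the SUPERB style of layer-wise probing. The checkpoint name, mean pooling, and 20-singer class count are illustrative assumptions, not the paper's published setup.

```python
# Hedged sketch: frozen SSL frontend + learnable layer weights + linear head.
# Assumes `pip install torch transformers`; checkpoint and class count are
# illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn
from transformers import AutoModel

class LayerWeightedProbe(nn.Module):
    def __init__(self, checkpoint="facebook/hubert-base-ls960", num_classes=20):
        super().__init__()
        self.frontend = AutoModel.from_pretrained(checkpoint)
        self.frontend.requires_grad_(False)   # frontend stays frozen
        n_layers = self.frontend.config.num_hidden_layers + 1  # + CNN output
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Linear(self.frontend.config.hidden_size, num_classes)

    def forward(self, waveform):
        # waveform: (batch, samples) of 16 kHz mono audio
        with torch.no_grad():
            out = self.frontend(waveform, output_hidden_states=True)
        stacked = torch.stack(out.hidden_states)      # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0)  # per-layer importances
        pooled = (w[:, None, None, None] * stacked).sum(0).mean(1)  # (B, D)
        return self.head(pooled)                      # (B, num_classes)

probe = LayerWeightedProbe()
logits = probe(torch.randn(2, 16000))  # two one-second dummy clips
```

After training, inspecting the learned softmax weights gives a rough layer-wise picture of which frontend layers carry singer or technique information, analogous to the layer-wise analysis mentioned in the abstract.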
Related papers
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- On the Effectiveness of Speech Self-supervised Learning for Music [45.43336822496942]
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications.
We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT.
Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech.
arXiv Detail & Related papers (2023-07-11T10:37:57Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that their weaker performance on utterance-level tasks is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? [86.53044183309824]
We study which factor leads to the success of self-supervised learning on speaker-related tasks.
Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the speaker verification (SV) task comes from a combination of the masked speech prediction loss, data scale, and model size.
arXiv Detail & Related papers (2022-04-27T08:35:57Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Different self-supervised learning (SSL) tasks reveal different features of the data.
This work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers; a generic sketch of this idea appears after this list.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
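As a concrete illustration of the intermediate-layer supervision idea from the ILS-SSL entry above, the generic sketch below applies the same masked-prediction loss at selected intermediate layers as well as at the top of the encoder. The tiny encoder, supervised-layer choices, and pseudo-label targets are assumptions for illustration, not the paper's configuration.

```python
# Generic sketch of intermediate-layer SSL supervision (in the spirit of
# ILS-SSL). Encoder size, supervised layers, and targets are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateSupervisedEncoder(nn.Module):
    def __init__(self, dim=256, n_layers=8, n_targets=100, supervised=(4, 8)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.predictor = nn.Linear(dim, n_targets)  # shared prediction head
        self.supervised = set(supervised)           # layers that get a loss

    def forward(self, x, labels, mask):
        # x: (B, T, dim) frame features; labels: (B, T) pseudo-label ids;
        # mask: (B, T) bool, True where frames were masked for prediction.
        losses = []
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i in self.supervised:
                logits = self.predictor(x[mask])        # (n_masked, n_targets)
                losses.append(F.cross_entropy(logits, labels[mask]))
        return torch.stack(losses).mean()

enc = IntermediateSupervisedEncoder()
x = torch.randn(2, 50, 256)
labels = torch.randint(0, 100, (2, 50))
mask = torch.rand(2, 50) < 0.5
loss = enc(x, labels, mask)  # averaged masked-prediction loss over layers
```

Supervising an intermediate layer this way pushes content information lower in the stack, which is the behaviour the ILS-SSL summary describes.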