On the Effectiveness of Speech Self-supervised Learning for Music
- URL: http://arxiv.org/abs/2307.05161v1
- Date: Tue, 11 Jul 2023 10:37:57 GMT
- Title: On the Effectiveness of Speech Self-supervised Learning for Music
- Authors: Yinghao Ma, Ruibin Yuan, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin,
Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Ruibo Liu, Gus
Xia, Roger Dannenberg, Yike Guo, Jie Fu
- Abstract summary: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications.
We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, referred to as music2vec and musicHuBERT, respectively.
Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While previous SSL models pre-trained on music recordings have been mostly closed-source, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, and refer to them as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate their performance on 13 different MIR tasks. Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify the limitations of such existing speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, we also give empirical suggestions for designing future musical SSL strategies and paradigms.
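To make the evaluation side concrete, below is a minimal probing sketch in the spirit of this setup: a frozen SSL backbone turns a music clip into one feature vector, and only a shallow classifier is trained per MIR task. It assumes the Hugging Face transformers and torch packages and uses the public facebook/hubert-base-ls960 speech checkpoint as a stand-in; the paper's music2vec/musicHuBERT weights and the 10-class genre head are illustrative assumptions, not taken from the paper.

import torch
from transformers import AutoFeatureExtractor, HubertModel

# Public speech checkpoint as a stand-in for musicHuBERT (assumption).
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def embed(waveform, sr=16000):
    # waveform: 1-D float array of audio samples at 16 kHz
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)            # one 768-d clip vector

# Frozen-backbone probing: only this linear head is trained per task.
probe = torch.nn.Linear(768, 10)  # hypothetical 10-class task, e.g. genre tags

A frozen backbone with shallow per-task probes is a common way such multi-task MIR benchmarks are run; swapping in a checkpoint pre-trained on music is the variable the paper studies.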
Related papers
- Mispronunciation detection using self-supervised speech representations
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies
Self-supervised learning models (SSL models) have been trained using large amounts of unlabeled data in the fields of speech processing and music classification.
We report the results of experiments comparing SSL models on three different tasks (i.e., singer identification, singing voice transcription, and singing technique classification) as an initial exploration and discuss these findings.
arXiv Detail & Related papers (2023-06-22T07:47:18Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in masked language modelling (MLM)-style acoustic pre-training (a schematic sketch of this objective follows the entry).
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
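The MLM-style objective MERT describes can be sketched generically: a teacher assigns every frame a discrete pseudo-label, a random mask hides a subset of frames, and the student is penalised only on the masked positions. All shapes, the vocabulary size, and the mask rate below are illustrative assumptions, not MERT's actual configuration.

import torch
import torch.nn.functional as F

def masked_prediction_loss(student_logits, teacher_labels, mask):
    # student_logits: (B, T, V); teacher_labels: (B, T); mask: (B, T) bool
    # Loss is computed only at masked frames.
    return F.cross_entropy(student_logits[mask], teacher_labels[mask])

B, T, V = 2, 100, 500                 # batch, frames, pseudo-label vocabulary
logits = torch.randn(B, T, V)         # student predictions per frame
labels = torch.randint(0, V, (B, T))  # teacher pseudo-labels (e.g. cluster ids)
mask = torch.rand(B, T) < 0.3         # ~30% of frames masked
loss = masked_prediction_loss(logits, labels, mask)

Restricting the loss to masked frames is what forces the model to infer acoustic content from context rather than copy its input.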
- BEATs: Audio Pre-Training with Acoustic Tokenizers
The success of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask-and-label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model (a rough sketch of the first-iteration tokenizer follows this entry).
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
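As a rough illustration of BEATs' first-iteration tokenizer, the sketch below projects each spectrogram frame through a frozen random matrix and labels it with the nearest entry of a frozen random codebook; the resulting ids serve as targets for mask-and-label training. All dimensions are invented for the example and are not the paper's settings.

import torch

torch.manual_seed(0)
proj = torch.randn(128, 256)       # frozen random projection: mel -> embedding
codebook = torch.randn(1024, 256)  # frozen random codebook of 1024 entries

def tokenize(mel_frames):
    # mel_frames: (T, 128) log-mel frames -> (T,) discrete token ids
    z = mel_frames @ proj             # (T, 256) projected frames
    dists = torch.cdist(z, codebook)  # (T, 1024) distances to all codewords
    return dists.argmin(dim=1)        # nearest codeword id per frame

targets = tokenize(torch.randn(50, 128))  # labels for masked prediction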
- MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning
Music2Vec is a framework exploring different SSL algorithmic components and tricks for music audio recordings.
Our model achieves results comparable to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with less than 2% of the parameters of the latter.
arXiv Detail & Related papers (2022-12-05T16:04:26Z)
- The Ability of Self-Supervised Speech Models for Audio Representations
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks
Different self-supervised learning (SSL) tasks reveal different features of the data.
This work aims to combine multiple SSL tasks (Multi-SSL) into representations that generalize well across downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly (a schematic of the intermediate-layer loss follows this entry).
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
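A schematic of the intermediate-layer idea, under the assumption of a HuBERT-style masked pseudo-label objective: the same loss is applied through small per-layer heads at selected intermediate layers in addition to the top layer. Layer indices, head design, and uniform weighting here are illustrative choices, not the paper's exact settings.

import torch
import torch.nn.functional as F

def ils_loss(hidden_states, heads, labels, mask):
    # hidden_states: list of (B, T, D) tensors, one per transformer layer
    # heads: dict layer_index -> Linear(D, V); labels: (B, T); mask: (B, T) bool
    losses = []
    for layer, head in heads.items():
        logits = head(hidden_states[layer])                  # (B, T, V)
        losses.append(F.cross_entropy(logits[mask], labels[mask]))
    return sum(losses) / len(losses)

B, T, D, V = 2, 100, 768, 500
hidden_states = [torch.randn(B, T, D) for _ in range(13)]  # embeddings + 12 layers
heads = {4: torch.nn.Linear(D, V), 8: torch.nn.Linear(D, V), 12: torch.nn.Linear(D, V)}
labels = torch.randint(0, V, (B, T))
mask = torch.rand(B, T) < 0.3
loss = ils_loss(hidden_states, heads, labels, mask)

Supervising layers 4 and 8 alongside the top layer pushes content information into earlier layers, which is the behaviour the entry above describes.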