MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music
Audio Representation Learning
- URL: http://arxiv.org/abs/2212.02508v1
- Date: Mon, 5 Dec 2022 16:04:26 GMT
- Title: MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music
Audio Representation Learning
- Authors: Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Chenghua Lin, Xingran
Chen, Anton Ragni, Hanzhi Yin, Zhijie Hu, Haoyu He, Emmanouil Benetos,
Norbert Gyenge, Ruibo Liu and Jie Fu
- Abstract summary: Music2Vec is a framework exploring different SSL algorithmic components and tricks for music audio recordings.
Our model achieves results comparable to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with less than 2% of the latter's parameters.
- Score: 41.633972123961094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The deep learning community has witnessed exponentially growing
interest in self-supervised learning (SSL). However, it remains unexplored how
to build a framework for learning useful representations of raw music waveforms
in a self-supervised manner. In this work, we design Music2Vec, a framework
exploring different SSL algorithmic components and tricks for music audio
recordings. Our model achieves results comparable to the state-of-the-art
(SOTA) music SSL model Jukebox, despite being significantly smaller, with less
than 2% of the latter's parameters. The model will be released on Hugging Face
(see https://huggingface.co/m-a-p/music2vec-v1).
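Since the abstract points to a public checkpoint, extracting representations should take only a few lines. The sketch below is a minimal, unofficial example: it assumes the checkpoint follows the standard transformers audio interface (data2vec-style) and expects 16 kHz mono input; the model card is authoritative for exact usage.
```python
# Minimal sketch of extracting Music2Vec representations with Hugging Face
# transformers. Assumes a data2vec-style audio interface and 16 kHz mono
# input; consult the model card for the exact recommended usage.
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor

model = AutoModel.from_pretrained("m-a-p/music2vec-v1")
extractor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/music2vec-v1")

waveform = torch.randn(16000 * 5)  # placeholder: 5 seconds of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, hidden_dim)

clip_embedding = hidden.mean(dim=1)  # mean-pool frames into one clip-level vector
```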
Related papers
- On the Effectiveness of Speech Self-supervised Learning for Music [45.43336822496942]
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications.
We explore the music adaptation of SSL with two distinct speech models, data2vec 1.0 and HuBERT.
Our findings suggest that training with music data can generally improve performance on MIR tasks, even when models are trained using paradigms designed for speech.
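Both models follow a masked-prediction recipe; the data2vec variant regresses continuous targets produced by an exponential-moving-average (EMA) teacher at masked positions. A rough sketch of that training step, with illustrative shapes and helper names that are not taken from either paper's code:
```python
# Sketch of data2vec-style self-supervised training: a student regresses the
# EMA teacher's representations at masked frames (names/shapes illustrative).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track the student as an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec_step(student, teacher, frames, mask):
    # frames: (batch, T, dim) acoustic features; mask: (batch, T) bool.
    with torch.no_grad():
        targets = teacher(frames)                  # continuous teacher targets
    preds = student(frames.masked_fill(mask.unsqueeze(-1), 0.0))
    return F.mse_loss(preds[mask], targets[mask])  # loss on masked frames only
```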
arXiv Detail & Related papers (2023-07-11T10:37:57Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
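The core recipe here, teacher-assigned pseudo labels predicted at masked positions, fits in a few lines. A schematic sketch only, with invented names and shapes, not MERT's actual implementation:
```python
# Sketch of MLM-style acoustic pre-training with teacher pseudo labels.
import torch
import torch.nn.functional as F

def mlm_pretrain_step(student, teacher, frames, mask_prob=0.5):
    # frames: (batch, T, dim) acoustic features.
    with torch.no_grad():
        labels = teacher(frames)                   # (batch, T) discrete pseudo labels
    mask = torch.rand(frames.shape[:2], device=frames.device) < mask_prob
    corrupted = frames.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = student(corrupted)                    # (batch, T, vocab_size)
    # As in MLM, the loss is computed only at the masked positions.
    return F.cross_entropy(logits[mask], labels[mask])
```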
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has flourished in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
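The first-iteration tokenizer is strikingly simple: frames are projected by a frozen random matrix and snapped to the nearest entry of a random codebook. A minimal sketch of that idea follows; dimensions and names are illustrative, not taken from the BEATs code.
```python
# Sketch of a first-iteration random-projection acoustic tokenizer:
# nothing is learned, yet it yields usable discrete training targets.
import torch

class RandomProjectionTokenizer(torch.nn.Module):
    def __init__(self, dim, codebook_size=1024, proj_dim=256):
        super().__init__()
        # Frozen random projection and codebook, stored as buffers.
        self.register_buffer("proj", torch.randn(dim, proj_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, proj_dim))

    @torch.no_grad()
    def forward(self, frames):                     # (batch, T, dim)
        z = frames @ self.proj                     # project each frame
        dists = torch.cdist(z, self.codebook.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)                # (batch, T) discrete labels
```
A transformer is then trained to predict these labels at masked positions, and the next iteration replaces this random tokenizer with one distilled from the trained model, as the summary above describes.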
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective that accommodates both self-augmented positives and negatives sampled from the same piece of music.
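Stripped to its core, such an objective treats two masked views of each clip as a positive pair and everything else in the batch as negatives. A generic NT-Xent-style sketch; PEMR's actual frame-masking scheme and loss details follow the paper, not this simplification:
```python
# Sketch of a contrastive (NT-Xent-style) loss over two views per clip.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two masked views of the same clips.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2B, dim)
    sim = z @ z.t() / temperature                         # cosine similarities
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float("-inf"))             # exclude self-pairs
    batch = z1.size(0)
    # The positive for view i is its counterpart at index (i + B) mod 2B.
    targets = torch.arange(2 * batch, device=sim.device).roll(batch)
    return F.cross_entropy(sim, targets)
```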
arXiv Detail & Related papers (2022-03-17T07:11:42Z)
- Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning [11.247238840604282]
A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition.
We contribute a cross-modal representation-learning model, which extracts chord and melodic information from the audio.
Experiments show that our model captures major audio information and outperforms baselines in generation quality.
arXiv Detail & Related papers (2021-12-30T16:05:30Z)
- Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
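COLA's positive pairs are segments cropped from the same recording, scored with a learned bilinear similarity against in-batch negatives. A compact sketch of that objective, with illustrative dimensions; the paper specifies the exact encoder and setup:
```python
# Sketch of a COLA-style bilinear contrastive objective.
import torch
import torch.nn.functional as F

class BilinearContrast(torch.nn.Module):
    # Anchors and positives embed two segments of the same recording;
    # the other items in the batch serve as negatives.
    def __init__(self, dim=512):
        super().__init__()
        self.w = torch.nn.Linear(dim, dim, bias=False)    # bilinear weight

    def forward(self, anchors, positives):                # both (batch, dim)
        scores = anchors @ self.w(positives).t()          # (batch, batch)
        targets = torch.arange(anchors.size(0), device=scores.device)
        return F.cross_entropy(scores, targets)           # diagonal = true pairs
```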
arXiv Detail & Related papers (2020-10-21T11:56:22Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)