Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems
- URL: http://arxiv.org/abs/2309.12134v1
- Date: Thu, 21 Sep 2023 14:54:48 GMT
- Title: Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems
- Authors: Luis Carvalho, Tobias Washüttl and Gerhard Widmer
- Abstract summary: We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real musical content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
- Score: 3.997809845676912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linking sheet music images to audio recordings remains a key problem for the
development of efficient cross-modal music retrieval systems. One of the
fundamental approaches toward this task is to learn a cross-modal embedding
space via deep neural networks that is able to connect short snippets of audio
and sheet music. However, the scarcity of annotated data from real musical
content affects the capability of such methods to generalize to real retrieval
scenarios. In this work, we investigate whether we can mitigate this limitation
with self-supervised contrastive learning, by exposing a network to a large
amount of real music data as a pre-training step, by contrasting randomly
augmented views of snippets of both modalities, namely audio and sheet images.
Through a number of experiments on synthetic and real piano data, we show that
pre-trained models are able to retrieve snippets with better precision in all
scenarios and pre-training configurations. Encouraged by these results, we
employ the snippet embeddings in the higher-level task of cross-modal piece
identification and conduct more experiments on several retrieval
configurations. In this task, we observe that the retrieval quality improves
from 30% up to 100% when real music data is present. We then conclude by
arguing for the potential of self-supervised contrastive learning for
alleviating the annotated data scarcity in multi-modal music retrieval models.
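To make the approach described in the abstract more concrete, below is a minimal PyTorch sketch of the two ingredients it mentions: SimCLR-style contrastive pre-training on randomly augmented views of unlabelled snippets (done per modality), and a simple snippet-voting scheme for cross-modal piece identification. The encoder architecture, snippet shapes, augmentations and voting rule are illustrative assumptions and are not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnippetEncoder(nn.Module):
    """Small CNN mapping a 2-D snippet (log-spectrogram excerpt or sheet-image patch)
    to an L2-normalised embedding. Architecture and sizes are placeholders."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style loss: two augmented views of the same snippet are positives,
    every other snippet in the batch acts as a negative."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                            # (2n, d)
    sim = z @ z.t() / temperature                              # cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def identify_piece(query_emb, db_emb, db_piece_ids):
    """Toy cross-modal piece identification: each query snippet votes for the piece
    of its nearest database snippet; the most-voted piece wins (illustrative rule only)."""
    nearest = (query_emb @ db_emb.t()).argmax(dim=1)
    return db_piece_ids[nearest].mode().values.item()

# Self-supervised pre-training on one modality; the same recipe applies to the other.
encoder = SnippetEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for _ in range(10):                                            # toy number of steps
    snippets = torch.randn(16, 1, 92, 42)                      # stand-in for unlabelled real snippets
    view1 = snippets + 0.05 * torch.randn_like(snippets)       # stand-in random augmentations
    view2 = snippets + 0.05 * torch.randn_like(snippets)
    loss = nt_xent(encoder(view1), encoder(view2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After pre-training (and fine-tuning on aligned audio-sheet pairs), snippet embeddings
# can be aggregated for piece identification, e.g.:
db_emb = F.normalize(torch.randn(100, 128), dim=-1)            # fake sheet-snippet database
db_piece_ids = torch.randint(0, 10, (100,))                    # fake piece labels
query_emb = encoder(torch.randn(8, 1, 92, 42))                 # embeddings of a query recording
print(identify_piece(query_emb, db_emb, db_piece_ids))
```

According to the abstract, the pre-trained snippet embeddings are evaluated both for snippet-level retrieval and for the higher-level piece identification task; the majority-vote aggregation above is only one plausible strategy for the latter.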
Related papers
- An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z)
- Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval [4.722882736419499]
Cross-modal music retrieval can connect sheet music images to audio recordings.
We propose a cross-modal recurrent network that learns joint embeddings to summarize longer passages of corresponding audio and sheet music.
We conduct a number of experiments on synthetic and real piano data and scores, showing that our proposed recurrent method leads to more accurate retrieval in all possible configurations.
arXiv Detail & Related papers (2023-09-21T14:30:02Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Comparision Of Adversarial And Non-Adversarial LSTM Music Generative Models [2.569647910019739]
This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data.
The evaluation indicates that adversarial training produces more aesthetically pleasing music.
arXiv Detail & Related papers (2022-11-01T20:23:49Z)
- Supervised and Unsupervised Learning of Audio Representations for Music Understanding [9.239657838690226]
We show how the domain of pre-training datasets affects the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-10-07T20:07:35Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)