Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval
- URL: http://arxiv.org/abs/2309.12158v1
- Date: Thu, 21 Sep 2023 15:11:16 GMT
- Title: Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval
- Authors: Luis Carvalho and Gerhard Widmer
- Abstract summary: Cross-modal deep learning is used to learn joint embedding spaces that link the two distinct modalities - audio and sheet music images.
While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale deployment of this methodology.
We identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios.
- Score: 4.722882736419499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A range of applications of multi-modal music information retrieval is centred
around the problem of connecting large collections of sheet music (images) to
corresponding audio recordings, that is, identifying pairs of audio and score
excerpts that refer to the same musical content. One of the typical and most
recent approaches to this task employs cross-modal deep learning architectures
to learn joint embedding spaces that link the two distinct modalities - audio
and sheet music images. While there has been steady improvement on this front
over the past years, a number of open problems still prevent large-scale
deployment of this methodology. In this article we attempt to provide an
insightful examination of the current developments on audio-sheet music
retrieval via deep learning methods. We first identify a set of main challenges
on the road towards robust and large-scale cross-modal music retrieval in real
scenarios. We then highlight the steps we have taken so far to address some of
these challenges, documenting step-by-step improvement along several
dimensions. We conclude by analysing the remaining challenges and presenting
ideas for solving them, in order to pave the way to a unified and robust methodology
for cross-modal music retrieval.
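The core idea of the abstract, linking audio and sheet-music snippets through a joint embedding space, can be illustrated with a minimal retrieval sketch. The projection matrices below are random stand-ins for the final layers of two trained modality-specific encoders; all dimensions and names are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: audio snippets and sheet-image snippets are
# first encoded (e.g. by modality-specific CNNs) into feature vectors,
# then projected into a shared embedding space.
AUDIO_DIM, SHEET_DIM, EMBED_DIM = 128, 256, 32

# Stand-ins for learned projection matrices (in practice, the output
# layers of two trained encoder networks).
W_audio = rng.normal(size=(AUDIO_DIM, EMBED_DIM))
W_sheet = rng.normal(size=(SHEET_DIM, EMBED_DIM))

def embed(features, W):
    """Project features into the shared space and L2-normalize,
    so that dot products equal cosine similarities."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(audio_query, sheet_db):
    """Return sheet-music snippet indices ranked by cosine similarity
    to the audio query, best match first."""
    sims = embed(sheet_db, W_sheet) @ embed(audio_query, W_audio)
    return np.argsort(-sims)

# Toy example: 5 sheet snippets in the database, one audio query.
sheet_db = rng.normal(size=(5, SHEET_DIM))
audio_query = rng.normal(size=AUDIO_DIM)
ranking = retrieve(audio_query, sheet_db)
print(ranking)  # indices of sheet snippets, best match first
```

In a real system the projections are trained (e.g. with a contrastive or ranking loss) so that matching audio-sheet pairs land close together in the shared space; only the retrieval mechanics are shown here.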
Related papers
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z) - Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement [10.714947060480426]
We propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model.
Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines.
arXiv Detail & Related papers (2024-08-27T16:18:51Z) - Towards Explainable and Interpretable Musical Difficulty Estimation: A Parameter-efficient Approach [49.2787113554916]
Estimating music piece difficulty is important for organizing educational music collections.
Our work employs explainable descriptors for difficulty estimation in symbolic music representations.
Our approach, evaluated in piano repertoire categorized in 9 classes, achieved 41.4% accuracy independently, with a mean squared error (MSE) of 1.7.
arXiv Detail & Related papers (2024-08-01T11:23:42Z) - LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation [49.89372182441713]
We introduce LARP, a multi-modal cold-start playlist continuation model.
Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss.
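Each of LARP's stages relies on a contrastive objective over paired embeddings. As a generic illustration (not LARP's exact loss), the following sketches an InfoNCE-style loss in which row i of one batch should match row i of the other and mismatch every other row; all names and the temperature value are assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE contrastive loss over a batch of paired
    embeddings. Both inputs have shape (batch, dim); row i of
    `anchors` is the positive pair of row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal; the loss is their mean
    # negative log-probability.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
# Perfectly aligned pairs should yield a lower loss than random pairs.
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                  # anchor == positive
loss_random = info_nce(z, rng.normal(size=(8, 16)))
print(loss_aligned < loss_random)  # True
```

The same function could in principle be applied at each of LARP's abstraction levels (within-track, track-track, track-playlist) by changing what the anchor and positive batches contain.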
arXiv Detail & Related papers (2024-06-20T14:02:15Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z) - Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
We observe that retrieval quality improves from 30% up to 100% when real music data is present.
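The summary mentions using snippet embeddings for the higher-level task of cross-modal piece identification. One common way to lift snippet-level retrieval to piece level is to aggregate the best matches of many query snippets, e.g. by majority vote; the function and piece IDs below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def identify_piece(snippet_matches):
    """Aggregate snippet-level retrieval results into one piece-level
    prediction by majority vote. `snippet_matches` holds, for each
    query snippet, the piece ID of its best-matching database snippet."""
    votes = Counter(snippet_matches)
    piece, _ = votes.most_common(1)[0]
    return piece

# Toy example: 6 audio query snippets matched (hypothetically) against
# a sheet-music database; most votes go to one piece.
matches = ["mozart_k331", "chopin_op9", "mozart_k331",
           "mozart_k331", "bach_bwv846", "mozart_k331"]
print(identify_piece(matches))  # mozart_k331
```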
arXiv Detail & Related papers (2023-09-21T14:54:48Z) - Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval [4.722882736419499]
Cross-modal music retrieval can connect sheet music images to audio recordings.
We propose a cross-modal recurrent network that learns joint embeddings to summarize longer passages of corresponding audio and sheet music.
We conduct a number of experiments on synthetic and real piano data and scores, showing that our proposed recurrent method leads to more accurate retrieval in all possible configurations.
arXiv Detail & Related papers (2023-09-21T14:30:02Z) - AutoMatch: A Large-scale Audio Beat Matching Benchmark for Boosting Deep Learning Assistant Video Editing [7.672758847025309]
Short video content cannot exist independently of the valuable editing work contributed by numerous video creators.
In this paper, we investigate audio beat matching (ABM), which aims to recommend the proper transition time stamps based on the background music.
This technique helps to ease the labor-intensive work during video editing, saving energy for creators so that they can focus more on the creativity of video content.
arXiv Detail & Related papers (2023-03-03T12:30:09Z) - Late multimodal fusion for image and audio music transcription [0.0]
Multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities.
We study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR (Optical Music Recognition) and AMT (Automatic Music Transcription) systems.
Two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
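As a minimal illustration of what a late-fusion combination can look like (an assumed stand-in, not one of the paper's four strategies), one can average per-symbol probability distributions from the two systems and decode afterwards:

```python
import numpy as np

def late_fuse(p_omr, p_amt, alpha=0.5):
    """Illustrative late fusion: combine per-frame symbol probability
    distributions from an OMR system and an AMT system by weighted
    averaging, then decode the most probable symbol per frame.
    Both inputs: (frames, vocab); `alpha` weights the OMR stream."""
    fused = alpha * p_omr + (1.0 - alpha) * p_amt
    return fused.argmax(axis=1)

# Toy vocabulary of 3 symbols over 2 frames: the systems disagree on
# frame 1, and the fused distribution resolves the disagreement.
p_omr = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
p_amt = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.3, 0.5]])
decoded = late_fuse(p_omr, p_amt)
print(decoded)  # [0 1]
```

Real late-fusion schemes typically operate on full hypothesis sequences or lattices rather than frame-wise distributions; this sketch only shows the weighted-combination step.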
arXiv Detail & Related papers (2022-04-06T20:00:33Z) - Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.