Related papers: Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music

Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music

URL: http://arxiv.org/abs/2405.12105v2
Date: Tue, 21 May 2024 08:16:00 GMT
Title: Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music
Authors: Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet,
Abstract summary: Sheet Music Transformer++ is an end-to-end model that is able to transcribe full-page polyphonic music scores. We conduct several experiments on a full-page extension of a public polyphonic transcription dataset.
Score: 12.779526750915707
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Optical Music Recognition is a field that has progressed significantly, bringing accurate systems that transcribe effectively music scores into digital formats. Despite this, there are still several limitations that hinder OMR from achieving its full potential. Specifically, state of the art OMR still depends on multi-stage pipelines for performing full-page transcription, as well as it has only been demonstrated in monophonic cases, leaving behind very relevant engravings. In this work, we present the Sheet Music Transformer++, an end-to-end model that is able to transcribe full-page polyphonic music scores without the need of a previous Layout Analysis step. This is done thanks to an extensive curriculum learning-based pretraining with synthetic data generation. We conduct several experiments on a full-page extension of a public polyphonic transcription dataset. The experimental outcomes confirm that the model is competent at transcribing full-page pianoform scores, marking a noteworthy milestone in end-to-end OMR transcription.

Related papers

Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions.<n>We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR [44.85037245145321]
Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores.<n>Our model exhibits the strong ability to generalize across various typeset scores.
arXiv Detail & Related papers (2025-06-23T19:35:59Z)
Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders. During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image. We introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data. We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores. We propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering system on synthetic audio, followed by fine-tuning the model using recordings of human performance.
arXiv Detail & Related papers (2024-05-22T10:52:04Z)
Practical End-to-End Optical Music Recognition for Pianoform Music [3.69298824193862]
We define a sequential format called Linearized MusicXML, allowing to train an end-to-end model directly. We create a benchmarking typeset OMR with MusicXML ground truth based on the OpenScore Lieder corpus. We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model.
arXiv Detail & Related papers (2024-03-20T17:26:22Z)
Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription [13.960714900433269]
Sheet Music Transformer is the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively.
arXiv Detail & Related papers (2024-02-12T11:52:21Z)
TrOMR:Transformer-Based Polyphonic Optical Music Recognition [26.14383240933706]
We propose a transformer-based approach with excellent global perceptual capability for end-to-end polyphonic OMR, called TrOMR. We also introduce a novel consistency loss function and a reasonable approach for data annotation to improve recognition accuracy for complex music scores.
arXiv Detail & Related papers (2023-08-18T08:06:27Z)
MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types. We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input. In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
Late multimodal fusion for image and audio music transcription [0.0]
multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by image and audio modalities. We study four combination approaches in order to merge, for the first time, the hypotheses regarding end-to-end OMR and AMT systems. Two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
arXiv Detail & Related papers (2022-04-06T20:00:33Z)
Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
An Empirical Evaluation of End-to-End Polyphonic Optical Music Recognition [24.377724078096144]
Piano and orchestral scores frequently exhibit polyphonic passages, which add a second dimension to the task. We propose two novel formulations for end-to-end polyphonic OMR. We observe a new state-of-the-art performance with our multi-sequence detection decoder, RNNDecoder.
arXiv Detail & Related papers (2021-08-03T22:04:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.