Related papers: LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

URL: http://arxiv.org/abs/2506.19065v1
Date: Mon, 23 Jun 2025 19:35:59 GMT
Title: LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
Authors: Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, Noah A. Smith,
Abstract summary: Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores.<n>Our model exhibits the strong ability to generalize across various typeset scores.
Score: 44.85037245145321
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose Legato, a new end-to-end transformer model for optical music recognition (OMR). Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct experiments on a range of datasets and demonstrate that our model achieves state-of-the-art performance. Given the lack of a standardized evaluation for end-to-end OMR, we comprehensively compare our model against the previous state of the art using a diverse set of metrics.

Related papers

Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions.<n>We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders. During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image. We introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music [12.779526750915707]
We present the first truly end-to-end approach for page-level Optical Music Recognition. Our system processes an entire music score page and outputs a complete transcription in a music encoding format. The results demonstrate that our system not only successfully transcribes full-page music scores but also outperforms the commercial tool in both zero-shot settings and after fine-tuning with the target domain.
arXiv Detail & Related papers (2024-05-20T15:21:48Z)
Practical End-to-End Optical Music Recognition for Pianoform Music [3.69298824193862]
We define a sequential format called Linearized MusicXML, allowing to train an end-to-end model directly. We create a benchmarking typeset OMR with MusicXML ground truth based on the OpenScore Lieder corpus. We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model.
arXiv Detail & Related papers (2024-03-20T17:26:22Z)
PBSCR: The Piano Bootleg Score Composer Recognition Dataset [5.314803183185992]
PBSCR is a dataset for studying composer recognition of classical piano music. It contains 40,000 62x64 bootleg score images for a 9-class recognition task, 100,000 62x64 bootleg score images for a 100-class recognition task, and 29,310 unlabeled variable-length bootleg score images for pretraining.
arXiv Detail & Related papers (2024-01-30T07:50:32Z)
A Unified Representation Framework for the Evaluation of Optical Music Recognition Systems [4.936226952764696]
We identify the need for a common music representation language and propose the Music Tree Notation (MTN) format. This format represents music as a set of primitives that group together into higher-abstraction nodes. We have also developed a specific set of OMR metrics and a typeset score dataset as a proof of concept of this idea.
arXiv Detail & Related papers (2023-12-20T10:45:22Z)
Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM) By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters. Our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.