LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
- URL: http://arxiv.org/abs/2506.19065v1
- Date: Mon, 23 Jun 2025 19:35:59 GMT
- Title: LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
- Authors: Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, Noah A. Smith,
- Abstract summary: Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores. Our model exhibits the strong ability to generalize across various typeset scores.
- Score: 44.85037245145321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Legato, a new end-to-end transformer model for optical music recognition (OMR). Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct experiments on a range of datasets and demonstrate that our model achieves state-of-the-art performance. Given the lack of a standardized evaluation for end-to-end OMR, we comprehensively compare our model against the previous state of the art using a diverse set of metrics.
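To make the "concise, human-readable" description of ABC notation concrete, here is a minimal Python sketch. The tune is hand-written for illustration and is not output from Legato, and `split_abc` is a hypothetical helper, not part of any released code. It relies only on the ABC convention that single-letter `X:`, `T:`, `M:`, `L:`, and `K:` fields form the tune header, with `K:` as its last field.

```python
# Hand-written illustrative tune in ABC notation (not produced by Legato).
ABC_TUNE = """X:1
T:Example Scale
M:4/4
L:1/4
K:C
C D E F | G A B c |]
"""


def split_abc(tune: str) -> tuple[dict[str, str], str]:
    """Split an ABC tune into header fields and the music body.

    Hypothetical helper: handles only simple single-letter header
    fields and treats everything after the K: field as the body.
    """
    headers: dict[str, str] = {}
    body_lines: list[str] = []
    in_body = False
    for line in tune.strip().splitlines():
        is_field = len(line) > 1 and line[0].isalpha() and line[1] == ":"
        if not in_body and is_field:
            headers[line[0]] = line[2:].strip()
            if line[0] == "K":
                in_body = True  # K: closes the tune header
        else:
            in_body = True
            body_lines.append(line)
    return headers, "\n".join(body_lines)


headers, body = split_abc(ABC_TUNE)
print(headers)
print(body)
```

The header carries the reference number, title, meter, unit note length, and key; the body encodes two bars of an ascending C major scale, with `|` as the bar line.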
Related papers
- Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
- Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
First, we introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
- End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music [12.779526750915707]
We present the first truly end-to-end approach for page-level Optical Music Recognition.
Our system processes an entire music score page and outputs a complete transcription in a music encoding format.
The results demonstrate that our system not only successfully transcribes full-page music scores but also outperforms the commercial tool in both zero-shot settings and after fine-tuning with the target domain.
arXiv Detail & Related papers (2024-05-20T15:21:48Z)
- Practical End-to-End Optical Music Recognition for Pianoform Music [3.69298824193862]
We define a sequential format called Linearized MusicXML, which allows an end-to-end model to be trained directly.
We create a benchmark for typeset OMR with MusicXML ground truth based on the OpenScore Lieder corpus.
We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model.
arXiv Detail & Related papers (2024-03-20T17:26:22Z)
- A Unified Representation Framework for the Evaluation of Optical Music Recognition Systems [4.936226952764696]
We identify the need for a common music representation language and propose the Music Tree Notation (MTN) format.
This format represents music as a set of primitives that group together into higher-abstraction nodes.
We have also developed a specific set of OMR metrics and a typeset score dataset as a proof of concept of this idea.
arXiv Detail & Related papers (2023-12-20T10:45:22Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
The paper proposes the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.