Toward a More Complete OMR Solution
- URL: http://arxiv.org/abs/2409.00316v1
- Date: Sat, 31 Aug 2024 01:09:12 GMT
- Title: Toward a More Complete OMR Solution
- Authors: Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, Noah A. Smith
- Abstract summary: Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
First, we introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
- Score: 49.74172035862698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.
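The two-stage pipeline described in the abstract (detect music objects, then predict pairwise relationships to assemble a notation graph) can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the detection stage assumes the `ultralytics` YOLOv8 API with a placeholder weights file, and the assembly stage uses a hypothetical pairwise MLP classifier; class names, feature choices, and the 0.5 threshold are assumptions.

```python
# Minimal two-stage OMR sketch (illustrative only).
# Stage 1 assumes the ultralytics YOLOv8 API; stage 2 is a hypothetical,
# untrained pairwise-relationship classifier in PyTorch, not the paper's model.
import itertools

import torch
import torch.nn as nn
from ultralytics import YOLO


class PairwiseAssembler(nn.Module):
    """Scores whether two detected music objects should be linked by an edge."""

    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(num_classes, 32)
        # Input: two class embeddings (32 each) + two boxes (4 coords each).
        self.mlp = nn.Sequential(
            nn.Linear(2 * 32 + 8, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cls_a, box_a, cls_b, box_b):
        feats = torch.cat(
            [self.embed(cls_a), self.embed(cls_b), box_a, box_b], dim=-1
        )
        return self.mlp(feats).squeeze(-1)  # edge logit for this pair


# Stage 1: detect music objects (weights path and image are placeholders).
detector = YOLO("music_objects_yolov8.pt")
result = detector("score_page.png")[0]
boxes = result.boxes.xyxyn          # normalized (x1, y1, x2, y2) per detection
classes = result.boxes.cls.long()   # predicted class indices

# Stage 2: score every ordered pair of detections and keep likely edges.
assembler = PairwiseAssembler(num_classes=len(result.names))
edges = []
for i, j in itertools.permutations(range(len(boxes)), 2):
    logit = assembler(classes[i], boxes[i], classes[j], boxes[j])
    if torch.sigmoid(logit) > 0.5:
        edges.append((i, j))
print(f"{len(boxes)} objects, {len(edges)} candidate relationships")
```

In the paper's actual pipeline, the assembly model is trained with supervision derived from MUSCIMA++ v2.0 relationship annotations aligned to the detector's output; the classifier above is untrained and only illustrates the interface between the two stages.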
Related papers
- Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation [0.0]
Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI.
This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music.
arXiv Detail & Related papers (2024-08-27T12:34:41Z) - Cue Point Estimation using Object Detection [20.706469085872516]
Cue points indicate possible temporal boundaries in a transition between two pieces of music in DJ mixing.
We present a novel method for automatic cue point estimation, interpreted as a computer vision object detection task.
Our proposed system is based on a pre-trained object detection transformer which we fine-tune on our novel cue point dataset.
arXiv Detail & Related papers (2024-07-09T12:56:30Z) - Practical End-to-End Optical Music Recognition for Pianoform Music [3.69298824193862]
We define a sequential format called Linearized MusicXML, allowing to train an end-to-end model directly.
We create a benchmark dataset for typeset OMR with MusicXML ground truth, based on the OpenScore Lieder corpus.
We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model.
arXiv Detail & Related papers (2024-03-20T17:26:22Z) - A Unified Representation Framework for the Evaluation of Optical Music Recognition Systems [4.936226952764696]
We identify the need for a common music representation language and propose the Music Tree Notation (MTN) format.
This format represents music as a set of primitives that group together into higher-abstraction nodes.
We have also developed a specific set of OMR metrics and a typeset score dataset as a proof of concept of this idea.
arXiv Detail & Related papers (2023-12-20T10:45:22Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - 1st Place Solution to ECCV-TAO-2020: Detect and Represent Any Object for
Tracking [19.15537335764895]
We extend the classical tracking-by-detection paradigm to this tracking-any-object task.
We learn appearance features to represent any object by training feature learning networks.
Simple linking strategies based on the most similar appearance features, together with a tracklet-level post-association module, are then applied to generate the final tracking results.
arXiv Detail & Related papers (2021-01-20T09:42:32Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)