An Empirical Evaluation of End-to-End Polyphonic Optical Music
Recognition
- URL: http://arxiv.org/abs/2108.01769v1
- Date: Tue, 3 Aug 2021 22:04:40 GMT
- Title: An Empirical Evaluation of End-to-End Polyphonic Optical Music
Recognition
- Authors: Sachinda Edirisooriya, Hao-Wen Dong, Julian McAuley, Taylor
Berg-Kirkpatrick
- Abstract summary: Piano and orchestral scores frequently exhibit polyphonic passages, which add a second dimension to the task.
We propose two novel formulations for end-to-end polyphonic OMR.
We observe a new state-of-the-art performance with our multi-sequence detection decoder, RNNDecoder.
- Score: 24.377724078096144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work has shown that neural architectures are able to perform optical
music recognition (OMR) on monophonic and homophonic music with high accuracy.
However, piano and orchestral scores frequently exhibit polyphonic passages,
which add a second dimension to the task. Monophonic and homophonic music can
be described as homorhythmic, or having a single musical rhythm. Polyphonic
music, on the other hand, can be seen as having multiple rhythmic sequences, or
voices, concurrently. We first introduce a workflow for creating large-scale
polyphonic datasets suitable for end-to-end recognition from sheet music
publicly available on the MuseScore forum. We then propose two novel
formulations for end-to-end polyphonic OMR -- one treating the problem as a
type of multi-task binary classification, and the other treating it as
multi-sequence detection. Building upon the encoder-decoder architecture and an
image encoder proposed in past work on end-to-end OMR, we propose two novel
decoder models -- FlagDecoder and RNNDecoder -- that correspond to the two
formulations. Finally, we compare the empirical performance of these end-to-end
approaches to polyphonic OMR and observe a new state-of-the-art performance
with our multi-sequence detection decoder, RNNDecoder.
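To make the two formulations concrete, the sketch below shows how a FlagDecoder-style head (multi-task binary classification) and an RNNDecoder-style head (multi-sequence detection) could sit on top of a frame-wise image encoder. This is a minimal PyTorch-style sketch based only on the abstract above: the layer widths, vocabulary sizes, and decoding details (fixed inner unroll, no autoregressive feedback) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the two decoder formulations described in the abstract.
# Assumes an image encoder (e.g. CNN + BiLSTM, as in prior end-to-end OMR work)
# that maps a staff image to frame-wise features of shape (batch, time, feat_dim).
import torch
import torch.nn as nn


class FlagDecoder(nn.Module):
    """Multi-task binary classification: each frame predicts an independent
    binary flag per symbol, so several concurrent symbols (a chord, or notes
    from different voices) can be active at the same time step."""

    def __init__(self, feat_dim: int, num_symbols: int):
        super().__init__()
        self.flags = nn.Linear(feat_dim, num_symbols)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, time, num_symbols) logits; trained with a per-flag
        # binary cross-entropy loss in this sketch.
        return self.flags(features)


class RNNDecoder(nn.Module):
    """Multi-sequence detection: a small inner RNN is unrolled at each frame
    to emit a list of symbols (terminated by an <end> token), so vertically
    stacked notes are decoded one after another."""

    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 256,
                 max_symbols_per_frame: int = 8):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)  # vocabulary includes <end>
        self.max_symbols = max_symbols_per_frame

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        batch, time, _ = features.shape
        logits = []
        for t in range(time):
            h = features.new_zeros(batch, self.cell.hidden_size)
            step_logits = []
            # Simplified unroll: fixed number of inner steps, no feedback of
            # the previously emitted symbol.
            for _ in range(self.max_symbols):
                h = self.cell(features[:, t, :], h)
                step_logits.append(self.out(h))
            logits.append(torch.stack(step_logits, dim=1))
        # (batch, time, max_symbols_per_frame, vocab_size)
        return torch.stack(logits, dim=1)
```

The key difference illustrated here: the FlagDecoder makes one joint multi-hot prediction per frame, while the RNNDecoder emits a variable-length symbol sequence per frame, which is what allows it to model concurrent voices as multiple sequences.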
Related papers
- PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations [0.3683202928838613]
Cadenza is a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas.
The proposed framework comprises two sequential stages: 1) Composer and 2) Performer.
Our framework is designed, researched and implemented with the objective of providing inspiration for musicians.
arXiv Detail & Related papers (2024-10-02T22:11:31Z) - Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
First, we introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on the detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z) - End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
We propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering system on synthetic audio, followed by fine-tuning the model using recordings of human performance.
arXiv Detail & Related papers (2024-05-22T10:52:04Z) - Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription [13.960714900433269]
Sheet Music Transformer is the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies.
Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively.
arXiv Detail & Related papers (2024-02-12T11:52:21Z) - Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long
Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, one of the first VAE-based methods to effectively model and generate long multi-track symbolic music.
We focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy.
arXiv Detail & Related papers (2024-01-15T08:41:01Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transform the text features extracted by the text encoder into a mel-spectrogram with the help of the VQ-VAE, and the vocoder then converts the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that the proposed approach can generate coherent, novel, complex, and harmonious symphonies when compared to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z) - Late multimodal fusion for image and audio music transcription [0.0]
Multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities.
We study four combination approaches in order to merge, for the first time, the hypotheses regarding end-to-end OMR and AMT systems.
Two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
arXiv Detail & Related papers (2022-04-06T20:00:33Z) - Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z) - Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.