Unaligned Supervision For Automatic Music Transcription in The Wild
- URL: http://arxiv.org/abs/2204.13668v1
- Date: Thu, 28 Apr 2022 17:31:43 GMT
- Title: Unaligned Supervision For Automatic Music Transcription in The Wild
- Authors: Ben Maman and Amit H. Bermano
- Abstract summary: NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-instrument Automatic Music Transcription (AMT), or the decoding of a
musical recording into semantic musical content, is one of the holy grails of
Music Information Retrieval. Current AMT approaches are restricted to piano and
(some) guitar recordings, due to the difficulty of data collection. In order to
overcome data collection barriers, previous AMT approaches attempt to employ
musical scores in the form of a digitized version of the same song or piece.
The scores are typically aligned using audio features and strenuous human
intervention to generate training labels. We introduce NoteEM, a method for
simultaneously training a transcriber and aligning the scores to their
corresponding performances, in a fully-automated process. Using this unaligned
supervision scheme, complemented by pseudo-labels and pitch-shift augmentation,
our method enables training on in-the-wild recordings with unprecedented
accuracy and instrumental variety. Using only synthetic data and unaligned
supervision, we report SOTA note-level accuracy on the MAPS dataset, and large
favorable margins on cross-dataset evaluations. We also demonstrate robustness
and ease of use; we report comparable results when training on a small, easily
obtainable, self-collected dataset, and we propose alternative labeling for the
MusicNet dataset, which we show to be more accurate. Our project page is
available at https://benadar293.github.io
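The abstract describes an EM-style loop: the transcriber's current frame-level predictions are used to align the unaligned score to the performance, and the aligned score then serves as training labels, complemented by pseudo-labels and pitch-shift augmentation. The sketch below shows one way such a loop could look; it is an assumption, not the authors' implementation. The model (a frame-level transcriber outputting per-pitch logits), the piano-roll score representation, the DTW cost, and the helper names (dtw_path, align_score_to_frames, training_step) are all hypothetical.
```python
# Hypothetical sketch of unaligned supervision: align the score to the model's
# current predictions with DTW, then train on the resulting frame labels.
import numpy as np
import torch
import torch.nn.functional as F


def dtw_path(cost):
    """Plain DTW over a (score_steps x audio_frames) cost matrix; returns the path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def align_score_to_frames(frame_probs, score_roll):
    """'E-step': align an unaligned score (piano roll, score_steps x pitches) to the
    model's frame probabilities (audio_frames x pitches) and return frame labels."""
    # Simplified cost: negative log-probability of the score's active pitches per frame.
    cost = -(score_roll @ np.log(frame_probs + 1e-6).T)   # (score_steps, audio_frames)
    labels = np.zeros_like(frame_probs)
    for step, frame in dtw_path(cost):
        labels[frame] = score_roll[step]
    return labels


def training_step(model, optimizer, spectrogram, score_roll):
    """'M-step': fit the transcriber to labels from the current alignment.
    Pseudo-labels and pitch-shift augmentation (transposing the audio and the
    score together), mentioned in the abstract, would complement this loop."""
    with torch.no_grad():
        frame_probs = torch.sigmoid(model(spectrogram)).cpu().numpy()
    labels = align_score_to_frames(frame_probs, score_roll)
    logits = model(spectrogram)
    targets = torch.as_tensor(labels, dtype=torch.float32, device=logits.device)
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
Iterating alignment and training in this fashion lets the labels and the transcriber improve together; how the paper schedules these steps, filters alignments, and generates pseudo-labels is not specified in the abstract and is not reproduced here.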
Related papers
- End-to-end Piano Performance-MIDI to Score Conversion with Transformers [26.900974153235456]
We present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files.
We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data.
Our method is also the first to directly predict notational details like trill marks or stem direction from performance data.
arXiv Detail & Related papers (2024-09-30T20:11:37Z)
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
We introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
- Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment [0.0]
We propose a streamlined and efficient method for generating annotated datasets for any instrument.
The onsets of the transcriptions are manually verified and the labels are accurate to within 10 ms, averaging 5 ms.
This method serves as a preliminary step towards building concrete datasets for training AMT systems for different instruments.
arXiv Detail & Related papers (2024-08-27T09:06:29Z)
- YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation [15.9795868183084]
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument.
This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription.
Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors.
arXiv Detail & Related papers (2024-07-05T19:18:33Z)
- Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion [0.0]
We propose a transcription model that requires no paired MIDI-audio data, combining pre-training on scalable synthetic data with adversarial domain confusion.
In experiments, we evaluate methods under the real-world application scenario where training datasets do not include MIDI annotations of the audio.
Our proposed method achieved competitive performance relative to established baseline methods, despite not using any real paired MIDI-audio data.
arXiv Detail & Related papers (2023-12-16T10:07:18Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instrument ensembles and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)