Unaligned Supervision For Automatic Music Transcription in The Wild
- URL: http://arxiv.org/abs/2204.13668v1
- Date: Thu, 28 Apr 2022 17:31:43 GMT
- Title: Unaligned Supervision For Automatic Music Transcription in The Wild
- Authors: Ben Maman and Amit H. Bermano
- Abstract summary: NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-instrument Automatic Music Transcription (AMT), or the decoding of a
musical recording into semantic musical content, is one of the holy grails of
Music Information Retrieval. Current AMT approaches are restricted to piano and
(some) guitar recordings, due to the difficulty of data collection. In order to
overcome data collection barriers, previous AMT approaches attempt to employ
musical scores in the form of a digitized version of the same song or piece.
The scores are typically aligned using audio features and strenuous human
intervention to generate training labels. We introduce NoteEM, a method for
simultaneously training a transcriber and aligning the scores to their
corresponding performances, in a fully-automated process. Using this unaligned
supervision scheme, complemented by pseudo-labels and pitch-shift augmentation,
our method enables training on in-the-wild recordings with unprecedented
accuracy and instrumental variety. Using only synthetic data and unaligned
supervision, we report SOTA note-level accuracy on the MAPS dataset, and large
favorable margins on cross-dataset evaluations. We also demonstrate robustness
and ease of use; we report comparable results when training on a small, easily
obtainable, self-collected dataset, and we propose alternative labeling for the
MusicNet dataset, which we show to be more accurate. Our project page is
available at https://benadar293.github.io
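The abstract describes an EM-style loop: the transcriber's current frame-level predictions are used to align the unaligned score to the performance, and the aligned score then serves as training labels, complemented by pseudo-labels and pitch-shift augmentation. The sketch below shows one way such a loop could look; it is an assumption, not the authors' implementation. The model (a frame-level transcriber outputting per-pitch logits), the piano-roll score representation, the DTW cost, and the helper names (dtw_path, align_score_to_frames, training_step) are all hypothetical.
```python
# Hypothetical sketch of unaligned supervision: align the score to the model's
# current predictions with DTW, then train on the resulting frame labels.
import numpy as np
import torch
import torch.nn.functional as F


def dtw_path(cost):
    """Plain DTW over a (score_steps x audio_frames) cost matrix; returns the path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def align_score_to_frames(frame_probs, score_roll):
    """'E-step': align an unaligned score (piano roll, score_steps x pitches) to the
    model's frame probabilities (audio_frames x pitches) and return frame labels."""
    # Simplified cost: negative log-probability of the score's active pitches per frame.
    cost = -(score_roll @ np.log(frame_probs + 1e-6).T)   # (score_steps, audio_frames)
    labels = np.zeros_like(frame_probs)
    for step, frame in dtw_path(cost):
        labels[frame] = score_roll[step]
    return labels


def training_step(model, optimizer, spectrogram, score_roll):
    """'M-step': fit the transcriber to labels from the current alignment.
    Pseudo-labels and pitch-shift augmentation (transposing the audio and the
    score together), mentioned in the abstract, would complement this loop."""
    with torch.no_grad():
        frame_probs = torch.sigmoid(model(spectrogram)).cpu().numpy()
    labels = align_score_to_frames(frame_probs, score_roll)
    logits = model(spectrogram)
    targets = torch.as_tensor(labels, dtype=torch.float32, device=logits.device)
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
Iterating alignment and training in this fashion lets the labels and the transcriber improve together; how the paper schedules these steps, filters alignments, and generates pseudo-labels is not specified in the abstract and is not reproduced here.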
Related papers
- End-to-end Piano Performance-MIDI to Score Conversion with Transformers [26.900974153235456]
We present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files.
We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data.
Our method is also the first to directly predict notational details like trill marks or stem direction from performance data.
arXiv Detail & Related papers (2024-09-30T20:11:37Z)
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
We introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
- Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment [0.0]
We propose a streamlined and efficient method for generating annotated datasets for any instrument.
The onsets of the transcriptions are manually verified and the labels are accurate to within 10 ms, averaging 5 ms.
This method serves as a preliminary step towards building concrete datasets for training AMT systems for different instruments.
arXiv Detail & Related papers (2024-08-27T09:06:29Z)
- YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation [15.9795868183084]
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument.
This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription.
Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors.
arXiv Detail & Related papers (2024-07-05T19:18:33Z)
- Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion [0.0]
We propose a transcription model that requires no paired MIDI-audio data, combining pre-training on scalable synthetic data with adversarial domain confusion.
In experiments, we evaluate methods under the real-world application scenario where training datasets do not include MIDI annotations of the audio.
Our proposed method achieved competitive performance relative to established baseline methods, despite not using any real paired MIDI-audio data.
arXiv Detail & Related papers (2023-12-16T10:07:18Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instrument ensembles and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)