Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
- URL: http://arxiv.org/abs/2511.14250v1
- Date: Tue, 18 Nov 2025 08:40:05 GMT
- Title: Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
- Authors: Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano
- Abstract summary: We introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment. CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods.
- Score: 18.777036975452422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes.
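To make the histogram idea concrete, here is a minimal sketch of one EM-style step it could correspond to: the E-step converts per-pitch note counts into frame-level pseudo-labels, and the M-step would retrain the transcriber on them. The names and the top-k selection rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of histogram-count supervision in the spirit of CountEM.
import numpy as np

def e_step(onset_probs: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Given per-frame onset posteriors (T x P) and a histogram `counts` (P,)
    of note occurrences per pitch, build binary pseudo-labels by keeping,
    for each pitch, its `counts[p]` most confident frames."""
    T, P = onset_probs.shape
    labels = np.zeros((T, P), dtype=np.float32)
    for p in range(P):
        k = int(counts[p])
        if k > 0:
            top = np.argsort(onset_probs[:, p])[-k:]  # k most confident frames
            labels[top, p] = 1.0
    return labels

# M-step (schematic): update the transcriber on these pseudo-labels, e.g.
# with binary cross-entropy, then alternate E and M until the predicted
# histogram matches the annotated counts.
```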
Related papers
- Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
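As a hedged illustration of the rule-as-function idea in the entry above, the sketch below encodes one interval rule as a Python function and uses it to generate a question with a programmatically verifiable answer; the rule choice and question format are assumptions, not the paper's.

```python
# Illustrative: a music-theory rule as a programmatic, verifiable function.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def interval_in_semitones(note_a: str, note_b: str) -> int:
    """Rule: the ascending interval between two natural notes, in semitones."""
    return (SEMITONES[note_b] - SEMITONES[note_a]) % 12

def make_question(note_a: str, note_b: str) -> tuple[str, int]:
    """Generate a question whose ground-truth answer the rule verifies."""
    q = f"How many semitones lie between {note_a} and {note_b} (ascending)?"
    return q, interval_in_semitones(note_a, note_b)

print(make_question("C", "G"))  # (..., 7)
```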
- End-to-end Piano Performance-MIDI to Score Conversion with Transformers [26.900974153235456]
We present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files.
We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data.
Our method is also the first to directly predict notational details like trill marks or stem direction from performance data.
arXiv Detail & Related papers (2024-09-30T20:11:37Z)
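The entry above hinges on a tokenized representation of symbolic music. The sketch below shows one generic way to serialize performance-MIDI notes into tokens; the (time-shift, pitch, velocity) scheme is an illustrative assumption, not the paper's vocabulary.

```python
# Illustrative tokenization of performance MIDI for a seq2seq model.
from dataclasses import dataclass

@dataclass
class Note:
    onset: float   # seconds
    pitch: int     # MIDI pitch number
    velocity: int

def tokenize(notes: list[Note], time_step: float = 0.01) -> list[str]:
    """Serialize notes as (time-shift, pitch, velocity) token triples."""
    tokens, prev = [], 0.0
    for n in sorted(notes, key=lambda n: n.onset):
        shift = round((n.onset - prev) / time_step)
        tokens += [f"SHIFT_{shift}", f"PITCH_{n.pitch}", f"VEL_{n.velocity}"]
        prev = n.onset
    return tokens

print(tokenize([Note(0.0, 60, 80), Note(0.5, 64, 72)]))
# ['SHIFT_0', 'PITCH_60', 'VEL_80', 'SHIFT_50', 'PITCH_64', 'VEL_72']
```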
- Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion [0.0]
We propose a transcription model that requires no MIDI-audio paired data, combining pre-training on scalable synthetic data with adversarial domain confusion.
In experiments, we evaluate methods under the real-world application scenario where training datasets do not include the MIDI annotation of audio.
Our proposed method achieved competitive performance relative to established baseline methods, despite not utilizing any real datasets of paired MIDI-audio.
arXiv Detail & Related papers (2023-12-16T10:07:18Z)
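Adversarial domain confusion, as used in the entry above, is commonly realized with a gradient reversal layer; the PyTorch sketch below shows that standard construction, with layer sizes and names chosen for illustration rather than taken from the paper.

```python
# Standard gradient-reversal realization of adversarial domain confusion.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip gradients flowing to the encoder

class DomainConfusion(nn.Module):
    """Classifies synthetic vs. real features; reversed gradients push the
    encoder toward domain-invariant representations."""
    def __init__(self, dim: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(GradReverse.apply(features, self.lam))
```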
- Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
arXiv Detail & Related papers (2023-09-27T15:19:05Z)
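One simple way to realize the joint estimation above is a shared encoder with two heads, one reconstructing complex spectral coefficients and one predicting pitch salience; the PyTorch sketch below is an assumption-laden illustration, not the Timbre-Trap architecture itself.

```python
# Illustrative shared autoencoder for salience + spectral reconstruction.
import torch
from torch import nn

class JointAutoencoder(nn.Module):
    def __init__(self, n_bins: int, latent: int = 128, n_pitches: int = 88):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * n_bins, latent), nn.ReLU())
        self.recon_head = nn.Linear(latent, 2 * n_bins)    # real+imag coefficients
        self.salience_head = nn.Linear(latent, n_pitches)  # pitch salience

    def forward(self, spec: torch.Tensor):
        z = self.encoder(spec)
        return self.recon_head(z), torch.sigmoid(self.salience_head(z))

# Training would sum a reconstruction loss (e.g. MSE on the complex
# coefficients) and a salience loss (e.g. binary cross-entropy).
```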
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- FretNet: Continuous-Valued Pitch Contour Streaming for Polyphonic Guitar Tablature Transcription [0.34376560669160383]
In certain applications, such as Guitar Tablature Transcription (GTT), it is more meaningful to estimate continuous-valued pitch contours.
We present a GTT formulation that estimates continuous-valued pitch contours, grouping them according to their string and fret of origin.
We demonstrate that for this task, the proposed method significantly improves the resolution of MPE and simultaneously yields tablature estimation results competitive with baseline models.
arXiv Detail & Related papers (2022-12-06T14:51:27Z)
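To ground the idea of contours grouped by "string and fret of origin" above, the sketch below maps a (string, fret) pair plus a continuous deviation (e.g., a bend) to a fractional MIDI pitch; standard tuning and the per-frame layout are illustrative assumptions, not FretNet's representation.

```python
# Illustrative grouping of continuous pitch contours by string and fret.
STANDARD_TUNING = [40, 45, 50, 55, 59, 64]  # open-string MIDI pitches E2..E4

def contour_pitch(string: int, fret: int, bend_semitones: float = 0.0) -> float:
    """Continuous pitch (fractional MIDI) of a contour sample, given its
    string/fret of origin plus a continuous deviation such as a bend."""
    return STANDARD_TUNING[string] + fret + bend_semitones

# One frame of tablature output: optional (fret, bend) per active string.
frame = {0: (3, 0.0), 3: (2, 0.4)}  # string index -> (fret, bend)
pitches = {s: contour_pitch(s, f, b) for s, (f, b) in frame.items()}
print(pitches)  # {0: 43.0, 3: 57.4}
```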
- Unaligned Supervision For Automatic Music Transcription in The Wild [1.2183405753834562]
NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations.
arXiv Detail & Related papers (2022-04-28T17:31:43Z)
- MT3: Multi-Task Multitrack Music Transcription [7.5947187537718905]
We show that a general-purpose Transformer model can perform multi-task Automatic Music Transcription (AMT).
We show this unified training framework achieves high-quality transcription results across a range of datasets.
arXiv Detail & Related papers (2021-11-04T17:19:39Z)
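A unified multi-task setup like the one above typically serializes notes from any dataset into one shared event vocabulary for a sequence-to-sequence Transformer to emit; the token scheme below is a hypothetical sketch, not MT3's released vocabulary.

```python
# Hypothetical shared MIDI-like event vocabulary for multi-task AMT.
def note_events_to_tokens(events):
    """events: list of (time_in_ticks, kind, value), kind in
    {'program', 'note_on', 'note_off'} -> flat token sequence usable
    across datasets and instrument mixes."""
    tokens = []
    for t, kind, value in sorted(events):
        tokens.append(f"TIME_{t}")
        tokens.append(f"{kind.upper()}_{value}")
    return tokens

seq = note_events_to_tokens([(0, "program", 0), (0, "note_on", 60),
                             (48, "note_off", 60)])
print(seq)  # ['TIME_0', 'NOTE_ON_60', 'TIME_0', 'PROGRAM_0', ...]
```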
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
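The alternative objectives above produce inputs that still look like full sentences. The sketch below shows one such noising function, a local word reordering whose denoising target is the original sentence; the window size and details are assumptions rather than the paper's exact setup.

```python
# Illustrative reordering noise: inputs stay full, sentence-like sequences.
import random

def shuffle_noise(words: list[str], window: int = 3, seed: int = 0) -> list[str]:
    """Permute each word at most ~`window` positions from its origin by
    jittering indices and re-sorting, keeping the input a full sentence."""
    rng = random.Random(seed)
    keys = [i + rng.uniform(0, window) for i in range(len(words))]
    return [w for _, w in sorted(zip(keys, words))]

src = "the model reconstructs the original sentence".split()
print(shuffle_noise(src))  # the denoising target is the original `src`
```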
- Temporal Calibrated Regularization for Robust Noisy Label Learning [60.90967240168525]
Deep neural networks (DNNs) exhibit great success on many tasks with the help of large-scale, well-annotated datasets.
However, labeling large-scale data is costly and error-prone, making annotation quality difficult to guarantee.
We propose Temporal Calibrated Regularization (TCR), which uses the original labels together with the model's predictions from the previous epoch.
arXiv Detail & Related papers (2020-07-01T04:48:49Z)
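A minimal reading of "using the original labels and the previous epoch's predictions together" is a soft-target interpolation; the sketch below implements that reading, with the mixing rule and weight as assumptions rather than the paper's exact formulation.

```python
# Illustrative soft-target mixing in the spirit of TCR.
import torch
import torch.nn.functional as F

def tcr_target(labels: torch.Tensor, prev_probs: torch.Tensor,
               beta: float = 0.7) -> torch.Tensor:
    """Soft target: trust noisy one-hot labels with weight beta and the
    model's previous-epoch predictions with weight (1 - beta)."""
    return beta * labels + (1.0 - beta) * prev_probs

def tcr_loss(logits: torch.Tensor, labels: torch.Tensor,
             prev_probs: torch.Tensor) -> torch.Tensor:
    target = tcr_target(labels, prev_probs)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```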
- Hard Non-Monotonic Attention for Character-Level Transduction [65.17388794270694]
We introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
arXiv Detail & Related papers (2018-08-29T20:00:20Z)
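For intuition on the exact marginalization above, consider the factorized case where each output character attends to a source position independently: the sum over all |x|^|y| non-monotonic alignments collapses into a product of per-position sums. The numpy sketch below computes that quantity on toy values; it illustrates the factorization trick, not the paper's full model.

```python
# Exact marginal over independent per-position alignments (toy sketch).
import numpy as np

def marginal_log_likelihood(align_probs: np.ndarray,
                            emit_probs: np.ndarray) -> float:
    """align_probs[j, i]: prob. that output position j attends source i
    (rows sum to 1); emit_probs[j, i]: prob. of emitting y_j given x_i.
    Returns log p(y | x) = sum_j log sum_i align[j, i] * emit[j, i]."""
    per_position = (align_probs * emit_probs).sum(axis=1)
    return float(np.log(per_position).sum())

# Two source positions, three output characters (toy numbers):
A = np.array([[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]])
E = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])
print(marginal_log_likelihood(A, E))
```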