Just Label the Repeats for In-The-Wild Audio-to-Score Alignment
- URL: http://arxiv.org/abs/2411.07428v1
- Date: Mon, 11 Nov 2024 23:05:02 GMT
- Title: Just Label the Repeats for In-The-Wild Audio-to-Score Alignment
- Authors: Irmak Bukey, Michael Feffer, Chris Donahue
- Abstract summary: We propose an efficient workflow for alignment of in-the-wild performance audio and corresponding sheet music scans (images).
We show that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work.
- Score: 7.7805314458791806
- Abstract: We propose an efficient workflow for high-quality offline alignment of in-the-wild performance audio and corresponding sheet music scans (images). Recent work on audio-to-score alignment extends dynamic time warping (DTW) to be theoretically able to handle jumps in sheet music induced by repeat signs; this method requires no human annotations, but we show that it often yields low-quality alignments. As an alternative, we propose a workflow and interface that allows users to quickly annotate jumps (by clicking on repeat signs), requiring a small amount of human supervision but yielding much higher quality alignments on average. Additionally, we refine audio and score feature representations to improve alignment quality by: (1) integrating measure detection into the score feature representation, and (2) using raw onset prediction probabilities from a music transcription model instead of piano roll. We propose an evaluation protocol for audio-to-score alignment that computes the distance between the estimated and ground truth alignment in units of measures. Under this evaluation, we find that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% to 82%).
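The core alignment machinery the abstract refers to is DTW over audio and score feature sequences. The following is a minimal generic DTW sketch in Python with NumPy, not the authors' implementation; feature extraction, repeat-sign jump handling, and measure detection are all omitted, and cosine distance is an assumed frame cost:

```python
import numpy as np

def dtw_align(score_feats, audio_feats):
    """Classic DTW: align two feature sequences of shape (frames, dims).

    Returns the accumulated-cost matrix and the optimal warping path
    as a list of (score_frame, audio_frame) index pairs.
    """
    n, m = len(score_feats), len(audio_feats)
    # Pairwise cosine distance between score and audio frames.
    s = score_feats / (np.linalg.norm(score_feats, axis=1, keepdims=True) + 1e-8)
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - s @ a.T

    # Accumulated cost with the standard step set {(1,1), (1,0), (0,1)}.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])

    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D, path[::-1]
```

Plain DTW assumes a monotonic left-to-right path, which is exactly why repeats and jumps in the score require either the extended DTW variants mentioned above or explicit jump annotations.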
Related papers
- Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval [18.333752341467083]
The biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries.
This work proposes an approximation to cross-attention scoring based on vector quantization.
We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries.
arXiv Detail & Related papers (2024-11-01T15:28:03Z)
- Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval [3.5570874721859016]
We propose a two-staged training procedure in which multiple retrieval models are first trained without estimated correspondences.
In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets.
We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting.
arXiv Detail & Related papers (2024-08-21T14:10:58Z)
- Online Symbolic Music Alignment with Offline Reinforcement Learning [0.0]
Symbolic Music Alignment is the process of matching performed MIDI notes to corresponding score notes.
In this paper, we introduce a reinforcement learning-based online symbolic music alignment technique.
The proposed model outperforms a state-of-the-art reference model of offline symbolic music alignment.
arXiv Detail & Related papers (2023-12-31T11:42:42Z)
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling from specific domains is costly and time-consuming for human annotators.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z)
- Unaligned Supervision For Automatic Music Transcription in The Wild [1.2183405753834562]
NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations.
arXiv Detail & Related papers (2022-04-28T17:31:43Z)
- Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on cross-device and cross-environment ASR, respectively.
arXiv Detail & Related papers (2021-04-15T14:36:54Z)
- Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment [13.269759433551478]
We propose a method that overcomes this limitation by using learned frame similarity for audio-to-score alignment.
We focus on offline audio-to-score alignment of piano music.
arXiv Detail & Related papers (2020-11-15T14:58:03Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
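The correlation-based representation in the last entry can be sketched generically: compute Pearson correlations between all pairs of hand-crafted features over time and keep the upper triangle as a compact fixed-length vector. This is a minimal illustration of the idea, not the paper's exact pipeline:

```python
import numpy as np

def correlation_features(feature_matrix):
    """Compact representation from pairwise feature correlations.

    feature_matrix: array of shape (n_frames, n_features) holding
    hand-crafted descriptors over time. Returns the upper-triangular
    Pearson correlations between feature pairs, flattened into a
    fixed-length vector of size n_features * (n_features - 1) / 2.
    """
    # np.corrcoef expects variables in rows, so transpose.
    corr = np.corrcoef(feature_matrix.T)
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu]
```

Because the output size depends only on the number of features, not the number of frames, the representation is compact regardless of clip length, which is consistent with the dimensionality and speed benefits the entry reports.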
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.