Multi-modal Conditional Bounding Box Regression for Music Score Following
- URL: http://arxiv.org/abs/2105.04309v1
- Date: Mon, 10 May 2021 12:43:35 GMT
- Title: Multi-modal Conditional Bounding Box Regression for Music Score Following
- Authors: Florian Henkel and Gerhard Widmer
- Abstract summary: This paper addresses the problem of sheet-image-based on-line audio-to-score alignment, also known as score following.
A conditional neural network architecture is proposed that directly predicts x,y coordinates of the matching positions in a complete score sheet image at each point in time for a given musical performance.
- Score: 7.360807642941713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of sheet-image-based on-line audio-to-score
alignment, also known as score following. Drawing inspiration from object
detection, a conditional neural network architecture is proposed that directly
predicts x,y coordinates of the matching positions in a complete score sheet
image at each point in time for a given musical performance. Experiments are
conducted on a synthetic polyphonic piano benchmark dataset and the new method
is compared to several existing approaches from the literature for
sheet-image-based score following as well as an Optical Music Recognition
baseline. The proposed approach achieves new state-of-the-art results and
furthermore significantly improves the alignment performance on a set of
real-world piano recordings by applying Impulse Responses as a data
augmentation technique.
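The abstract names two mechanisms without detailing them. The first conditions a sheet-image network on the audio so that it regresses the matching x,y position. The second augments the training audio with impulse responses. The NumPy sketch below illustrates both under explicit assumptions. Conditioning is shown as FiLM (feature-wise linear modulation), a common choice that the abstract does not confirm. The box format (normalized center, width, height), all function names and shapes, and the synthetic decaying-noise impulse response are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def film(features, audio_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Assumed conditioning (FiLM): scale and shift each channel of a
    score-image feature map with parameters predicted from an audio embedding.

    features:  (C, H, W) feature map from the sheet-image branch
    audio_emb: (D,) embedding of the most recent audio excerpt
    w_*: (D, C) projections, b_*: (C,) biases (hypothetical learned parameters)
    """
    gamma = audio_emb @ w_gamma + b_gamma  # per-channel scale, shape (C,)
    beta = audio_emb @ w_beta + b_beta     # per-channel shift, shape (C,)
    return gamma[:, None, None] * features + beta[:, None, None]


def box_center_to_pixels(box, img_w, img_h):
    """Map a normalized (cx, cy, w, h) box to the pixel (x, y) of its center."""
    cx, cy, _, _ = box
    return cx * img_w, cy * img_h


def ir_augment(audio, ir, normalize=True):
    """Simulate a recording room by convolving dry audio with an impulse response.

    The output is trimmed to the input length so that time annotations stay
    aligned (assumes the IR's direct path sits at or near sample 0).
    """
    wet = np.convolve(audio, ir, mode="full")[: len(audio)]
    if normalize:
        peak_dry = np.max(np.abs(audio))
        peak_wet = np.max(np.abs(wet))
        if peak_wet > 0:
            wet = wet * (peak_dry / peak_wet)  # restore the original peak level
    return wet


# Usage with random stand-ins for learned weights and audio
rng = np.random.default_rng(0)
C, H, W, D = 8, 16, 32, 4
feats = rng.standard_normal((C, H, W))
emb = rng.standard_normal(D)
modulated = film(feats, emb,
                 rng.standard_normal((D, C)), np.ones(C),
                 rng.standard_normal((D, C)), np.zeros(C))
print(modulated.shape)  # (8, 16, 32)
print(box_center_to_pixels((0.5, 0.25, 0.1, 0.05), 800, 1200))  # (400.0, 300.0)

sr = 16000
dry = rng.standard_normal(sr)  # one second of noise as a placeholder signal
# Hypothetical synthetic IR: exponentially decaying noise (a measured IR would be used in practice)
t = np.arange(sr // 4) / sr
ir = rng.standard_normal(t.size) * np.exp(-t / 0.1)
wet = ir_augment(dry, ir)
print(wet.shape)  # (16000,)
```

Trimming the convolved signal to its input length keeps existing frame-level annotations valid. An impulse response with a long pre-delay would first need its leading silence removed to keep that property.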
Related papers
- End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
Training follows a two-stage scheme: the model is first pre-trained on synthetic audio produced by an expressive performance rendering system and then fine-tuned on recordings of human performances.
Date: 2024-05-22T10:52:04Z
- Online Symbolic Music Alignment with Offline Reinforcement Learning [0.0]
Symbolic Music Alignment is the process of matching performed MIDI notes to corresponding score notes.
In this paper, we introduce a reinforcement learning-based online symbolic music alignment technique.
The proposed model outperforms a state-of-the-art reference model of offline symbolic music alignment.
Date: 2023-12-31T11:42:42Z
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
Date: 2023-03-21T18:03:14Z
- Music Enhancement via Image Translation and Vocoding [14.356705444361832]
This paper presents a deep learning approach to enhance low-quality music recordings.
We combine an image-to-image translation model for manipulating audio in its mel-spectrogram representation and a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms.
We find that this approach to music enhancement outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms.
Date: 2022-04-28T05:00:07Z
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) label noise.
Date: 2022-02-04T15:46:27Z
- Contextual Similarity Aggregation with Self-attention for Visual Re-ranking [96.55393026011811]
We propose a visual re-ranking method based on contextual similarity aggregation with self-attention.
We conduct comprehensive experiments on four benchmark datasets to demonstrate the generality and effectiveness of our proposed visual re-ranking method.
Date: 2021-10-26T06:20:31Z
- Structure-Aware Audio-to-Score Alignment using Progressively Dilated Convolutional Neural Networks [8.669338893753885]
The identification of structural differences between a music performance and the score is a challenging yet integral step of audio-to-score alignment.
We present a novel method to detect such differences using progressively dilated convolutional neural networks.
Date: 2021-01-31T05:14:58Z
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods that incorporate score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
Date: 2020-08-01T07:46:24Z
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
Date: 2020-07-21T04:03:22Z
- Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution [70.78655569298923]
As an attribution method for deep neural network models, Integrated Gradients is simple to implement.
However, it produces noisy explanations, which make the attributions harder to interpret.
The SmoothGrad technique was proposed to address this noise by smoothing the attribution maps of any gradient-based attribution method.
Date: 2020-04-22T10:43:19Z
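The Integrated Gradients entry contrasts two existing attribution methods. The minimal NumPy sketch below implements both, using a toy function with a known gradient in place of a network. It is not the SmoothTaylor method that paper proposes. The function names and the parameters `steps`, `sigma` and `n_samples` are illustrative assumptions. Integrated Gradients is approximated with a midpoint Riemann sum of the standard path integral, and SmoothGrad averages gradients over Gaussian-perturbed inputs.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients:
    IG_i(x) = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a * (x - x')) da
    """
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps


def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, rng=None):
    """SmoothGrad: average the gradient over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng() if rng is None else rng
    grads = [grad_fn(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(n_samples)]
    return np.mean(grads, axis=0)


def F(x):
    """Toy model standing in for a network: F(x) = sum(x**2)."""
    return np.sum(x ** 2)


def grad_F(x):
    """Analytic gradient of the toy model."""
    return 2 * x


x = np.array([1.0, -2.0, 3.0])
base = np.zeros_like(x)
ig = integrated_gradients(grad_F, x, base)
print(ig)  # [1. 4. 9.]
print(np.isclose(ig.sum(), F(x) - F(base)))  # True
sg = smoothgrad(grad_F, x, sigma=0.5, n_samples=2000, rng=np.random.default_rng(0))
```

For this toy model with a zero baseline, the attributions equal x**2 and sum to F(x) - F(baseline). This is the completeness property of Integrated Gradients. The midpoint rule is exact here because the integrand is linear in the path parameter.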