Context-aware Automatic Music Transcription
- URL: http://arxiv.org/abs/2203.16294v1
- Date: Wed, 30 Mar 2022 13:36:17 GMT
- Title: Context-aware Automatic Music Transcription
- Authors: Federico Simonetta, Stavros Ntalampiras, Federico Avanzini
- Abstract summary: This paper presents an Automatic Music Transcription system that incorporates context-related information.
Motivated by state-of-the-art psychological research, we propose a methodology that boosts the accuracy of AMT systems.
- Score: 10.957528713294874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an Automatic Music Transcription (AMT) system that incorporates context-related information. Motivated by state-of-the-art psychological research, we propose a methodology that boosts the accuracy of AMT systems by modeling the adaptations that performers apply to successfully convey their interpretation in any acoustical context. In this work, we show that exploiting knowledge of the source acoustical context reduces the error in the inference of MIDI velocity. The proposed model structure first extracts the interpretation features and then applies the modeled performer adaptations. Interestingly, the methodology extends in a straightforward way, since only a modest additional effort is required to train fully context-aware AMT models.
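As a rough illustration of the two-stage structure described in the abstract, the sketch below separates a context-independent "interpretation" feature extractor from a small, context-specific adaptation stage that outputs MIDI velocity. It is a minimal sketch in PyTorch; the module names, feature dimensions, and the choice of one adaptation head per acoustical context are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: stage 1 extracts interpretation features, stage 2 applies a
# context-specific adaptation to predict MIDI velocity. All names and sizes
# are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class ContextAwareVelocityModel(nn.Module):
    def __init__(self, n_features: int, n_contexts: int, hidden: int = 64):
        super().__init__()
        # Stage 1: context-independent interpretation features per note.
        self.interpretation = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Stage 2: one small adaptation head per known acoustical context
        # (e.g. room/instrument combination), mapping features to velocity.
        self.adaptation = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(n_contexts)]
        )

    def forward(self, note_features: torch.Tensor, context_id: int) -> torch.Tensor:
        feats = self.interpretation(note_features)      # (batch, hidden)
        velocity = self.adaptation[context_id](feats)   # (batch, 1)
        return velocity.squeeze(-1)                     # predicted MIDI velocity

# Example: 8 notes described by 16 acoustic features, recorded in context 2.
model = ContextAwareVelocityModel(n_features=16, n_contexts=4)
pred = model(torch.randn(8, 16), context_id=2)
print(pred.shape)  # torch.Size([8])
```

Under this reading, extending the system to a new acoustical context would mainly require training the small adaptation stage, which is consistent with the abstract's claim that fully context-aware AMT models can be obtained with little extra effort.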
Related papers
- Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation [3.8570045844185237]
We present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems (a generic sketch of this encoder-predictor setup appears after this list).
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix.
arXiv Detail & Related papers (2024-08-05T14:34:40Z) - Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey [2.4895506645605123]
This systematic review accentuates the pivotal role of Automatic Music Transcription (AMT) in music signal analysis.
Despite notable advancements, AMT systems have yet to match the accuracy of human experts.
By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems.
arXiv Detail & Related papers (2024-06-20T03:48:15Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - A Perceptual Measure for Evaluating the Resynthesis of Automatic Music Transcriptions [10.957528713294874]
This study focuses on the perception of music performances when contextual factors, such as room acoustics and instrument, change.
We propose to distinguish the concept of "performance" from that of "interpretation", which expresses the "artistic intention".
arXiv Detail & Related papers (2022-02-24T18:09:22Z) - When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems [42.05399301143457]
We introduce a light-weight contextual spelling correction model to correct context-related recognition errors.
Experiments show that the model improves baseline ASR model performance, with about a 50% relative word error rate reduction.
The model also shows excellent performance for out-of-vocabulary terms not seen during training.
arXiv Detail & Related papers (2021-08-17T08:14:37Z) - Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z) - Hybrid Autoregressive Transducer (hat) [11.70833387055716]
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model.
It is a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems.
We evaluate our proposed model on a large-scale voice search task.
arXiv Detail & Related papers (2020-03-12T20:47:06Z)
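For the joint-embedding predictive setup mentioned in the Stem-JEPA entry above, the following is a minimal, generic sketch of the idea (an encoder embeds the stems of a mix, and a predictor is trained to predict the embedding of a held-out, compatible stem). It is written in PyTorch; the shapes, loss, and single training step are illustrative assumptions, not the paper's actual architecture.

```python
# Generic joint-embedding predictive sketch: predict the embedding of a
# held-out stem from the embeddings of the stems kept in the mix.
# All names, shapes and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
predictor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

context_stems = torch.randn(16, 128)  # features of the stems kept in the mix
target_stem = torch.randn(16, 128)    # features of the held-out, compatible stem

# One joint training step: predict the target-stem embedding from the context.
pred_embedding = predictor(encoder(context_stems))
with torch.no_grad():
    # Target embedding; JEPA-style setups often use a frozen or EMA encoder here.
    target_embedding = encoder(target_stem)
loss = 1 - F.cosine_similarity(pred_embedding, target_embedding, dim=-1).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At retrieval time, the stem whose embedding is closest to the prediction
# would be returned as the "missing stem" for the mix.
```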
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.