Automatic Music Transcription using Convolutional Neural Networks and Constant-Q transform
- URL: http://arxiv.org/abs/2505.04451v1
- Date: Wed, 07 May 2025 14:20:43 GMT
- Title: Automatic Music Transcription using Convolutional Neural Networks and Constant-Q transform
- Authors: Yohannis Telila, Tommaso Cucinotta, Davide Bacciu
- Abstract summary: We design a processing pipeline that can transform classical piano audio files in .wav format into a music score representation. The features from the audio signals are extracted using the constant-Q transform, and the resulting coefficients are used as an input to the convolutional neural network (CNN) model.
- Score: 14.72084645157747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic music transcription (AMT) is the problem of analyzing an audio recording of a musical piece and detecting notes that are being played. AMT is a challenging problem, particularly when it comes to polyphonic music. The goal of AMT is to produce a score representation of a music piece, by analyzing a sound signal containing multiple notes played simultaneously. In this work, we design a processing pipeline that can transform classical piano audio files in .wav format into a music score representation. The features from the audio signals are extracted using the constant-Q transform, and the resulting coefficients are used as an input to the convolutional neural network (CNN) model.
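The sketch below illustrates, at a high level, the kind of pipeline the abstract describes: a .wav recording is converted to constant-Q transform (CQT) coefficients, which are cut into fixed-width frames and classified by a small CNN over the 88 piano pitches. It is a minimal illustration assuming librosa and PyTorch; the layer sizes, the 36-bins-per-octave resolution, the 7-frame context window, and the `extract_cqt`/`FrameCNN` names are placeholders, not the configuration reported in the paper.

```python
# Minimal sketch (not the authors' code): CQT feature extraction followed by a
# small CNN that predicts, per frame, which of the 88 piano keys are sounding.
# All hyperparameters below are illustrative assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_cqt(wav_path, sr=22050, hop_length=512, bins_per_octave=36, n_octaves=7):
    """Load audio and return a log-magnitude CQT of shape (n_bins, n_frames)."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      n_bins=bins_per_octave * n_octaves,
                      bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(cqt), ref=np.max)

class FrameCNN(nn.Module):
    """Per-frame multi-label classifier over the 88 piano pitches."""
    def __init__(self, n_bins=252, context=7, n_pitches=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (n_bins // 4) * context, 256), nn.ReLU(),
            nn.Linear(256, n_pitches),                # logits; pair with BCEWithLogitsLoss
        )

    def forward(self, x):                             # x: (batch, 1, n_bins, context)
        return self.fc(self.conv(x))

# Usage sketch: slide a 7-frame context window over the CQT and classify each frame.
# cqt = extract_cqt("piano.wav")                                   # (252, T)
# window = torch.tensor(cqt[:, t - 3:t + 4])[None, None].float()   # (1, 1, 252, 7)
# logits = FrameCNN()(window)                                      # (1, 88) pitch activations
```

In a complete system, the per-frame pitch activations would still have to be thresholded and grouped into note events before being rendered as a score (e.g., via MIDI); that post-processing stage is omitted from the sketch.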
Related papers
- RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection [17.45655063331199]
RUMAA is a transformer-based framework for music performance analysis. It unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner.
arXiv Detail & Related papers (2025-07-16T12:13:13Z)
- Source Separation & Automatic Transcription for Music [0.0]
We create an end-to-end pipeline that allows for an initial music audio mixture to be separated into instrument stems, converted into MIDI files, and transcribed into sheet music for each component instrument.
arXiv Detail & Related papers (2024-12-09T17:49:14Z)
- YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation [15.9795868183084]
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument.
This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription.
Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors.
arXiv Detail & Related papers (2024-07-05T19:18:33Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instrument ensembles and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies comparable to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- Polyphonic pitch detection with convolutional recurrent neural networks [0.0]
In this work, we outline an online polyphonic pitch detection system that streams audio to MIDI using ConvLSTMs.
Our system achieves state-of-the-art results on the 2007 MIREX multi-F0 development set, with an F-measure of 83% on the bassoon, clarinet, flute, horn and oboe ensemble recording.
arXiv Detail & Related papers (2022-02-04T12:58:02Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.