Related papers: Audio-to-Score Conversion Model Based on Whisper methodology

Audio-to-Score Conversion Model Based on Whisper methodology

URL: http://arxiv.org/abs/2410.17209v1
Date: Tue, 22 Oct 2024 17:31:37 GMT
Title: Audio-to-Score Conversion Model Based on Whisper methodology
Authors: Hongyao Zhang, Bohang Sun,
Abstract summary: This thesis innovatively introduces the "Orpheus' Score", a custom notation system that converts music information into tokens. Experiments show that compared to traditional algorithms, the model has significantly improved accuracy and performance.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This thesis develops a Transformer model based on Whisper, which extracts melodies and chords from music audio and records them into ABC notation. A comprehensive data processing workflow is customized for ABC notation, including data cleansing, formatting, and conversion, and a mutation mechanism is implemented to increase the diversity and quality of training data. This thesis innovatively introduces the "Orpheus' Score", a custom notation system that converts music information into tokens, designs a custom vocabulary library, and trains a corresponding custom tokenizer. Experiments show that compared to traditional algorithms, the model has significantly improved accuracy and performance. While providing a convenient audio-to-score tool for music enthusiasts, this work also provides new ideas and tools for research in music information processing.

Related papers

EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models.<n>Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z)
Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation [49.062766449989525]
Generative models of music audio are typically used to generate output based solely on a text prompt or melody.<n>Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model.
arXiv Detail & Related papers (2025-07-07T10:46:07Z)
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment [0.0]
Music102 enhances chord progression accompaniment through a D12-equivariant transformer. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain.
arXiv Detail & Related papers (2024-10-23T03:11:01Z)
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions. Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music. To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation) Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music. We introduce a framework known as GETMusic, with GET'' standing for GEnerate music Tracks'' GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time. Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations.
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types. We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input. In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
Transfer of knowledge among instruments in automatic music transcription [2.0305676256390934]
This work shows how to employ easily generated synthesized audio data produced by software synthesizers to train a universal model. It is a good base for further transfer learning to quickly adapt transcription model for other instruments.
arXiv Detail & Related papers (2023-04-30T08:37:41Z)
Melody transcription via generative pre-training [86.08508957229348]
Key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio. We derive a new dataset containing $50$ hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions. We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation. Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z)
Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural mean to imbue variational autoencoders with context information. We introduce the first Music Adversarial Autoencoder (MusAE) Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.