MusicRL: Aligning Music Generation to Human Preferences
- URL: http://arxiv.org/abs/2402.04229v1
- Date: Tue, 6 Feb 2024 18:36:52 GMT
- Title: MusicRL: Aligning Music Generation to Human Preferences
- Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej
Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem,
Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour and
Andrea Agostinelli
- Abstract summary: MusicRL is the first music generation system finetuned from human feedback.
We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences.
We train MusicRL-U, the first text-to-music model that incorporates human feedback at scale.
- Score: 62.44903326718772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose MusicRL, the first music generation system finetuned from human
feedback. Appreciation of text-to-music models is particularly subjective since
the concept of musicality as well as the specific intention behind a caption
are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a
retro guitar solo or a techno pop beat). Not only does this make supervised
training of such models challenging, but it also calls for integrating
continuous human feedback in their post-deployment finetuning. MusicRL is a
pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete
audio tokens finetuned with reinforcement learning to maximise sequence-level
rewards. We design reward functions specifically related to text adherence and
audio quality with the help of selected raters, and use them to finetune
MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial
dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning
from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model
that incorporates human feedback at scale. Human evaluations show that both
MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU
combines the two approaches and results in the best model according to human
raters. Ablation studies shed light on the musical attributes influencing human
preferences, indicating that text adherence and quality account for only part
of them. This underscores the prevalence of subjectivity in musical appreciation
and calls for further involvement of human listeners in the finetuning of music
generation models.
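The abstract above describes collecting 300,000 pairwise preferences and learning a reward signal from them. As a minimal sketch of how such a preference-based reward model is typically trained (a Bradley-Terry objective over sequence-level scores), the hypothetical PyTorch code below uses a small GRU scorer as a stand-in for a real audio-token encoder; all names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: training a reward model on pairwise preferences
# (Bradley-Terry objective). The GRU scorer is a stand-in for a real
# audio-token encoder; nothing here is the authors' actual implementation.
import torch
import torch.nn.functional as F

class PreferenceRewardModel(torch.nn.Module):
    """Maps a sequence of discrete audio tokens to a scalar reward."""
    def __init__(self, vocab_size: int = 1024, dim: int = 256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.encoder = torch.nn.GRU(dim, dim, batch_first=True)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embed(tokens))   # (batch, seq, dim)
        return self.head(h[:, -1]).squeeze(-1)    # (batch,) scalar scores

def preference_loss(rm, preferred, rejected):
    """Bradley-Terry loss: push r(preferred) above r(rejected)."""
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()

if __name__ == "__main__":
    rm = PreferenceRewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
    # Random stand-ins for real (preferred, rejected) token sequences.
    preferred = torch.randint(0, 1024, (8, 64))
    rejected = torch.randint(0, 1024, (8, 64))
    loss = preference_loss(rm, preferred, rejected)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"preference loss: {loss.item():.4f}")
```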
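MusicRL then finetunes the pretrained autoregressive model with reinforcement learning to maximise sequence-level rewards. One common recipe for this step, KL-regularized REINFORCE against a frozen copy of the pretrained model, is sketched below; the tiny GRU language model and the random reward stub are hypothetical stand-ins, and the exact algorithm the paper uses may differ.

```python
# Hypothetical sketch: KL-regularized policy-gradient finetuning of an
# autoregressive token model against a sequence-level reward.
# TinyAutoregressiveLM and the random reward stub are illustrative stand-ins.
import torch
import torch.nn.functional as F

class TinyAutoregressiveLM(torch.nn.Module):
    def __init__(self, vocab_size: int = 1024, dim: int = 256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.gru = torch.nn.GRU(dim, dim, batch_first=True)
        self.out = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.gru(self.embed(tokens))
        return self.out(h)                        # next-token logits

def sequence_logprob(model, tokens):
    """Sum of per-token log-probs of `tokens` under `model`."""
    logits = model(tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok_logp.sum(-1)                       # (batch,)

def rl_finetune_step(policy, ref_policy, reward_fn, tokens, opt, kl_coef=0.1):
    """One REINFORCE step on sampled sequences, with a KL penalty that
    keeps the policy close to the frozen pretrained reference."""
    logp = sequence_logprob(policy, tokens)
    with torch.no_grad():
        ref_logp = sequence_logprob(ref_policy, tokens)
        reward = reward_fn(tokens) - kl_coef * (logp.detach() - ref_logp)
    loss = -(reward * logp).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    policy = TinyAutoregressiveLM()
    ref = TinyAutoregressiveLM()
    ref.load_state_dict(policy.state_dict())      # frozen pretrained copy
    for p in ref.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-5)
    # In practice these would be sampled from the policy given a caption;
    # random tokens and a random reward stub keep the sketch self-contained.
    samples = torch.randint(0, 1024, (8, 64))
    reward_stub = lambda t: torch.randn(t.size(0))
    print(rl_finetune_step(policy, ref, reward_stub, samples, opt))
```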
Related papers
- Adversarial-MidiBERT: Symbolic Music Understanding Model Based on Unbias Pre-training and Mask Fine-tuning [2.61072980439312]
We propose Adversarial-MidiBERT, a symbolic music understanding model based on Bidirectional Encoder Representations from Transformers (BERT).
We introduce an unbiased pre-training method based on adversarial learning to minimize the participation of tokens that lead to biases during training. Furthermore, we propose a mask fine-tuning method to narrow the data gap between pre-training and fine-tuning, which can help the model converge faster and perform better.
arXiv Detail & Related papers (2024-07-11T08:54:38Z)
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model [6.476298483207895]
We develop a methodology to identify similar pieces of music audio in a manner that is useful for understanding training data attribution.
We compare the effect of applying CLMR and CLAP embeddings to similarity measurement in a set of 5 million audio clips used to train VampNet.
This work is a step toward incorporating automated influence attribution into generative modeling, which promises to let model creators and users move from ignorant appropriation to informed creation.
arXiv Detail & Related papers (2024-01-25T22:20:42Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- Flat latent manifolds for music improvisation between human and machine [9.571383193449648]
We consider a music-generating algorithm as a counterpart to a human musician, in a setting where reciprocal improvisation is to lead to new experiences.
In the learned model, we generate novel musical sequences by quantification in latent space.
We provide empirical evidence for our method via a set of experiments on music and we deploy our model for an interactive jam session with a professional drummer.
arXiv Detail & Related papers (2022-02-23T09:00:17Z)
- RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning [69.20460466735852]
This paper presents a deep reinforcement learning algorithm for online accompaniment generation.
The proposed algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part.
arXiv Detail & Related papers (2020-02-08T03:53:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.