Benchmarking Music Generation Models and Metrics via Human Preference Studies
- URL: http://arxiv.org/abs/2506.19085v1
- Date: Mon, 23 Jun 2025 20:01:29 GMT
- Title: Benchmarking Music Generation Models and Metrics via Human Preference Studies
- Authors: Florian Grötschla, Ahmet Solak, Luca A. Lanzendörfer, Roger Wattenhofer
- Abstract summary: We generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference.
- Score: 18.95453617434051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
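Rankings like the one described in the abstract are typically derived from pairwise comparisons with a preference model such as Bradley-Terry. The paper's listing does not specify its aggregation method, so the following is a hypothetical sketch of that general technique; the win counts are illustrative, not from the study.

```python
# Hypothetical sketch: ranking generation models from pairwise human
# preferences with a Bradley-Terry model. Not the paper's actual method;
# win counts below are made up for illustration.

def bradley_terry(wins, n_models, iters=200):
    """Estimate Bradley-Terry strengths from a win matrix.

    wins[i][j] = number of comparisons in which model i beat model j.
    Uses the standard MM (minorization-maximization) update.
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            den = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models) if j != i
            )
            new_p.append(num / den if den > 0 else p[i])
        s = sum(new_p)
        p = [x * n_models / s for x in new_p]  # normalize for stability
    return p

# Toy example: 3 models with illustrative pairwise win counts.
wins = [
    [0, 7, 9],   # model 0 beat model 1 seven times, model 2 nine times
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins, 3)
ranking = sorted(range(3), key=lambda i: -strengths[i])
print(ranking)  # model indices from most to least preferred
```

With a comparison graph where every pair of models is compared, the MM update converges to the maximum-likelihood strengths, and sorting by strength yields the preference ranking.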
Related papers
- Universal Music Representations? Evaluating Foundation Models on World Music Corpora [65.72891334156706]
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora.
arXiv Detail & Related papers (2025-06-20T15:06:44Z)
- Aligning Text-to-Music Evaluation with Human Preferences [63.08368388389259]
We study the design space of reference-based divergence metrics for evaluating generative acoustic text-to-music (TTM) models. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model.
arXiv Detail & Related papers (2025-03-20T19:31:04Z)
- Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound [46.7144966835279]
This paper addresses the need for automated systems capable of predicting audio aesthetics without human intervention. We propose new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality.
arXiv Detail & Related papers (2025-02-07T18:15:57Z)
- MusicRL: Aligning Music Generation to Human Preferences [62.44903326718772]
MusicRL is the first music generation system finetuned from human feedback.
We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences.
We train MusicRL-U, the first text-to-music model that incorporates human feedback at scale.
arXiv Detail & Related papers (2024-02-06T18:36:52Z)
- Investigating Personalization Methods in Text to Music Generation [21.71190700761388]
Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods.
For evaluation, we construct a novel dataset with prompts and music clips.
Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody.
arXiv Detail & Related papers (2023-09-20T08:36:34Z)
- A Comprehensive Survey for Evaluation Methodologies of AI-Generated Music [14.453416870193072]
This study aims to comprehensively evaluate the subjective, objective, and combined methodologies for assessing AI-generated music.
Ultimately, this study provides a valuable reference for unifying generative AI in the field of music evaluation.
arXiv Detail & Related papers (2023-08-26T02:44:33Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations from all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- An Order-Complexity Model for Aesthetic Quality Assessment of Symbolic Homophony Music Scores [8.751312368054016]
The quality of music scores generated by AI remains relatively poor compared with those created by human composers.
This paper proposes an objective quantitative evaluation method for homophony music score aesthetic quality assessment.
arXiv Detail & Related papers (2023-01-14T12:30:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.