SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment
- URL: http://arxiv.org/abs/2510.01812v2
- Date: Fri, 03 Oct 2025 05:07:06 GMT
- Title: SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment
- Authors: Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin
- Abstract summary: We introduce SingMOS-Pro, a dataset for automatic singing quality assessment. SingMOS-Pro expands the annotations of the newly added portion to cover lyrics, melody, and overall quality. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets.
- Score: 52.656281676548645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singing voice generation is progressing rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro expands the annotations of the newly added portion to include lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent advances. Each clip receives at least five ratings from professional annotators, ensuring reliability and consistency. Furthermore, we explore how to effectively utilize MOS data annotated under different standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset can be accessed at https://huggingface.co/datasets/TangRain/SingMOS-Pro.
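The abstract points to the Hugging Face release and to benchmarking MOS predictors against the annotations. The sketch below shows one plausible way to load the dataset and score a baseline predictor with the correlation metrics commonly reported for MOS prediction; the split name ("train"), the label column ("mos"), and the `predict_mos` stub are illustrative assumptions, not the dataset's documented schema.

```python
# Minimal sketch (not the authors' code): load SingMOS-Pro from the Hugging Face Hub
# and evaluate a placeholder MOS predictor with utterance-level metrics.
import numpy as np
from datasets import load_dataset          # pip install datasets
from scipy import stats                    # pip install scipy

# Split name is an assumption; check the dataset card for the actual splits.
ds = load_dataset("TangRain/SingMOS-Pro", split="train")

rng = np.random.default_rng(0)

def predict_mos(example):
    """Placeholder predictor: a real system would score the example's audio."""
    return rng.uniform(1.0, 5.0)           # chance-level baseline on the 1-5 scale

labels = np.array([ex["mos"] for ex in ds])        # "mos" column name is assumed
preds = np.array([predict_mos(ex) for ex in ds])

# Metrics commonly reported for MOS prediction: MSE, Pearson (LCC),
# Spearman (SRCC), and Kendall (KTAU) correlations.
mse = float(np.mean((preds - labels) ** 2))
lcc, _ = stats.pearsonr(preds, labels)
srcc, _ = stats.spearmanr(preds, labels)
ktau, _ = stats.kendalltau(preds, labels)
print(f"MSE={mse:.3f}  LCC={lcc:.3f}  SRCC={srcc:.3f}  KTAU={ktau:.3f}")
```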
Related papers
- Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model [28.382926227472026]
First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique.
Second, addressing the memory limitations of Multimodal Large Language Models (MLLMs) in analyzing full-length songs, we propose VocalVerse.
Third, to address the shortcomings of automated metrics, we establish the H-TPR benchmark, which evaluates a model's ability to generate perceptually valid rankings.
arXiv Detail & Related papers (2025-12-07T21:06:16Z)
- Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs [0.0]
We propose a contract-driven QoE auditing framework.
We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems).
arXiv Detail & Related papers (2025-12-04T14:08:17Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.
AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.
We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction [1.8862680628828246]
Evaluation of voice synthesis can be done using objective or subjective metrics.
Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS) is a small, end-to-end, highly generalized and scalable model for predicting MOS on a 5-point scale.
We stack sequences of convolutions to obtain latent features of the audio samples, achieving the best results in terms of mean squared error (MSE), Linear Concordance Correlation coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
arXiv Detail & Related papers (2025-06-02T10:45:40Z)
- Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation [25.476596046882854]
Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM).
We propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment.
arXiv Detail & Related papers (2024-09-25T05:44:44Z)
- GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks [52.30565320125514]
GTSinger is a large, global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores.
We collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset.
We conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion.
arXiv Detail & Related papers (2024-09-20T18:18:14Z)
- The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models [63.53530525014976]
ZIQI-Eval is a benchmark specifically designed to evaluate the music-related capabilities of large language models (LLMs).
ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries.
Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities.
arXiv Detail & Related papers (2024-06-22T16:24:42Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations from all open-source pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose a general framework for MOS prediction based on pairwise comparison (MOSPC).
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
arXiv Detail & Related papers (2023-06-18T07:38:17Z)
- Speech MOS multi-task learning and rater bias correction [10.123346550775471]
Mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample.
Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model.
arXiv Detail & Related papers (2022-12-04T20:06:27Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)