From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling
- URL: http://arxiv.org/abs/2510.00743v1
- Date: Wed, 01 Oct 2025 10:27:51 GMT
- Title: From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling
- Authors: Yifei Cao, Changhao Jiang, Jiabao Zhuang, Jiajun Sun, Ming Zhang, Zhiheng Xi, Hui Li, Shihan Dou, Yuran Wang, Yunke Zhang, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: We introduce a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling. Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases.
- Score: 66.22134521383909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
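To make the abstract's two central ideas concrete, here is a minimal sketch of (a) reformulating MOS-rated utterances into preference pairs and (b) an MOS-difference-based reward that scales with pair difficulty. All names (`Utterance`, `build_preference_pairs`, `mos_aware_reward`) and the exponential scaling form are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from itertools import combinations
import math

@dataclass
class Utterance:
    audio_path: str  # path to the waveform file
    mos: float       # human Mean Opinion Score, typically on a 1-5 scale

def build_preference_pairs(utterances, min_gap=0.0):
    """Turn MOS-rated utterances into (chosen, rejected) preference pairs.

    Any two utterances whose MOS values differ by more than `min_gap`
    yield one pair, with the higher-rated sample as `chosen`.
    (A hypothetical reconstruction of the benchmark's pairing idea.)
    """
    pairs = []
    for a, b in combinations(utterances, 2):
        gap = abs(a.mos - b.mos)
        if gap <= min_gap:
            continue  # near-ties; the paper identifies these as the hardest cases
        chosen, rejected = (a, b) if a.mos > b.mos else (b, a)
        pairs.append((chosen, rejected, gap))
    return pairs

def mos_aware_reward(correct: bool, mos_gap: float, alpha: float = 1.0) -> float:
    """Difficulty-scaled reward: small MOS gaps earn larger rewards.

    exp(-alpha * gap) is close to 1 for near-ties and decays for easy
    pairs, so training pressure concentrates on fine-grained distinctions.
    This functional form is an assumption, not the paper's exact definition.
    """
    scale = math.exp(-alpha * mos_gap)
    return scale if correct else -scale
```

Under this sketch, a pair with a 0.1 MOS gap judged correctly would earn roughly `exp(-0.1) ≈ 0.90`, while an easy pair with a 2.0 gap would earn only `exp(-2.0) ≈ 0.14`, matching the abstract's description of rewards that adapt to sample-pair difficulty.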
Related papers
- Learning Ordinal Probabilistic Reward from Preferences [25.069054134899744]
We introduce a novel reward modeling paradigm: the Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution over the quality of each response. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). (A toy sketch of this distribution-over-quality idea appears after this list.)
arXiv Detail & Related papers (2026-02-13T06:43:02Z)
- Reward Model Interpretability via Optimal and Pessimal Tokens [4.951383975460995]
Reward modeling has emerged as a crucial component in aligning large language models with human values. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training.
arXiv Detail & Related papers (2025-06-08T23:56:58Z)
- Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
- Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality [45.34252727738116]
Mean Opinion Score (MOS) is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
arXiv Detail & Related papers (2025-05-26T08:52:02Z)
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [37.97757796124621]
RM-Bench is a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases.
We evaluate nearly 40 reward models on RM-Bench and find that even state-of-the-art models achieve an average performance of only 46.6%.
arXiv Detail & Related papers (2024-10-21T16:48:26Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E³), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of an MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
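Returning to the Probabilistic Reward Model entry above: here is a toy PyTorch sketch of a reward head that outputs a distribution over quality levels rather than a single scalar. The ordinal binning, architecture, and class name are assumptions made for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn

class ProbabilisticRewardHead(nn.Module):
    """Toy head that predicts a full distribution over response quality,
    instead of a deterministic scalar reward (the idea described in the
    PRM entry above; all design choices here are illustrative)."""

    def __init__(self, hidden_dim: int, num_levels: int = 5):
        super().__init__()
        # One logit per ordinal quality level, e.g. MOS-like levels 1..5.
        self.logits = nn.Linear(hidden_dim, num_levels)
        self.register_buffer("levels", torch.arange(1, num_levels + 1).float())

    def forward(self, h: torch.Tensor):
        probs = self.logits(h).softmax(dim=-1)    # distribution over quality levels
        mean = (probs * self.levels).sum(dim=-1)  # expected quality (point estimate)
        var = (probs * (self.levels - mean.unsqueeze(-1)) ** 2).sum(dim=-1)
        return probs, mean, var                   # variance gives an uncertainty signal
```

Treating reward as a random variable this way yields both a point estimate (the mean) and an uncertainty estimate (the variance) from a single forward pass, which is what distinguishes the probabilistic paradigm from the scalar reward models benchmarked in MOS-RMBench.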