Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs
- URL: http://arxiv.org/abs/2512.04827v1
- Date: Thu, 04 Dec 2025 14:08:17 GMT
- Title: Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs
- Authors: Wenzhang Du
- Abstract summary: We propose a contract-driven QoE auditing framework. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subjective mean opinion scores (MOS) remain the de-facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified "simple" contracts can be harder to learn than richer but better aligned contract sets.
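The contract-level satisfaction vector Q(G, C) described in the abstract can be sketched as follows. This is a minimal illustration assuming each contract is a predicate over a raw rating vector; the contract names and thresholds below are hypothetical, not the paper's actual contract set.

```python
# Hedged sketch of Q(G, C): for each human-interpretable contract in C,
# the fraction of utterances in service graph G whose raw rating vector
# satisfies it. Contracts here are illustrative assumptions.

def contract_satisfaction(rating_vectors, contracts):
    """rating_vectors: list of per-utterance raw opinion-score lists.
    contracts: dict mapping contract name -> predicate over one vector.
    Returns Q(G, C) as a dict of per-contract satisfaction rates."""
    n = len(rating_vectors)
    return {name: sum(pred(r) for r in rating_vectors) / n
            for name, pred in contracts.items()}

# Illustrative contracts: "a majority of raters scored at least 3"
# and "no rater scored below 2".
contracts = {
    "majority_ge_3": lambda r: sum(s >= 3 for s in r) / len(r) >= 0.5,
    "none_below_2": lambda r: min(r) >= 2,
}

ratings = [[4, 3, 5], [2, 2, 3], [1, 4, 4]]
q = contract_satisfaction(ratings, contracts)
print(q)  # each contract is satisfied by 2 of the 3 utterances
```

Under this reading, classical MOS regression corresponds to a degenerate contract set: a single "contract" whose satisfaction score is just the mean rating.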
Related papers
- From Global to Granular: Revealing IQA Model Performance via Correlation Surface [83.65597122328133]
We present Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models.
arXiv Detail & Related papers (2026-01-29T13:55:26Z) - SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment [52.656281676548645]
We introduce SingMOS-Pro, a dataset for automatic singing quality assessment. SingMOS-Pro expands the annotations to include lyrics, melody, and overall quality. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets.
arXiv Detail & Related papers (2025-10-02T08:53:49Z) - From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling [66.22134521383909]
We introduce a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling. Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases.
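The preference-comparison reformulation described above can be sketched as follows. This is an illustrative assumption about the construction, not the benchmark's actual pipeline: each pair of MOS-labeled items whose score gap exceeds a threshold yields one (winner, loser) preference pair, and pairs with very small MOS differences are dropped.

```python
# Hedged sketch: converting MOS-labeled items into preference pairs.
# Item names and the min_gap threshold are hypothetical.
from itertools import combinations

def mos_to_preferences(items, min_gap=0.0):
    """items: list of (item_id, mos) tuples.
    Returns (winner, loser) pairs for every pair whose MOS
    difference exceeds min_gap; near-ties are skipped."""
    pairs = []
    for (a, ma), (b, mb) in combinations(items, 2):
        if abs(ma - mb) > min_gap:
            pairs.append((a, b) if ma > mb else (b, a))
    return pairs

# The (u2, u3) pair differs by only 0.05 MOS and is dropped.
prefs = mos_to_preferences([("u1", 4.2), ("u2", 3.1), ("u3", 3.05)],
                           min_gap=0.1)
print(prefs)  # [('u1', 'u2'), ('u1', 'u3')]
```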
arXiv Detail & Related papers (2025-10-01T10:27:51Z) - Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models [56.055015597319674]
Reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs). Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs. We propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views.
arXiv Detail & Related papers (2025-08-01T08:09:14Z) - SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction [1.8862680628828246]
Evaluation of voice synthesis can be done using objective or subjective metrics. Speaker-Agnostic Latent Features Mean Opinion Score (SALF-MOS) is a small, end-to-end, highly generalized and scalable model for predicting MOS on a 5-point scale. We stack sequences of convolutions to obtain latent features of the audio samples, achieving strong results in terms of mean squared error (MSE), linear concordance correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall rank correlation coefficient (KTAU).
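The rank-based metrics named above (SRCC and KTAU) can be sketched in a few lines. This is a dependency-free illustration on made-up scores, ignoring tie handling that full implementations include.

```python
# Hedged sketch of the rank metrics used to evaluate MOS predictors:
# SRCC (Spearman) and KTAU (Kendall). No tie handling; data is illustrative.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def srcc(pred, true):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))."""
    rp, rt = ranks(pred), ranks(true)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rt))
    return 1 - 6 * d2 / (n * (n * n - 1))

def ktau(pred, true):
    """Kendall tau: (concordant - discordant) / total pairs."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred[i] - pred[j]) * (true[i] - true[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

pred = [3.9, 3.1, 4.4, 2.6]
true = [4.0, 3.0, 4.5, 2.5]
print(srcc(pred, true), ktau(pred, true))  # 1.0 1.0 (ranks agree perfectly)
```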
arXiv Detail & Related papers (2025-06-02T10:45:40Z) - Learning with Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration [20.671990508960906]
In view of the subjective bias of individual annotators, the labor-abundant mean opinion score (LA-MOS) typically requires a large collection of opinion scores from multiple annotators for each image.
In this paper, we aim to learn robust IQA models from low-cost MOS, which only requires very few opinion scores or even a single opinion score for each image.
To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels.
arXiv Detail & Related papers (2023-11-27T14:11:54Z) - MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose a general framework for MOS prediction based on pairwise comparison (MOSPC).
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
arXiv Detail & Related papers (2023-06-18T07:38:17Z) - Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z) - Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training- and post-training specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z) - Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.