MMMOS: Multi-domain Multi-axis Audio Quality Assessment
- URL: http://arxiv.org/abs/2507.04094v1
- Date: Sat, 05 Jul 2025 16:42:09 GMT
- Title: MMMOS: Multi-domain Multi-axis Audio Quality Assessment
- Authors: Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee,
- Abstract summary: Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness. MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's tau versus baseline.
- Score: 49.48516314472825
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
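The abstract describes fusing frame-level embeddings from several pretrained encoders and regressing four quality axes from the fused representation. Below is a minimal PyTorch-style sketch of that overall shape; the embedding dimensions, mean-pooling aggregation, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiAxisHead(nn.Module):
    """Fuses frame-level embeddings from several encoders and predicts four axis scores."""

    def __init__(self, enc_dims=(768, 1024, 768), hidden=256, n_axes=4):
        super().__init__()
        # One projection per encoder so differently sized embeddings share a width.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in enc_dims])
        # Regression head with one output per axis: Production Quality,
        # Production Complexity, Content Enjoyment, Content Usefulness.
        self.head = nn.Sequential(
            nn.Linear(hidden * len(enc_dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_axes),
        )

    def forward(self, frame_embs):
        # frame_embs: list of (batch, frames, dim) tensors, one per encoder.
        pooled = [proj(e).mean(dim=1) for proj, e in zip(self.proj, frame_embs)]  # mean-pool over frames
        fused = torch.cat(pooled, dim=-1)                                         # concatenate encoders
        return self.head(fused)                                                   # (batch, n_axes) scores


# Toy usage with random tensors standing in for WavLM / MuQ / M2D frame embeddings.
embeddings = [torch.randn(2, 100, d) for d in (768, 1024, 768)]
print(MultiAxisHead()(embeddings).shape)  # torch.Size([2, 4])
```

The sketch covers only the fusion-and-regression core; the paper additionally compares three aggregation strategies and four loss functions and ensembles the top eight models.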
Related papers
- JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs [0.0]
Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. We propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction (a generic sketch of the contrastive stage appears after this entry). Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining.
arXiv Detail & Related papers (2025-07-15T18:16:46Z)
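As a rough illustration of the JND-pair contrastive pretraining summarized above, here is a generic pairwise contrastive loss; treating within-JND pairs as positives and the margin formulation are assumptions for illustration, and none of the names come from the JSQA code.

```python
import torch
import torch.nn.functional as F

def jnd_contrastive_loss(emb_a, emb_b, within_jnd, margin=1.0):
    """Contrastive loss over audio pairs: pull perceptually identical (within-JND)
    pairs together, push clearly different pairs at least `margin` apart."""
    # emb_a, emb_b: (batch, dim) encoder embeddings of the two clips in each pair.
    # within_jnd: (batch,) float tensor, 1.0 where the pair is within the JND.
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = within_jnd * dist.pow(2)
    neg = (1.0 - within_jnd) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()


# Toy usage: 4 pairs of 128-d embeddings, the first two marked as within the JND.
a, b = torch.randn(4, 128), torch.randn(4, 128)
loss = jnd_contrastive_loss(a, b, torch.tensor([1.0, 1.0, 0.0, 0.0]))
```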
- MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation [81.26818054877658]
MMMG is a comprehensive benchmark for multimodal generation across 4 modality combinations. It is highly aligned with human evaluation, achieving an average agreement of 94.3%. GPT Image achieves 78.3% accuracy for image generation, but falls short on multimodal reasoning and interleaved generation.
arXiv Detail & Related papers (2025-05-23T08:21:28Z)
- MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix [50.71803775663387]
MMAR comprises 1,000 meticulously curated audio-question-answer triplets. MMAR extends existing benchmarks to a broad spectrum of real-world audio scenarios. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs).
arXiv Detail & Related papers (2025-05-19T12:18:42Z)
- Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
- Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models [60.72029578488467]
Adversarial audio attacks pose a significant threat to the growing use of large audio-language models (LALMs) in human-machine interactions. We introduce the Chat-Audio Attacks benchmark, which includes four distinct types of audio attacks. We evaluate six state-of-the-art LALMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others.
arXiv Detail & Related papers (2024-11-22T10:30:48Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning (a rough sketch of both stages follows this entry).
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
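A hedged sketch of the two MPL stages described in this entry: scoring unlabeled audio with a frozen pretrained assessor, then combining the ground-truth MOS objective with the pseudo-label objective. All names, output keys, and the loss weighting are hypothetical, not the MTQ-Net implementation.

```python
import torch
import torch.nn.functional as F

def stage1_pseudo_labels(pretrained_model, unlabeled_batches):
    """Stage 1: score unlabeled audio with a frozen, pretrained assessment model."""
    pretrained_model.eval()
    pseudo_scores = []
    with torch.no_grad():
        for features in unlabeled_batches:
            pseudo_scores.append(pretrained_model(features))
    return pseudo_scores

def stage2_multitask_loss(predictions, mos_target, pseudo_target, alpha=0.5):
    """Stage 2: train the student jointly on the ground-truth MOS task and the
    pseudo-label task, weighted by alpha."""
    mos_loss = F.mse_loss(predictions["mos"], mos_target)
    pseudo_loss = F.mse_loss(predictions["pseudo"], pseudo_target)
    return mos_loss + alpha * pseudo_loss
```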
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.