MMMOS: Multi-domain Multi-axis Audio Quality Assessment
- URL: http://arxiv.org/abs/2507.04094v1
- Date: Sat, 05 Jul 2025 16:42:09 GMT
- Title: MMMOS: Multi-domain Multi-axis Audio Quality Assessment
- Authors: Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee,
- Abstract summary: Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness. MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's tau versus baseline.
- Score: 49.48516314472825
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
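The abstract describes fusing frame-level embeddings from several pretrained encoders and regressing four quality axes from the fused representation. Below is a minimal PyTorch-style sketch of that overall shape; the embedding dimensions, mean-pooling aggregation, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiAxisHead(nn.Module):
    """Fuses frame-level embeddings from several encoders and predicts four axis scores."""

    def __init__(self, enc_dims=(768, 1024, 768), hidden=256, n_axes=4):
        super().__init__()
        # One projection per encoder so differently sized embeddings share a width.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in enc_dims])
        # Regression head with one output per axis: Production Quality,
        # Production Complexity, Content Enjoyment, Content Usefulness.
        self.head = nn.Sequential(
            nn.Linear(hidden * len(enc_dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_axes),
        )

    def forward(self, frame_embs):
        # frame_embs: list of (batch, frames, dim) tensors, one per encoder.
        pooled = [proj(e).mean(dim=1) for proj, e in zip(self.proj, frame_embs)]  # mean-pool over frames
        fused = torch.cat(pooled, dim=-1)                                         # concatenate encoders
        return self.head(fused)                                                   # (batch, n_axes) scores


# Toy usage with random tensors standing in for WavLM / MuQ / M2D frame embeddings.
embeddings = [torch.randn(2, 100, d) for d in (768, 1024, 768)]
print(MultiAxisHead()(embeddings).shape)  # torch.Size([2, 4])
```

The sketch covers only the fusion-and-regression core; the paper additionally compares three aggregation strategies and four loss functions and ensembles the top eight models.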
Related papers
- JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs [0.0]
Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. We propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction (a generic sketch of the contrastive stage appears after this entry). Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining.
arXiv Detail & Related papers (2025-07-15T18:16:46Z)
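As a rough illustration of the JND-pair contrastive pretraining summarized above, here is a generic pairwise contrastive loss; treating within-JND pairs as positives and the margin formulation are assumptions for illustration, and none of the names come from the JSQA code.

```python
import torch
import torch.nn.functional as F

def jnd_contrastive_loss(emb_a, emb_b, within_jnd, margin=1.0):
    """Contrastive loss over audio pairs: pull perceptually identical (within-JND)
    pairs together, push clearly different pairs at least `margin` apart."""
    # emb_a, emb_b: (batch, dim) encoder embeddings of the two clips in each pair.
    # within_jnd: (batch,) float tensor, 1.0 where the pair is within the JND.
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = within_jnd * dist.pow(2)
    neg = (1.0 - within_jnd) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()


# Toy usage: 4 pairs of 128-d embeddings, the first two marked as within the JND.
a, b = torch.randn(4, 128), torch.randn(4, 128)
loss = jnd_contrastive_loss(a, b, torch.tensor([1.0, 1.0, 0.0, 0.0]))
```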
- MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation [81.26818054877658]
MMMG is a comprehensive benchmark for multimodal generation across 4 modality combinations. It is highly aligned with human evaluation, achieving an average agreement of 94.3%. GPT Image achieves 78.3% accuracy for image generation, but falls short on multimodal reasoning and interleaved generation.
arXiv Detail & Related papers (2025-05-23T08:21:28Z)
- MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix [50.71803775663387]
MMAR comprises 1,000 meticulously curated audio-question-answer triplets. MMAR extends existing benchmarks to a broad spectrum of real-world audio scenarios. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs).
arXiv Detail & Related papers (2025-05-19T12:18:42Z)
- Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
- Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models [60.72029578488467]
Adversarial audio attacks pose a significant threat to the growing use of large audio-language models (LALMs) in human-machine interactions. We introduce the Chat-Audio Attacks benchmark, which includes four distinct types of audio attacks. We evaluate six state-of-the-art LALMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others.
arXiv Detail & Related papers (2024-11-22T10:30:48Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning (a rough sketch of both stages follows this entry).
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
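A hedged sketch of the two MPL stages described in this entry: scoring unlabeled audio with a frozen pretrained assessor, then combining the ground-truth MOS objective with the pseudo-label objective. All names, output keys, and the loss weighting are hypothetical, not the MTQ-Net implementation.

```python
import torch
import torch.nn.functional as F

def stage1_pseudo_labels(pretrained_model, unlabeled_batches):
    """Stage 1: score unlabeled audio with a frozen, pretrained assessment model."""
    pretrained_model.eval()
    pseudo_scores = []
    with torch.no_grad():
        for features in unlabeled_batches:
            pseudo_scores.append(pretrained_model(features))
    return pseudo_scores

def stage2_multitask_loss(predictions, mos_target, pseudo_target, alpha=0.5):
    """Stage 2: train the student jointly on the ground-truth MOS task and the
    pseudo-label task, weighted by alpha."""
    mos_loss = F.mse_loss(predictions["mos"], mos_target)
    pseudo_loss = F.mse_loss(predictions["pseudo"], pseudo_target)
    return mos_loss + alpha * pseudo_loss
```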
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.