Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis
- URL: http://arxiv.org/abs/2507.06116v1
- Date: Tue, 08 Jul 2025 16:00:13 GMT
- Title: Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis
- Authors: Xintong Hu, Yixuan Chen, Rui Yang, Wenxiang Guo, Changhao Pan,
- Abstract summary: This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. Despite the adoption of the MoE architecture and expanded dataset, the model's performance improvements in sentence-level prediction tasks remain limited.
- Score: 3.7818013031679683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model's performance improvements in sentence-level prediction tasks remain limited. Our work reveals the limitations of current methods in handling sentence-level quality assessment, provides new technical pathways for the field of automatic speech quality assessment, and also delves into the fundamental causes of performance differences across different assessment granularities.
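The abstract describes an MoE classification head placed on top of self-supervised speech embeddings, but does not publish the architecture. The sketch below is a minimal, illustrative interpretation only: the expert count, the softmax gating over a single pooled embedding, and the linear experts are all assumptions, not the authors' implementation.

```python
# Minimal sketch of a Mixture-of-Experts (MoE) head for MOS prediction.
# All architectural choices here (3 experts, softmax gating, linear experts,
# one pooled embedding as input) are illustrative assumptions.
import math
import random

random.seed(0)

EMB_DIM = 8      # dimensionality of a pooled wav2vec2-style embedding (assumed)
NUM_EXPERTS = 3  # e.g. one expert per assessment task type (assumed)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class MoEHead:
    """A gating network mixes the outputs of several linear MOS experts."""
    def __init__(self, emb_dim, num_experts):
        # Random weights stand in for trained parameters.
        self.gate_w = [[random.gauss(0, 0.1) for _ in range(emb_dim)]
                       for _ in range(num_experts)]
        self.expert_w = [[random.gauss(0, 0.1) for _ in range(emb_dim)]
                         for _ in range(num_experts)]
        self.expert_b = [3.0] * num_experts  # bias near the MOS scale midpoint

    def __call__(self, emb):
        gate_logits = [sum(w * x for w, x in zip(row, emb))
                       for row in self.gate_w]
        gates = softmax(gate_logits)  # per-expert mixing weights, sum to 1
        expert_scores = [sum(w * x for w, x in zip(row, emb)) + b
                         for row, b in zip(self.expert_w, self.expert_b)]
        # Predicted MOS is the gate-weighted combination of expert scores.
        return sum(g * s for g, s in zip(gates, expert_scores))

head = MoEHead(EMB_DIM, NUM_EXPERTS)
embedding = [random.gauss(0, 1) for _ in range(EMB_DIM)]
mos = head(embedding)
print(round(mos, 3))
```

In a real system the pooled embedding would come from a frozen or fine-tuned wav2vec2 encoder, and the gate and expert weights would be trained jointly on listener MOS labels.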
Related papers
- ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks. We propose our best model dubbed Elucidated Text-To-Audio (ETTA). ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z)
- Analyzing Persuasive Strategies in Meme Texts: A Fusion of Language Models with Paraphrase Enrichment [0.23020018305241333]
This paper describes our approach to hierarchical multi-label detection of persuasion techniques in meme texts.
The scope of the study encompasses enhancing model performance through innovative training techniques and data augmentation strategies.
arXiv Detail & Related papers (2024-07-01T20:25:20Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement [20.329872147913584]
We compare different methods of incorporating phonetic information in a speech enhancement model.
We observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance.
arXiv Detail & Related papers (2022-06-22T12:00:50Z)
- Speech Emotion Recognition using Self-Supervised Features [14.954994969217998]
We introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm.
Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed.
The proposed monomodal, speech-only system not only achieves SOTA results but also highlights the potential of powerful, well-finetuned self-supervised acoustic features.
arXiv Detail & Related papers (2022-02-07T00:50:07Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [31.59528815233441]
We propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query.
A Rank-aware (RC) network is designed to construct the multi-level contrastive optimization objectives.
We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z)
- Hybrid Autoregressive Transducer (HAT) [11.70833387055716]
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model.
It is a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems.
We evaluate our proposed model on a large-scale voice search task.
arXiv Detail & Related papers (2020-03-12T20:47:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.