APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
- URL: http://arxiv.org/abs/2504.20447v1
- Date: Tue, 29 Apr 2025 05:45:09 GMT
- Title: APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
- Authors: Zhicheng Lian, Lizhi Wang, Hua Huang
- Abstract summary: Speech quality assessment aims to quantify subjective human perception of speech. We propose an auditory perception guided-MOS prediction model (APG-MOS). APG-MOS integrates auditory modeling with semantic analysis to enhance consistency with human judgments.
- Score: 20.311467577180274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-intensive manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception guided-MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, that simulates cochlear functions and encodes acoustic signals into biologically aligned electrochemical representations. Second, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available on a public repository upon publication.
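The abstract names a residual cross-attention block as the fusion mechanism for the two streams but gives no implementation details, so everything below (dimensions, layer choices, pooling) is an assumption. A minimal PyTorch sketch of the general idea:

```python
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    """Illustrative residual cross-attention fusion block.

    The acoustic (cochlear-encoded) sequence queries the semantic
    sequence; the attended output is added back onto the acoustic
    stream via a residual connection. Shapes and sizes are
    assumptions, not taken from the APG-MOS paper.
    """
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, acoustic: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # acoustic: (batch, T_a, dim), semantic: (batch, T_s, dim)
        attended, _ = self.attn(query=acoustic, key=semantic, value=semantic)
        return self.norm(acoustic + attended)  # residual path

# Toy usage: fuse the two feature streams, then pool to one MOS score.
fusion = ResidualCrossAttention()
acoustic = torch.randn(2, 100, 256)   # cochlear-model features
semantic = torch.randn(2, 50, 256)    # RVQ-based semantic features
fused = fusion(acoustic, semantic)    # (2, 100, 256)
mos = nn.Linear(256, 1)(fused.mean(dim=1))  # (2, 1) predicted MOS
```

The residual path lets the acoustic stream pass through unchanged when the semantic context adds nothing, which is the usual motivation for residual attention designs.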
Related papers
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
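The core idea above, that a codebook trained only on clean speech yields larger quantization error on distorted input, can be sketched in a few lines. This is an illustrative Euclidean-distance version, not the exact VQScore formula (which the summary does not specify):

```python
import torch

def vq_quantization_error(encoder_out: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Mean distance between encoder outputs and their nearest codewords.

    A codebook trained only on clean speech should match clean inputs
    closely, so a larger mean error suggests more distortion.
    Tensor shapes are illustrative assumptions.
    """
    # encoder_out: (T, d) frame embeddings; codebook: (K, d) codewords
    dists = torch.cdist(encoder_out, codebook)  # (T, K) pairwise distances
    nearest = dists.min(dim=1).values           # distance to closest codeword
    return nearest.mean()                       # lower -> closer to "clean"
```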
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
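A hedged sketch of the two-stage MPL objective described above: pseudo-labels come from a pretrained scorer in stage one, then the student regresses the true MOS plus those pseudo-targets in stage two. The task names (`pesq`, `stoi`) and the 0.5 weight are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def mpl_loss(preds: dict, pseudo: dict, true_mos: torch.Tensor) -> torch.Tensor:
    """Illustrative multi-task pseudo-label objective.

    Stage 1 (done offline): a pretrained assessment model scores the
    training audio, yielding pseudo-labels for auxiliary metrics.
    Stage 2: the student learns the true MOS and the pseudo-labels jointly.
    """
    loss = F.mse_loss(preds["mos"], true_mos)
    for task in ("pesq", "stoi"):  # auxiliary pseudo-labeled targets
        loss = loss + 0.5 * F.mse_loss(preds[task], pseudo[task])
    return loss
```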
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
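To make the contrast with plain mAP concrete, here is a simplified single-label sketch in which a false positive is discounted by its ontology distance to the target class; OmAP's actual weighting over the AudioSet ontology graph may differ:

```python
import numpy as np

def ontology_weighted_ap(scores, labels, onto_dist, target_cls):
    """Illustrative ontology-aware average precision for one class.

    A false positive on a class close to the target in the ontology is
    penalized less than one on a distant class. `onto_dist[c]` is an
    assumed ontology graph distance from class c to `target_cls`.
    scores: (N, C) prediction matrix; labels: (N,) single-label targets.
    """
    order = np.argsort(-scores[:, target_cls])  # rank by confidence
    tp, weighted_fp, precisions = 0.0, 0.0, []
    for i in order:
        if labels[i] == target_cls:
            tp += 1.0
            precisions.append(tp / (tp + weighted_fp))
        else:
            # discount near-miss confusions by ontology proximity
            weighted_fp += onto_dist[labels[i]] / onto_dist.max()
    return float(np.mean(precisions)) if precisions else 0.0
```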
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [31.59528815233441]
We propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs. 0.964 in seen noise environments) and 0.012 (0.969 vs. 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction.
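For reference, the LCC reported above is Pearson's correlation between predicted and ground-truth scores; a minimal NumPy version (the function name is mine):

```python
import numpy as np

def lcc(pred: np.ndarray, true: np.ndarray) -> float:
    """Linear correlation coefficient (Pearson's r) between predicted
    and ground-truth quality scores."""
    p, t = pred - pred.mean(), true - true.mean()
    return float((p * t).sum() / (np.sqrt((p**2).sum() * (t**2).sum()) + 1e-12))

# e.g. lcc(predicted_pesq, measured_pesq) -> 0.990 would match the
# seen-noise result quoted above (inputs here are hypothetical).
```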
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling [16.43844160498413]
Several recent papers have proposed deep-learning-based assessment models.
We propose three models using cluster-based modeling methods.
We show that the GQT layer helps to predict human assessment better by automatically learning the task.
arXiv Detail & Related papers (2020-08-09T11:14:19Z)