CCATMos: Convolutional Context-aware Transformer Network for
Non-intrusive Speech Quality Assessment
- URL: http://arxiv.org/abs/2211.02577v1
- Date: Fri, 4 Nov 2022 16:46:11 GMT
- Title: CCATMos: Convolutional Context-aware Transformer Network for
Non-intrusive Speech Quality Assessment
- Authors: Yuchen Liu, Li-Chia Yang, Alex Pawlicki, Marko Stamenovic
- Abstract summary: We propose a novel end-to-end model structure called Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters.
We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge.
- Score: 12.497279501767606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech quality assessment has been a critical component in many voice
communication related applications such as telephony and online conferencing.
Traditional intrusive speech quality assessment requires the clean reference of
the degraded utterance to provide an accurate quality measurement. This
requirement limits the usability of these methods in real-world scenarios. On
the other hand, non-intrusive subjective measurement is the "gold standard"
in evaluating speech quality as human listeners can intrinsically evaluate the
quality of any degraded speech with ease. In this paper, we propose a novel
end-to-end model structure called Convolutional Context-Aware Transformer
(CCAT) network to predict the mean opinion score (MOS) of human raters. We
evaluate our model on three MOS-annotated datasets spanning multiple languages
and distortion types and submit our results to the ConferencingSpeech 2022
Challenge. Our experiments show that CCAT provides promising MOS predictions
compared to current state-of-the-art non-intrusive speech assessment models,
with the average Pearson correlation coefficient (PCC) increasing from 0.530 to
0.697 and the average RMSE decreasing from 0.768 to 0.570 relative to the baseline model
on the challenge evaluation test set.
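To make the architecture concrete, below is a minimal PyTorch sketch of a convolutional-plus-transformer MOS predictor in the spirit of CCAT. The mel-spectrogram input, layer sizes, and the sigmoid mapping to the 1-5 MOS range are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvTransformerMOS(nn.Module):
    """Illustrative CNN + Transformer MOS predictor (not the exact CCAT design)."""
    def __init__(self, n_mels=80, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Convolutional front end: local time-frequency feature extraction.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Transformer encoder: long-range (context-aware) modeling over frames.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # utterance-level MOS head

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        x = self.conv(mel).transpose(1, 2)  # -> (batch, frames, d_model)
        x = self.encoder(x).mean(dim=1)     # average-pool frame context
        return 1.0 + 4.0 * torch.sigmoid(self.head(x)).squeeze(-1)
```

Average pooling over frames is just one simple way to reduce frame-level context to an utterance-level score; attention pooling is a common alternative.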
Related papers
- Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses open evaluation issues in text-to-audio generation through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024.
We present an evaluation protocol combining an objective metric, the Fréchet Audio Distance, with perceptual assessments, using a structured prompt format to enable diverse captions and effective evaluation.
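For background on the objective metric: the Fréchet Audio Distance fits a Gaussian to embedding sets from reference and generated audio and computes the Fréchet distance between the two fits. The sketch below assumes precomputed embeddings (e.g., VGGish) and is a generic implementation, not the challenge's exact code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """Fréchet distance between Gaussian fits of two embedding sets,
    each of shape (n_clips, embed_dim)."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```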
arXiv Detail & Related papers (2024-10-23T06:35:41Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
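A minimal sketch of zero-shot LLM judging in this spirit; the prompt wording, JSON output format, and 0-100 scale are assumptions for illustration, and the LLM call itself is left abstract rather than tied to any specific API.

```python
import json

def judge_prompt(candidate: str, reference: str) -> str:
    """Build a zero-shot judging prompt (illustrative, not CLAIR-A's exact text)."""
    return (
        "You are evaluating a machine-generated audio caption against a human one.\n"
        f"Candidate: {candidate}\n"
        f"Reference: {reference}\n"
        'Reply with JSON: {"score": <0-100>, "reason": "<one sentence>"}'
    )

def parse_judgement(llm_reply: str) -> float:
    """Extract the numeric score from the model's JSON reply."""
    return float(json.loads(llm_reply)["score"])
```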
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context [7.567181073057191]
This paper introduces a novel approach in which the system learns at the audio level rather than the segment level, despite data scarcity.
It shows that the ASR-based Wav2Vec2 model brings the best results, which may indicate a strong correlation between ASR and speech quality assessment.
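As an illustration of the approach, here is a minimal sketch of a Wav2Vec2 encoder repurposed for utterance-level quality regression with Hugging Face transformers; the checkpoint name and the mean-pool-plus-linear head are assumptions, not the paper's exact setup.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class Wav2Vec2QualityRegressor(nn.Module):
    """Pretrained speech encoder + scalar quality head (illustrative)."""
    def __init__(self, checkpoint="facebook/wav2vec2-base-960h"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden.mean(dim=1)).squeeze(-1)   # utterance-level score
```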
arXiv Detail & Related papers (2024-03-29T13:59:34Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
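A minimal sketch of the underlying signal: the distance between encoder latents and their nearest codebook entries grows as input speech drifts from the clean-speech training distribution. Shapes and the Euclidean distance are assumptions; the paper's exact scoring differs in detail.

```python
import torch

def vq_quantization_error(latents, codebook):
    """Mean per-frame quantization error for one utterance.

    latents:  (frames, dim) continuous encoder outputs
    codebook: (n_codes, dim) VQ-VAE codebook trained on clean speech only
    """
    dists = torch.cdist(latents, codebook)   # (frames, n_codes)
    nearest = dists.min(dim=1).values        # distance to closest code per frame
    return nearest.mean().item()             # larger -> likely more distorted
```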
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- SpeechLMScore: Evaluating speech generation using speech language model [43.20067175503602]
We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model.
It does not require human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
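Conceptually, the metric is the average log-likelihood of a discrete speech-unit sequence under a speech language model. The sketch below leaves the unit tokenizer and the language model abstract, so the interface shown is an assumption.

```python
def speech_lm_score(token_ids, log_prob_fn):
    """Mean log-likelihood of a discrete speech-unit sequence.

    token_ids:   list of unit ids (e.g., from a HuBERT quantizer)
    log_prob_fn: callable(prefix, next_id) -> log P(next_id | prefix)
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        total += log_prob_fn(token_ids[:t], token_ids[t])
    return total / max(len(token_ids) - 1, 1)  # higher -> more natural speech
```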
arXiv Detail & Related papers (2022-12-08T21:00:15Z)
- Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset [71.93633698146002]
The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels.
This study examines how much of the variance in subjective speech-quality ratings can be explained by metadata and by the dataset's distribution imbalances.
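A toy illustration of this kind of analysis: fit a linear model from metadata features to ratings and read off R² as the variance explained. The data below are synthetic and the one-hot feature encoding is an assumption, not the study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Stand-in metadata one-hots (rater, system, listening setup, ...) and ratings.
X = rng.integers(0, 2, size=(500, 12)).astype(float)
y = X @ rng.normal(size=12) + rng.normal(scale=0.5, size=500)

model = LinearRegression().fit(X, y)
print(f"variance explained (R^2): {model.score(X, y):.3f}")
```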
arXiv Detail & Related papers (2022-09-14T00:45:49Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves −0.01 CMOS relative to human recordings at the sentence level.
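For context on the number: CMOS aggregates side-by-side comparative ratings (commonly on a −3 to +3 scale), so a mean near zero indicates parity with human recordings. A toy computation:

```python
import numpy as np

# Toy listener judgements of system audio vs. the paired human recording.
ratings = np.array([0, 1, -1, 0, 0, -1, 1, 0])
print(f"CMOS: {ratings.mean():+.2f}")  # near 0 -> on par with human recordings
```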
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores from a test speech signal as input.
We show that MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
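A minimal sketch of the multi-objective idea: a shared representation feeds separate regression heads for quality, intelligibility, and distortion. Dimensions, output ranges, and activations are assumptions, not MOSA-Net's actual design.

```python
import torch
import torch.nn as nn

class MultiObjectiveHead(nn.Module):
    """Shared features -> separate PESQ / STOI / SDI regressors (illustrative)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.pesq = nn.Linear(feat_dim, 1)  # quality
        self.stoi = nn.Linear(feat_dim, 1)  # intelligibility, squashed to (0, 1)
        self.sdi = nn.Linear(feat_dim, 1)   # distortion index

    def forward(self, feats):               # feats: (batch, feat_dim)
        return {
            "pesq": self.pesq(feats).squeeze(-1),
            "stoi": torch.sigmoid(self.stoi(feats)).squeeze(-1),
            "sdi": self.sdi(feats).squeeze(-1),
        }
```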
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
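A minimal sketch of listener-dependent conditioning: a listener-ID embedding joins the utterance features, and averaging predictions over all listener IDs gives a simple mean-listener inference mode. Layer sizes and the exact inference procedure are assumptions, not LDNet's actual design.

```python
import torch
import torch.nn as nn

class ListenerDependentHead(nn.Module):
    """Listener-conditioned MOS head (illustrative)."""
    def __init__(self, feat_dim=128, n_listeners=300, emb_dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_listeners, emb_dim)
        self.out = nn.Linear(feat_dim + emb_dim, 1)

    def forward(self, feats, listener_id):  # feats: (batch, feat_dim)
        z = self.emb(listener_id)           # (batch, emb_dim)
        return self.out(torch.cat([feats, z], dim=-1)).squeeze(-1)

    def predict_mean_listener(self, feats):
        """Average predictions over every known listener id."""
        ids = torch.arange(self.emb.num_embeddings)
        scores = [self(feats, i.expand(feats.size(0))) for i in ids]
        return torch.stack(scores).mean(dim=0)
```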
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with an improved beam search, achieves a WER only 3.8% absolute worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
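Since the gap is quoted in absolute WER, here is a standard word error rate computation (word-level edit distance) for reference; it is generic, not the paper's scoring script.

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref_words)

print(wer("the cat sat".split(), "the cat sat down".split()))  # one insertion -> 0.33
```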
arXiv Detail & Related papers (2020-04-22T19:08:33Z)