DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to
evaluate Noise Suppressors
- URL: http://arxiv.org/abs/2010.15258v2
- Date: Wed, 10 Feb 2021 22:37:29 GMT
- Title: DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to
evaluate Noise Suppressors
- Authors: Chandan K A Reddy, Vishak Gopal, Ross Cutler
- Abstract summary: This paper introduces a multi-stage self-teaching based perceptual objective metric to evaluate noise suppressors.
The proposed method generalizes well in challenging test conditions with a high correlation to human ratings.
- Score: 15.209645076557054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human subjective evaluation is the gold standard to evaluate speech quality
optimized for human perception. Perceptual objective metrics serve as a proxy
for subjective scores. The conventional and widely used metrics require a
reference clean speech signal, which is unavailable in real recordings. The
no-reference approaches correlate poorly with human ratings and are not widely
adopted in the research community. One of the biggest use cases of these
perceptual objective metrics is to evaluate noise suppression algorithms. This
paper introduces a multi-stage self-teaching based perceptual objective metric
that is designed to evaluate noise suppressors. The proposed method generalizes
well in challenging test conditions with a high correlation to human ratings.
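As a rough illustration of how a non-intrusive metric like this is applied in practice, here is a minimal sketch that scores a single clip with an ONNX export of a MOS predictor. The file name "dnsmos.onnx", the 16 kHz sample rate, and the 9-second input window are assumptions made for the sketch, not specifications from the paper.

```python
# Minimal sketch: score one clip with a non-intrusive MOS predictor
# exported to ONNX. Model file name, sample rate, and window length
# are illustrative assumptions, not the paper's published spec.
import numpy as np
import soundfile as sf
import onnxruntime as ort

SR = 16000          # assumed model sample rate
SEG_SECONDS = 9     # assumed fixed-length input window

def predict_mos(wav_path: str, model_path: str = "dnsmos.onnx") -> float:
    audio, sr = sf.read(wav_path, dtype="float32")  # assumes mono audio
    assert sr == SR, f"expected {SR} Hz audio, got {sr} Hz"
    seg_len = SR * SEG_SECONDS
    if len(audio) < seg_len:                        # pad short clips,
        audio = np.pad(audio, (0, seg_len - len(audio)))
    segment = audio[:seg_len][None, :]              # trim long ones; (1, T)

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: segment})
    return float(np.squeeze(outputs[0]))            # assumes one MOS output
```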
Related papers
- Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation [3.549112490210998]
We comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions.
We show that FSD and SMMD can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible.
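Assuming FSD follows the standard Fréchet (FID-style) formulation over pooled speech embeddings, which the summary does not spell out, a minimal computation looks like this:

```python
# Hedged sketch of a Fréchet distance between two embedding sets,
# assuming the standard FID-style formulation:
#   d^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """emb_*: (num_clips, embedding_dim) arrays of pooled embeddings."""
    mu1, mu2 = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    c1 = np.cov(emb_real, rowvar=False)
    c2 = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # numerical noise can introduce
        covmean = covmean.real     # tiny imaginary components
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```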
arXiv Detail & Related papers (2026-01-29T08:20:52Z)
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content.
We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
arXiv Detail & Related papers (2026-01-09T22:01:56Z)
- SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction [1.8862680628828246]
Evaluation of voice synthesis can be done using objective or subjective metrics.
Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS) is a small, end-to-end, highly generalized and scalable model for predicting MOS on a 5-point scale.
We stack sequences of convolutions to obtain latent features of the audio samples, achieving the best results on mean squared error (MSE), linear concordance correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall rank correlation coefficient (KTAU).
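For reference, these agreement metrics are standard in MOS prediction and straightforward to compute with scipy. The sketch below is illustrative and independent of the SALF-MOS model itself; LCC is computed here as the Pearson (linear) correlation, one common reading of the abbreviation.

```python
# Standard agreement metrics between predicted and human MOS:
# MSE, linear (Pearson) correlation, Spearman rank correlation,
# and Kendall tau. Illustrative sketch only.
import numpy as np
from scipy import stats

def mos_agreement(pred: np.ndarray, human: np.ndarray) -> dict:
    return {
        "MSE": float(np.mean((pred - human) ** 2)),
        "LCC": float(stats.pearsonr(pred, human)[0]),    # linear corr.
        "SRCC": float(stats.spearmanr(pred, human)[0]),  # rank corr.
        "KTAU": float(stats.kendalltau(pred, human)[0]),
    }

# Example: a constant offset keeps MSE > 0 while all three
# correlations stay at 1.0, since the ranking is perfect.
# mos_agreement(np.array([1.5, 2.5, 3.5]), np.array([2.0, 3.0, 4.0]))
```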
arXiv Detail & Related papers (2025-06-02T10:45:40Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge [19.810337081901178]
Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals.
This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain.
The UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain.
arXiv Detail & Related papers (2024-02-02T13:45:42Z)
- Human Feedback is not Gold Standard [28.63384327791185]
We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS), which fits human preference from quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- SpeechLMScore: Evaluating speech generation using speech language model [43.20067175503602]
We propose SpeechLMScore, an unsupervised metric for evaluating generated speech with a speech language model.
It requires no human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
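If the score is, as the name suggests, an average token log-likelihood of discretized speech units under an autoregressive speech LM, a schematic version can be written as follows; `encode_to_units` and `token_logprob` are hypothetical stand-ins for a speech tokenizer and a trained model.

```python
# Schematic SpeechLMScore-style metric: mean log-likelihood of
# discretized speech units under an autoregressive speech LM.
# `encode_to_units` and `token_logprob` are hypothetical stand-ins,
# not APIs from the paper's released code.
import numpy as np
from typing import Callable, Sequence

def speech_lm_score(
    wav: np.ndarray,
    encode_to_units: Callable[[np.ndarray], Sequence[int]],
    token_logprob: Callable[[Sequence[int], int], float],
) -> float:
    units = encode_to_units(wav)       # e.g. clustered SSL features
    total = 0.0
    for t in range(1, len(units)):
        # log p(u_t | u_1 .. u_{t-1}) under the speech LM
        total += token_logprob(units[:t], units[t])
    return total / max(len(units) - 1, 1)   # higher = more speech-like
```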
arXiv Detail & Related papers (2022-12-08T21:00:15Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
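To make the contrast concrete, here is plain (ontology-agnostic) mAP as usually computed for audio tagging; OmAP's ontology weighting, per the summary above, modifies this step and is not reproduced here.

```python
# Plain mAP for multi-label audio tagging: per-class average
# precision, then an unweighted mean over classes. Illustrative
# baseline only; does not implement OmAP's ontology weighting.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true: (clips, classes) binary labels; y_score: same-shape scores."""
    per_class = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()            # skip classes with no positives
    ]
    return float(np.mean(per_class))
```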
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.
arXiv Detail & Related papers (2022-11-22T08:14:07Z)
- Evaluating generative audio systems and their metrics [80.97828572629093]
This paper investigates state-of-the-art approaches side-by-side with (i) a set of previously proposed objective metrics for audio reconstruction, and (ii) a listening study.
Results indicate that currently used objective metrics are insufficient to describe the perceptual quality of current systems.
arXiv Detail & Related papers (2022-08-31T21:48:34Z)
- HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
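As an illustration of the unified multi-task idea, here is a hedged PyTorch sketch of a network with one shared encoder and two regression heads; the BLSTM encoder and mean pooling are illustrative guesses, not HASA-Net's actual architecture.

```python
# Hedged sketch of a HASA-Net-style idea: one non-intrusive network
# maps a spectrogram to both a quality and an intelligibility score.
# The BLSTM + two linear heads below are an illustrative guess, not
# the paper's published design.
import torch
import torch.nn as nn

class TwoHeadAssessor(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        self.quality_head = nn.Linear(2 * hidden, 1)
        self.intellig_head = nn.Linear(2 * hidden, 1)

    def forward(self, spec: torch.Tensor):
        """spec: (batch, frames, n_mels) log-mel features."""
        frames, _ = self.encoder(spec)
        pooled = frames.mean(dim=1)      # utterance-level embedding
        return self.quality_head(pooled), self.intellig_head(pooled)

# quality, intelligibility = TwoHeadAssessor()(torch.randn(4, 300, 80))
```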
arXiv Detail & Related papers (2021-11-10T14:10:13Z)