The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement
- URL: http://arxiv.org/abs/2406.03460v1
- Date: Wed, 5 Jun 2024 17:07:39 GMT
- Title: The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement
- Authors: Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann
- Abstract summary: We introduce enhancement models that exploit the widely used PESQ measure.
While the obtained PESQ value of 3.82 would imply "state-of-the-art" PESQ-performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading.
- Score: 17.516851319183555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not see. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation. For this, we introduce enhancement models that exploit the widely used PESQ measure. Our "PESQetarian" model achieves 3.82 PESQ on VB-DMD while scoring very poorly in a listening experiment. While the obtained PESQ value of 3.82 would imply "state-of-the-art" PESQ-performance on the VB-DMD benchmark, our examples show that when optimizing w.r.t. a metric, an isolated evaluation on the same metric may be misleading. Instead, other metrics should be included in the evaluation and the resulting performance predictions should be confirmed by listening.
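As a concrete illustration of the evaluation the paper targets, the snippet below computes wideband PESQ for a single utterance with the `pesq` package (a wrapper around the ITU-T P.862 reference code); the file names are placeholders. The paper's warning applies directly: this one number should be cross-checked against other metrics and against listening, not reported in isolation.

```python
# pip install pesq soundfile
# Minimal sketch: scoring one enhanced utterance against its clean reference.
# "clean.wav" / "enhanced.wav" are hypothetical 16 kHz mono files.
import soundfile as sf
from pesq import pesq

clean, fs = sf.read("clean.wav")       # clean reference signal
enhanced, _ = sf.read("enhanced.wav")  # output of the enhancement model

# "wb" = wideband PESQ, the variant reported on the VB-DMD benchmark.
score = pesq(fs, clean, enhanced, "wb")
print(f"PESQ: {score:.2f}")  # a high value alone does not guarantee good perceptual quality
```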
Related papers
- Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [69.38024658668887]
The current evaluation method for event extraction relies on token-level exact match.
We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level.
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Characterizing and Measuring Linguistic Dataset Drift [65.28821163863665]
We propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift.
These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies.
We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies.
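The vocabulary dimension is straightforward to approximate. The sketch below measures it as a Jensen-Shannon divergence between unigram frequency distributions of two toy corpora; the choice of JS divergence (rather than whatever divergence the paper actually uses) and the toy tokenization are assumptions made purely for illustration.

```python
# Toy sketch of vocabulary drift: Jensen-Shannon divergence between the
# unigram frequency distributions of a source and a target corpus.
from collections import Counter
import math

def unigram_dist(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def js_divergence(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source = "the model enhances noisy speech recordings".split()
target = "the network denoises reverberant audio signals".split()
vocab = sorted(set(source) | set(target))

drift = js_divergence(unigram_dist(source, vocab), unigram_dist(target, vocab))
print(f"vocabulary drift (JSD): {drift:.3f}")  # 0 = identical, 1 = disjoint vocabularies
```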
arXiv Detail & Related papers (2023-05-26T17:50:51Z)
- Hyperparameter Tuning and Model Evaluation in Causal Effect Estimation [2.7823528791601686]
This paper investigates the interplay between the four different aspects of model evaluation for causal effect estimation.
We find that most causal estimators are roughly equivalent in performance if tuned thoroughly enough.
We call for more research into causal model evaluation to unlock the optimum performance not currently being delivered even by state-of-the-art procedures.
arXiv Detail & Related papers (2023-03-02T17:03:02Z)
- Metric-oriented Speech Enhancement using Diffusion Probabilistic Model [23.84172431047342]
Deep neural network based speech enhancement techniques focus on learning a noisy-to-clean transformation supervised by paired training data.
Task-specific evaluation metrics (e.g., PESQ) are usually non-differentiable and cannot be used directly in the training criterion.
We propose a metric-oriented speech enhancement method (MOSE) which integrates a metric-oriented training strategy into its reverse process.
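Since exact PESQ is non-differentiable, one common workaround in the same spirit as such metric-oriented training, though not necessarily MOSE's actual diffusion-based mechanism, is a learned surrogate: a small network is fitted to predict the metric and then used as a differentiable proxy loss. A hedged PyTorch sketch:

```python
# Hedged sketch: a neural surrogate that predicts a normalized metric score
# from (enhanced, clean) spectrogram pairs and acts as a differentiable proxy
# loss for the enhancement model. All shapes and the architecture are toy.
import torch
import torch.nn as nn

class MetricPredictor(nn.Module):
    """Maps a pair of magnitude spectrograms to a scalar metric estimate."""
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),  # normalized score in [0, 1]
        )

    def forward(self, enhanced_mag, clean_mag):
        x = torch.cat([enhanced_mag, clean_mag], dim=-1)
        return self.net(x).mean(dim=1)  # average over time frames

predictor = MetricPredictor()
# In practice, `predictor` is first fitted on (enhanced, clean, true PESQ)
# triples with an MSE loss; afterwards the frozen predictor supplies gradients:
enhanced = torch.rand(4, 100, 257, requires_grad=True)  # stand-in model output
clean = torch.rand(4, 100, 257)                          # batch, frames, bins
proxy_loss = (1.0 - predictor(enhanced, clean)).mean()   # push score toward 1
proxy_loss.backward()  # gradients flow back into the enhancement model
```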
arXiv Detail & Related papers (2023-02-23T13:12:35Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
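A toy numerical illustration of the idea (not the paper's actual OmAP formula): give partial credit when the predicted class is ontologically close to the true one. The distance matrix and the graded-relevance rule below are invented purely for demonstration.

```python
# Toy illustration only: partial credit for ontologically close predictions.
# The distance matrix and the "graded relevance" rule are invented; the paper
# derives its weighting from the AudioSet ontology instead.
import numpy as np
from sklearn.metrics import average_precision_score

dist = np.array([[0.0, 1.0, 3.0],    # classes 0 and 1 are ontological siblings,
                 [1.0, 0.0, 3.0],    # class 2 sits in a distant branch
                 [3.0, 3.0, 0.0]])
relevance = 1.0 - dist / dist.max()  # 1 on the diagonal, smaller with distance

y_true = np.array([0, 0, 1, 2])      # true class of each audio clip
rng = np.random.default_rng(0)
scores = rng.random((4, 3))          # model confidence per clip and class

# Plain mAP: binary targets, so a sibling confusion is as wrong as any other.
binary = np.eye(3)[y_true]
map_plain = np.mean([average_precision_score(binary[:, c], scores[:, c])
                     for c in range(3)])

# Ontology-aware variant: near misses above a relevance threshold still count.
graded = relevance[y_true]
omap_like = np.mean([average_precision_score(graded[:, c] > 0.5, scores[:, c])
                     for c in range(3)])
print(map_plain, omap_like)
```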
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
- MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data [26.94528951545861]
We propose a "de-generator" which attempts to improve the robustness of the prediction network.
Experimental results on the VoiceBank-DEMAND dataset show a relative improvement in PESQ score of 3.8%.
arXiv Detail & Related papers (2022-03-23T12:42:28Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement [37.3251779254894]
We propose MetricGAN+, which incorporates three training techniques based on domain knowledge of speech processing.
With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase PESQ score by 0.3 compared to the previous MetricGAN.
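The core MetricGAN trick, simplified and hedged below, is to make the metric "differentiable" indirectly: a discriminator D is trained to regress the normalized PESQ of enhanced speech, and the generator G is then trained to drive D's prediction toward the maximum score. The models, shapes, and the placeholder metric call are toy stand-ins, and MetricGAN+'s three additional techniques are not shown.

```python
# Minimal MetricGAN-style sketch (not the full MetricGAN+ recipe).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(257, 257), nn.Sigmoid())   # toy mask estimator
D = nn.Sequential(nn.Linear(2 * 257, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())      # predicts score in [0, 1]

def normalized_pesq(enhanced, clean):
    # Placeholder: in practice, run PESQ on waveforms and map its range to [0, 1].
    return torch.rand(enhanced.shape[0], 1)

noisy_mag = torch.rand(8, 257)                          # batch of spectral frames
clean_mag = torch.rand(8, 257)

# --- Discriminator step: learn to predict the true metric of G's output ---
enhanced = (G(noisy_mag) * noisy_mag).detach()
d_loss = nn.functional.mse_loss(D(torch.cat([enhanced, clean_mag], dim=-1)),
                                normalized_pesq(enhanced, clean_mag))

# --- Generator step: push D's predicted score toward the maximum (1.0) ---
enhanced = G(noisy_mag) * noisy_mag
g_loss = nn.functional.mse_loss(D(torch.cat([enhanced, clean_mag], dim=-1)),
                                torch.ones(8, 1))
# In a real loop, D and G alternate updates with separate optimizers (omitted).
```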
arXiv Detail & Related papers (2021-04-08T06:46:35Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)