Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation
- URL: http://arxiv.org/abs/2601.21386v1
- Date: Thu, 29 Jan 2026 08:20:52 GMT
- Title: Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation
- Authors: June-Woo Kim, Dhruv Agarwal, Federica Cerina,
- Abstract summary: We comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We show that FSD and SMMD can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible.
- Score: 3.549112490210998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard but are costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
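For context, FSD follows the Fréchet Inception Distance recipe: fit a Gaussian to pooled real and synthetic speech embeddings and compute the Fréchet (2-Wasserstein) distance between the two Gaussians, while SMMD is a kernel two-sample maximum mean discrepancy over the same embeddings. The sketch below is a minimal illustration of both quantities, not the repository's implementation; it assumes embeddings (e.g., from WavLM Base+) have already been extracted as NumPy arrays of shape (num_utterances, dim), and the function names and RBF bandwidth are illustrative choices.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real: np.ndarray, synth: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two embedding sets:
    ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^{1/2}).
    """
    mu_r, mu_s = real.mean(axis=0), synth.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_s = np.cov(synth, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))

def mmd_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    """Biased estimate of squared MMD with an RBF kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then a Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())
```

Given arrays real and synth of utterance-level embeddings, frechet_distance(real, synth) and mmd_rbf(real, synth) both yield lower-is-better scores; the paper's central point is that how well such scores track human ratings depends strongly on which embedding model produced the inputs.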
Related papers
- SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation [52.468945848774844]
This paper addresses the need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four different dimensions of evaluation.
arXiv Detail & Related papers (2026-01-27T15:29:02Z) - VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency [28.98083807303608]
Speech-LLMs show strong performance in many applications, but their robustness is critically under-tested, especially to speech disfluency. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments.
arXiv Detail & Related papers (2025-10-17T08:01:41Z) - SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions [54.34001921326444]
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks.
arXiv Detail & Related papers (2025-09-21T14:11:16Z) - SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction [1.8862680628828246]
Evaluation of voice synthesis can be done using objective or subjective metrics. Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS) is a small, end-to-end, highly generalized and scalable model for predicting MOS scores on a scale of 5. Stacked convolutional layers extract latent features from the audio samples, achieving the best results on mean squared error (MSE), linear concordance correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall rank correlation coefficient (KTAU); a minimal sketch of these metrics appears after this list.
arXiv Detail & Related papers (2025-06-02T10:45:40Z) - Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation [12.954531089716008]
The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously. We show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems. We propose two refined variants of the MUSHRA test.
arXiv Detail & Related papers (2024-11-19T18:37:45Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback [39.54647336161013]
We propose a sampling-annotating-learning framework tailored to text-to-speech (TTS) optimization.
We show that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity.
We also show that UNO can seamlessly and flexibly adapt to a desired speaking style in emotional TTS.
arXiv Detail & Related papers (2024-06-02T07:54:33Z) - Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS), which fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z) - NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z) - DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors [15.209645076557054]
This paper introduces a multi-stage, self-teaching-based perceptual objective metric for evaluating noise suppressors.
The proposed method generalizes well in challenging test conditions with a high correlation to human ratings.
arXiv Detail & Related papers (2020-10-28T22:19:51Z)
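As a side note on the evaluation metrics cited above (e.g., by SALF-MOS), MOS predictors are conventionally scored against human ratings with a squared-error term plus linear and rank correlations. The snippet below is a minimal sketch using SciPy; the rating arrays are made-up stand-ins, and LCC is computed here as Pearson's linear correlation, its usual reading in MOS-prediction work.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Made-up stand-ins for per-utterance human MOS and a model's predictions.
human_mos = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 1.7])
predicted = np.array([4.0, 3.3, 2.9, 4.5, 3.6, 2.0])

mse = float(np.mean((human_mos - predicted) ** 2))  # lower is better
lcc, _ = pearsonr(human_mos, predicted)             # linear agreement
srcc, _ = spearmanr(human_mos, predicted)           # monotonic (rank) agreement
ktau, _ = kendalltau(human_mos, predicted)          # pairwise-order agreement

print(f"MSE={mse:.3f}  LCC={lcc:.3f}  SRCC={srcc:.3f}  KTAU={ktau:.3f}")
```

High SRCC/KTAU with low MSE indicates a predictor that both ranks systems like human raters do and matches the absolute rating scale.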