The State Of TTS: A Case Study with Human Fooling Rates
- URL: http://arxiv.org/abs/2508.04179v1
- Date: Wed, 06 Aug 2025 08:04:21 GMT
- Title: The State Of TTS: A Case Study with Human Fooling Rates
- Authors: Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra
- Abstract summary: We introduce Human Fooling Rate (HFR), a metric that measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights.
- Score: 17.046410804692332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
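As a minimal sketch (not the authors' released code), HFR can be computed as the fraction of trials in which listeners label a machine-generated clip as human; the per-trial judgments below are hypothetical:

```python
import numpy as np

def human_fooling_rate(judged_human):
    """Fraction of synthetic-clip trials in which the listener said 'human'."""
    return np.asarray(judged_human, dtype=bool).mean()

# Hypothetical trials: True = listener mistook the TTS clip for a human recording.
trials = [True, False, True, True, False, True, True, True]
print(f"HFR = {human_fooling_rate(trials):.2f}")  # HFR = 0.75
```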
Related papers
- Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese [36.208204572097046]
We introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a Turing-test-inspired evaluation protocol. ATT asks evaluators to judge whether a voice sounds human. We also fine-tune Qwen2-Audio-Instruct on human judgment data, producing Auto-ATT for automatic evaluation.
arXiv Detail & Related papers (2025-05-16T12:57:23Z)
- Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation [12.954531089716008]
The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously. We show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems. We propose two refined variants of the MUSHRA test.
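For context, here is a minimal sketch of conventional MUSHRA aggregation (loosely following ITU-R BS.1534, not the paper's refined variants); the listener IDs and scores are hypothetical:

```python
import numpy as np

# Each listener rates every stimulus on a 0-100 scale; "ref" is the hidden reference.
scores = {
    "L1": {"ref": 98, "tts_a": 85, "tts_b": 62, "anchor": 20},
    "L2": {"ref": 72, "tts_a": 90, "tts_b": 70, "anchor": 35},  # fails post-screening
    "L3": {"ref": 95, "tts_a": 80, "tts_b": 58, "anchor": 25},
}

# Common post-screening: drop listeners who rate the hidden reference below 90.
kept = {name: s for name, s in scores.items() if s["ref"] >= 90}

for stimulus in ("ref", "tts_a", "tts_b", "anchor"):
    mean = np.mean([s[stimulus] for s in kept.values()])
    print(f"{stimulus}: {mean:.1f}")
```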
arXiv Detail & Related papers (2024-11-19T18:37:45Z)
- Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR [13.307889110301502]
We compare Denoising Diffusion Probabilistic Models (DDPM) with Mean Squared Error (MSE) based models for TTS when the synthesized speech is used for ASR model training.
We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models.
We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.
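As a hedged illustration of the headline metric, the sketch below assumes the ratio divides the WER of an ASR model trained on synthetic speech by that of one trained on real speech; the individual WER values are hypothetical, chosen only to reproduce the reported 1.46:

```python
# Hypothetical test-set WERs (not the paper's actual numbers).
wer_synthetic_trained = 0.073  # ASR trained on TTS-generated speech
wer_real_trained = 0.050       # ASR trained on real speech

ratio = wer_synthetic_trained / wer_real_trained
print(f"synthetic-to-real WER ratio: {ratio:.2f}")  # 1.46; parity would be 1.00
```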
arXiv Detail & Related papers (2024-10-16T06:35:56Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback [39.54647336161013]
We propose a sampling-annotating-learning framework tailored to text-to-speech (TTS) optimization.
We show that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity.
We also show that UNO can adapt seamlessly and flexibly to desired speaking styles in emotional TTS.
arXiv Detail & Related papers (2024-06-02T07:54:33Z)
- Beyond human subjectivity and error: a novel AI grading system [67.410870290301]
The grading of open-ended questions is a high-effort, high-impact task in education.
Recent breakthroughs in AI technology might facilitate such automation, but this has not been demonstrated at scale.
We introduce a novel automatic short answer grading (ASAG) system.
arXiv Detail & Related papers (2024-05-07T13:49:59Z)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a field developed in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for the generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
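As a minimal sketch of this evaluation convention (not the paper's own protocol code), CMOS averages paired comparative ratings on a -3 to +3 scale, where values near zero indicate parity with the human recordings; the ratings below are hypothetical:

```python
import numpy as np

# Hypothetical per-pair ratings: positive favors the synthetic clip over the human one.
ratings = np.array([0, 1, -1, 0, 1, -1, 0, 0, -1, 1, 0, -1])

cmos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"CMOS = {cmos:+.2f} ± {ci95:.2f}")  # a CMOS near 0.00 suggests human parity
```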
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data and therefore involves no human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)