EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
- URL: http://arxiv.org/abs/2505.23009v1
- Date: Thu, 29 May 2025 02:36:24 GMT
- Title: EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
- Authors: Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
- Abstract summary: We introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios. Our framework automates both test-case generation and evaluation, making the benchmark easily extensible. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval.
- Score: 25.51206687438354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation $\href{https://github.com/boson-ai/EmergentTTS-Eval-public}{code}$ and the $\href{https://huggingface.co/datasets/bosonai/EmergentTTS-Eval}{dataset}$.
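The abstract sketches a two-stage pipeline: LLM-driven expansion of human-written seed prompts into 1,645 test cases, then LALM-as-judge scoring of the synthesized audio across dimensions such as emotion, prosody, intonation, and pronunciation. As a rough illustration of how such a judge loop could be wired up, here is a minimal Python sketch; `TestCase`, `synthesize`, `judge`, and the pointwise 0-10 rubric are hypothetical stand-ins, since the abstract does not specify the exact scoring protocol (see the authors' open-sourced code linked above for the real implementation).

```python
# Minimal sketch of a model-as-a-judge TTS evaluation loop.
# NOTE: `synthesize` and `judge` are hypothetical stand-ins; wire them to a
# real TTS SDK and an audio-capable LLM API. The rubric wording and 0-10
# scale are assumptions, not the paper's actual protocol.
from dataclasses import dataclass

@dataclass
class TestCase:
    text: str       # the text to be spoken
    category: str   # one of the six scenarios, e.g. "emotions", "questions"

RUBRIC = (
    "Rate the speech from 0 to 10 on expressed emotion, prosody, "
    "intonation, and pronunciation accuracy. Reply with a single number."
)

def synthesize(tts_system: str, text: str) -> bytes:
    """Hypothetical TTS call: returns raw audio bytes for `text`."""
    raise NotImplementedError("wire this to your TTS provider's SDK")

def judge(audio: bytes, text: str, rubric: str) -> float:
    """Hypothetical LALM-as-judge call: scores `audio` against `text`."""
    raise NotImplementedError("wire this to an audio-capable LLM API")

def evaluate(tts_system: str, cases: list[TestCase]) -> float:
    """Mean judge score for one TTS system over all test cases."""
    scores = [judge(synthesize(tts_system, c.text), c.text, RUBRIC)
              for c in cases]
    return sum(scores) / len(scores)
```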
Related papers
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech [0.0]
NonverbalTTS (NVTTS) is a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators.
arXiv Detail & Related papers (2025-07-17T14:17:40Z)
- Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese [36.208204572097046]
We introduce the Audio Turing Test (ATT): a multi-dimensional Chinese dataset, ATT-Corpus, paired with a Turing-Test-inspired evaluation protocol. ATT asks evaluators to judge whether a voice sounds human. We also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT for automatic evaluation.
arXiv Detail & Related papers (2025-05-16T12:57:23Z)
- Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens [31.575335190916995]
We introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types. To facilitate research in controllable TTS, we also introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations.
arXiv Detail & Related papers (2025-03-03T16:23:10Z)
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z)
- RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis [84.57932472551889]
RALL-E is a robust language modeling method for text-to-speech synthesis.
RALL-E improves the WER of zero-shot TTS from 5.6% (without reranking) to 2.5% and 1.0%, respectively.
arXiv Detail & Related papers (2024-04-04T05:15:07Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data, then investigate injecting the generated text into the T-T model explicitly via text-to-speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study [44.07589545984369]
We propose a fully unsupervised method for building a TTS system, including automatic data selection and pre-training/fine-tuning strategies.
We show how careful selection of data, even in smaller amounts, can improve the efficiency of the TTS system.
Our objective evaluation shows a 3.9% character error rate (CER), while the ground truth has a 1.3% CER.
arXiv Detail & Related papers (2023-01-22T10:41:58Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- ESPnet2-TTS: Extending the Edge of TTS Research [62.92178873052468]
ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
arXiv Detail & Related papers (2021-10-15T03:27:45Z)