EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
- URL: http://arxiv.org/abs/2505.23009v1
- Date: Thu, 29 May 2025 02:36:24 GMT
- Title: EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
- Authors: Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
- Abstract summary: We introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios. Our framework automates both test-case generation and evaluation, making the benchmark easily extensible. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval.
- Score: 25.51206687438354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation $\href{https://github.com/boson-ai/EmergentTTS-Eval-public}{code}$ and the $\href{https://huggingface.co/datasets/bosonai/EmergentTTS-Eval}{dataset}$.
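The abstract sketches a two-stage pipeline: LLM-driven expansion of human-written seed prompts into 1,645 test cases, then LALM-as-judge scoring of the synthesized audio across dimensions such as emotion, prosody, intonation, and pronunciation. As a rough illustration of how such a judge loop could be wired up, here is a minimal Python sketch; `TestCase`, `synthesize`, `judge`, and the pointwise 0-10 rubric are hypothetical stand-ins, since the abstract does not specify the exact scoring protocol (see the authors' open-sourced code linked above for the real implementation).

```python
# Minimal sketch of a model-as-a-judge TTS evaluation loop.
# NOTE: `synthesize` and `judge` are hypothetical stand-ins; wire them to a
# real TTS SDK and an audio-capable LLM API. The rubric wording and 0-10
# scale are assumptions, not the paper's actual protocol.
from dataclasses import dataclass

@dataclass
class TestCase:
    text: str       # the text to be spoken
    category: str   # one of the six scenarios, e.g. "emotions", "questions"

RUBRIC = (
    "Rate the speech from 0 to 10 on expressed emotion, prosody, "
    "intonation, and pronunciation accuracy. Reply with a single number."
)

def synthesize(tts_system: str, text: str) -> bytes:
    """Hypothetical TTS call: returns raw audio bytes for `text`."""
    raise NotImplementedError("wire this to your TTS provider's SDK")

def judge(audio: bytes, text: str, rubric: str) -> float:
    """Hypothetical LALM-as-judge call: scores `audio` against `text`."""
    raise NotImplementedError("wire this to an audio-capable LLM API")

def evaluate(tts_system: str, cases: list[TestCase]) -> float:
    """Mean judge score for one TTS system over all test cases."""
    scores = [judge(synthesize(tts_system, c.text), c.text, RUBRIC)
              for c in cases]
    return sum(scores) / len(scores)
```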
Related papers
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech [0.0]
NonverbalTTS (NVTTS) is a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators.
arXiv Detail & Related papers (2025-07-17T14:17:40Z)
- Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese [36.208204572097046]
We introduce the Audio Turing Test (ATT): a multi-dimensional Chinese dataset, ATT-Corpus, paired with a Turing-Test-inspired evaluation protocol. ATT asks evaluators to judge whether a voice sounds human. We also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT for automatic evaluation.
arXiv Detail & Related papers (2025-05-16T12:57:23Z)
- Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens [31.575335190916995]
We introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types. To facilitate research in controllable TTS, we also introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations.
arXiv Detail & Related papers (2025-03-03T16:23:10Z)
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z)
- RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis [84.57932472551889]
RALL-E is a robust language modeling method for text-to-speech synthesis.
RALL-E improves the WER of zero-shot TTS from 5.6% (without reranking) to 2.5% and 1.0%, respectively.
arXiv Detail & Related papers (2024-04-04T05:15:07Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data, then investigate injecting the generated text into the T-T model explicitly via text-to-speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study [44.07589545984369]
We propose a fully unsupervised method for building a TTS system, including automatic data selection and pre-training/fine-tuning strategies.
We show how careful selection of data, even in smaller amounts, can improve the efficiency of the TTS system.
Our objective evaluation shows a 3.9% character error rate (CER), while the ground truth has a 1.3% CER.
arXiv Detail & Related papers (2023-01-22T10:41:58Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- ESPnet2-TTS: Extending the Edge of TTS Research [62.92178873052468]
ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
arXiv Detail & Related papers (2021-10-15T03:27:45Z)