VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents
- URL: http://arxiv.org/abs/2509.03940v1
- Date: Thu, 04 Sep 2025 07:03:46 GMT
- Title: VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents
- Authors: Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu,
- Abstract summary: We introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies.
- Score: 25.2769510663867
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.
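The two-stage construction pipeline described above (script-audio alignment, then LLM-based profile building) can be illustrated with a short sketch. Everything below is a hypothetical outline, not the authors' implementation: the function names, data structures, and placeholder timing are invented, and a real stage 1 would use ASR or forced alignment rather than the stub shown here.

```python
from dataclasses import dataclass


@dataclass
class AlignedLine:
    character: str    # speaker name taken from the script
    text: str         # script line matched to the audio
    start_sec: float  # start of the matched audio span
    end_sec: float    # end of the matched audio span


def align_audio_with_script(audio_path: str,
                            script_lines: list[tuple[str, str]]) -> list[AlignedLine]:
    """Stage 1 (stub): align movie audio against script lines.

    A real pipeline would transcribe the audio and match transcripts to
    script text; this stub only fabricates plausible time spans.
    """
    aligned, t = [], 0.0
    for character, text in script_lines:
        duration = max(1.0, 0.4 * len(text.split()))  # crude placeholder timing
        aligned.append(AlignedLine(character, text, t, t + duration))
        t += duration
    return aligned


def build_profile_prompt(character: str, lines: list[AlignedLine]) -> str:
    """Stage 2 (stub): format an LLM prompt asking for a multi-dimensional
    profile (personality, speaking style, background) from the character's
    aligned utterances."""
    utterances = "\n".join(f"- {l.text}" for l in lines if l.character == character)
    return (f"Given these utterances by {character}, write a structured profile "
            f"covering personality, speaking style, and background:\n{utterances}")


if __name__ == "__main__":
    script = [("EVA", "We leave at dawn."), ("MARK", "Then I should pack.")]
    spans = align_audio_with_script("movie.wav", script)
    print(build_profile_prompt("EVA", spans))
```

The dataclass mirrors the alignment output the abstract implies (speaker, text, time span); the profile dimensions in the prompt are examples of what the paper's "multi-dimensional profiles" might cover.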
Related papers
- Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play [68.54773980519457]
Speech-DRAME is a unified framework that contributes at three levels. It provides the first comprehensive, reproducible foundation for assessing spoken role-play.
arXiv Detail & Related papers (2025-11-03T06:12:40Z)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [52.29009595100625]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z)
- RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [111.06936588273868]
RMTBench is a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
arXiv Detail & Related papers (2025-07-27T16:49:47Z)
- Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent [18.67432557362308]
Test-Time-Matching (TTM) is a training-free role-playing framework based on test-time scaling and context engineering. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles, and even variations in personality and memory.
arXiv Detail & Related papers (2025-07-22T17:47:44Z)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences (a minimal sketch of this correlation check appears after this list).
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
- A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z)
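Several entries above validate automatic judges by rank correlation with human preferences (e.g., AudioJudge's reported 0.91 Spearman). As a minimal sketch of that check, assuming invented per-system scores from three specialized judges combined by an unweighted mean (only scipy.stats.spearmanr is a real API; all numbers are fabricated for illustration):

```python
from scipy.stats import spearmanr

# Per-system scores from three hypothetical specialized judges (0-10 scale).
judge_scores = {
    "lexical":        [7.2, 5.1, 8.4, 6.0, 4.3],
    "quality":        [6.8, 5.5, 8.0, 6.6, 4.9],
    "paralinguistic": [7.5, 4.8, 8.6, 5.9, 4.1],
}
human_preferences = [7.0, 5.0, 8.5, 6.2, 4.5]  # illustrative human ratings

# Unweighted ensemble: mean of the specialized judges for each system.
n_systems = len(human_preferences)
ensemble = [sum(scores[i] for scores in judge_scores.values()) / len(judge_scores)
            for i in range(n_systems)]

# Spearman's rho compares rankings, so it rewards a judge that orders
# systems the way humans do, regardless of score scale.
rho, p_value = spearmanr(ensemble, human_preferences)
print(f"Spearman rho vs. human preferences: {rho:.2f} (p = {p_value:.3f})")
```

Rank correlation is the natural choice here because judge scores and human ratings live on different scales; only the induced ordering of systems needs to agree.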
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.