SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
- URL: http://arxiv.org/abs/2508.02013v2
- Date: Tue, 05 Aug 2025 09:00:39 GMT
- Title: SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
- Authors: Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Hui Li, Xiaoran Fan, Ming Zhang, Junjie Ye, Shihan Dou, Zhiheng Xi, Jingqi Tong, Yilong Wu, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang,
- Abstract summary: Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
- Score: 52.29009595100625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPA performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
Related papers
- C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations [23.11314388159895]
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue.
arXiv Detail & Related papers (2025-07-30T17:56:23Z)
- Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z)
- OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction [123.89581506075461]
We propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction. Our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289 ms.
arXiv Detail & Related papers (2025-05-26T17:55:06Z)
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset of 4,398 speaker and reply-to annotations, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
- VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models [26.34810950257782]
We propose VocalBench, a benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. Experimental results reveal significant variability in current model capabilities, with each model exhibiting distinct strengths and weaknesses.
arXiv Detail & Related papers (2025-05-21T16:34:07Z)
- Speaker Verification in Agent-Generated Conversations [47.6291644653831]
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, enhancing their ability to perform both general and special-purpose dialogue tasks.
This study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker.
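The verification task described above, deciding whether two sets of utterances share a speaker, is commonly framed as comparing averaged utterance embeddings under a similarity threshold. The sketch below illustrates that general setup; the embedding vectors, the cosine-similarity scoring, and the threshold value are illustrative assumptions, not the method prescribed by the paper.

```python
import math

# Illustrative sketch of speaker verification over utterance embeddings.
# How utterances are embedded and the 0.8 threshold are assumptions.

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_embedding(embeddings: list[list[float]]) -> list[float]:
    """Average a set of utterance embeddings into one speaker profile."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

def same_speaker(embs_a: list[list[float]],
                 embs_b: list[list[float]],
                 threshold: float = 0.8) -> bool:
    """Verify whether two utterance sets likely share a speaker."""
    return cosine(mean_embedding(embs_a), mean_embedding(embs_b)) >= threshold
```

In practice the embeddings would come from a trained speaker or style encoder; the decision threshold is typically tuned on held-out verification pairs.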
arXiv Detail & Related papers (2024-05-16T14:46:18Z)
- SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [85.6641890712617]
Large language models (LLMs) have advanced the development of various AI conversational agents.
SocialBench is the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels.
We find that agents excelling at the individual level do not necessarily show proficiency at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of dialogue discourse characteristics.
We develop a structure-aware framework that integrates rich structural features for better modeling of the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
We propose a novel model that fills this gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)