MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
- URL: http://arxiv.org/abs/2508.18240v2
- Date: Mon, 15 Sep 2025 14:50:39 GMT
- Title: MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
- Authors: Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
- Abstract summary: MTalk-Bench is a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. Results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks to assess specific capabilities such as reasoning. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. The benchmark includes both model and human outputs, evaluated by human evaluators and LLMs. Experimental results reveal two sets of findings. Overall performance of S2S LLMs: (1) models excel at semantic information processing yet underperform on the perception of paralinguistic information and ambient sound; (2) models typically regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues; (3) modality-aware, task-specific designs outperform brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics yield consistent, complementary rankings, but reliable distinctions emerge only when performance gaps are large; (2) LLM-as-a-judge aligns with humans when gaps are clear or criteria explicit, but exhibits position and length biases and is reliable on nonverbal evaluation only with text annotations. These results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
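The dual protocol described above is straightforward to prototype. Below is a minimal sketch in Python of how Arena-style pairwise comparison and Rubrics-based absolute scoring could be wired together. The `judge` and `scorer` callables, all function names, and the order-swapping heuristic are illustrative assumptions, not MTalk-Bench's actual implementation; judging each pair twice with the presentation order swapped is a common mitigation for the position bias the abstract attributes to LLM judges.

```python
# Minimal sketch of a dual Arena + Rubrics evaluation loop.
# `judge` and `scorer` stand in for a human rater or an LLM-as-a-judge
# wrapper; all names here are illustrative assumptions.
from typing import Callable, Dict, List
import itertools

# (prompt, response_a, response_b) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]
# (prompt, response, rubric_items) -> integer score
Scorer = Callable[[str, str, List[str]], int]

def arena_eval(prompt: str, responses: Dict[str, str], judge: Judge) -> Dict[str, float]:
    """Arena-style evaluation: every model compared pairwise against every other.

    Each pair is judged twice with the order swapped, a simple guard
    against the position bias reported for LLM judges.
    """
    wins = {model: 0.0 for model in responses}
    for a, b in itertools.combinations(responses, 2):
        for first, second in [(a, b), (b, a)]:
            verdict = judge(prompt, responses[first], responses[second])
            if verdict == "A":
                wins[first] += 1
            elif verdict == "B":
                wins[second] += 1
            else:  # tie: half a win each
                wins[first] += 0.5
                wins[second] += 0.5
    return wins

def rubric_eval(prompt: str, responses: Dict[str, str],
                scorer: Scorer, rubric: List[str]) -> Dict[str, int]:
    """Rubrics-based evaluation: each response scored independently."""
    return {model: scorer(prompt, resp, rubric) for model, resp in responses.items()}
```

Arena win counts yield a relative ranking, while rubric scores are absolute; per the abstract, the two views are consistent and complementary, with reliable distinctions emerging only when performance gaps are large.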
Related papers
- Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play [68.54773980519457]
Speech-DRAME is a unified framework that contributes at three levels. It provides the first comprehensive, reproducible foundation for assessing spoken role-play.
arXiv Detail & Related papers (2025-11-03T06:12:40Z)
- Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data [46.12417789276609]
Speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning.
arXiv Detail & Related papers (2025-09-20T09:26:40Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement [74.51476422119457]
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLM evaluation.
arXiv Detail & Related papers (2025-08-28T15:47:37Z)
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging spoken dialogue models (SDMs).
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models [8.882948576463244]
We propose URO-Bench, an extensive benchmark for speech-to-speech (S2S) scenarios. URO-Bench is the first S2S benchmark that covers evaluation of multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, each comprising 20 test sets.
arXiv Detail & Related papers (2025-02-25T03:31:48Z)
- VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark (1) maintains the speech format for both inputs and outputs, (2) evaluates model robustness across diverse input audio conditions, and (3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation; one way to quantify such judge-human agreement is sketched after this list.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that provides comprehensive evaluation by comparing generated text and reference text on both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
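Several entries above, AIR-Bench in particular, as well as the MTalk-Bench abstract itself, hinge on how well an LLM judge agrees with human raters. A standard way to quantify that agreement is rank correlation over per-model scores. The sketch below implements Spearman's rho from scratch; the score values are made-up placeholders, not numbers from any of these papers.

```python
# Illustrative judge-human agreement check via Spearman rank correlation,
# the kind of consistency analysis AIR-Bench reports between GPT-4-based
# and human evaluation. All scores below are made-up placeholders.
from typing import List

def rank(values: List[float]) -> List[float]:
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over the tied block
        avg = (i + j) / 2 + 1  # mean 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Placeholder per-model scores from an LLM judge and from human raters.
llm_scores = [4.1, 3.2, 4.8, 2.9, 3.7]
human_scores = [4.0, 3.9, 4.6, 2.7, 3.5]
print(f"judge-human Spearman rho: {spearman(llm_scores, human_scores):.3f}")
```

A rho near 1 means the judge ranks models much as humans do; consistent with MTalk-Bench's findings, such agreement tends to hold only when performance gaps are clear or the criteria are explicit.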