MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
- URL: http://arxiv.org/abs/2511.10262v1
- Date: Fri, 14 Nov 2025 01:42:18 GMT
- Title: MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
- Authors: He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Shaohua Ma, Irwin King
- Abstract summary: MTR-DuplexBench is a novel benchmark for evaluating FD-SLMs in multi-round settings. We show that MTR-DuplexBench provides comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety.
- Score: 48.34642579013783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
Related papers
- ChatUMM: Robust Context Tracking for Conversational Interleaved Generation [44.19929499646892]
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm. We present ChatUMM, a conversational unified model that excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow.
arXiv Detail & Related papers (2026-02-06T07:11:50Z) - Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction enables real-time spoken dialogue systems, yet modeling and benchmarking these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models [12.741006204459637]
Full-duplex voice communication enables simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs). We identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps.
arXiv Detail & Related papers (2025-09-18T01:00:58Z) - F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model [18.21432204057241]
Fine-grained Fragment Retrieval (FFR) is a task requiring models to locate query-relevant fragments. F2RVLM is a generative retrieval model trained in a two-stage paradigm. It outperforms popular Vision-Language Models in both in-domain and real-domain settings.
arXiv Detail & Related papers (2025-08-25T06:42:47Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.