FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
- URL: http://arxiv.org/abs/2509.22243v1
- Date: Fri, 26 Sep 2025 11:57:42 GMT
- Title: FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
- Authors: Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, Jingbo Zhu
- Abstract summary: Full-duplex speech interaction enables real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction.
- Score: 49.83226596963294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open source and commercial models in emergency awareness, turn terminating, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
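The abstract's closing suggestion, next token-pair prediction, can be pictured as a decoder that advances a listening channel and a speaking channel in lockstep, emitting one (user, agent) token pair per step so the agent can stay silent, begin speaking, or be interrupted at any step without explicit turn-taking. The sketch below is a minimal toy simulation of that idea, not the paper's implementation; `stub_policy`, the `<sil>` silence token, and all token names are illustrative assumptions.

```python
# Toy sketch of "next token-pair prediction" for full-duplex dialogue.
# Each decoding step yields a pair (user_channel, agent_channel); the agent
# channel may hold silence tokens while the user speaks, so overlap and
# interruption need no explicit turn boundary.

SILENCE = "<sil>"  # assumed silence token (illustrative)

def decode_pairs(user_stream, agent_policy):
    """Advance both channels in lockstep, one token pair per step."""
    history = []
    for user_tok in user_stream:
        # In a real model this would be a forward pass conditioned on
        # the interleaved pair history; here it is a stub policy.
        agent_tok = agent_policy(history, user_tok)
        history.append((user_tok, agent_tok))
    return history

def stub_policy(history, user_tok):
    """Stay silent until the user says 'help', then keep responding."""
    if user_tok == "help" or (history and history[-1][1] != SILENCE):
        return "on-my-way"
    return SILENCE

pairs = decode_pairs(["hello", "I", "need", "help", "now"], stub_policy)
# The agent channel stays silent for the first three steps, then speaks
# from the step where "help" arrives onward.
```

Because both channels advance together, emergency interruption (a FLEXI scenario) reduces to the policy changing its output mid-stream rather than waiting for a turn boundary.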
Related papers
- The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era [95.35748535806744]
We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026. This paper summarizes the dataset, track configurations, and the final results.
arXiv Detail & Related papers (2026-01-09T06:32:30Z)
- MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models [48.34642579013783]
MTR-DuplexBench is a novel benchmark for evaluating FDSLMs in multi-round settings. We show that MTR-DuplexBench provides comprehensive, turn-by-turn evaluation of FDSLMs across dialogue quality, conversational dynamics, instruction following, and safety.
arXiv Detail & Related papers (2025-11-13T12:50:04Z)
- End-to-end Listen, Look, Speak and Act [22.047534228540783]
ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial intelligence. At its core is a novel SA-MoE architecture that routes each modality to specialized experts and fuses them through a unified attention backbone.
arXiv Detail & Related papers (2025-10-19T08:45:46Z)
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging-app dialogues. We compare emotional intelligence (EI) attributes and persona consistency to understand the challenges posed by real-world dialogues. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
- Language Model Can Listen While Speaking [17.584201137311286]
The listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.