FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems
- URL: http://arxiv.org/abs/2507.19040v1
- Date: Fri, 25 Jul 2025 07:51:22 GMT
- Title: FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems
- Authors: Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng
- Abstract summary: Existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. This paper assesses FDSDS's ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions.
- Score: 25.6510200528785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS's ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.
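The benchmarking loop the abstract describes (scripted conversations, injected interruptions, checking whether the model yields) can be illustrated with a minimal sketch. This is hypothetical scaffolding, not FD-Bench's actual code: the `stub_system` endpoint and its `yields_on_interrupt` flag are stand-ins for a real FDSDS API, and the yield-rate statistic is just one example of an interruption-handling metric.

```python
import random

def simulate_interruption_test(system_respond, n_turns=5, interrupt_prob=0.5, seed=0):
    """Drive a dialogue-system stub with scripted user turns, inject
    interruptions at random points, and count how often the system yields
    (stops speaking) when interrupted."""
    rng = random.Random(seed)  # seeded so simulated runs are reproducible
    stats = {"interruptions": 0, "yielded": 0}
    for turn in range(n_turns):
        reply = system_respond(f"user utterance {turn}")
        if rng.random() < interrupt_prob:
            stats["interruptions"] += 1
            # A full-duplex system should stop its output stream on barge-in;
            # the stub simply reports whether it would yield.
            if reply.get("yields_on_interrupt", False):
                stats["yielded"] += 1
    stats["yield_rate"] = (
        stats["yielded"] / stats["interruptions"] if stats["interruptions"] else None
    )
    return stats

# Trivial stand-in for a real FDSDS endpoint (e.g., Moshi, Freeze-omni, VITA-1.5):
def stub_system(utterance):
    return {"text": f"reply to: {utterance}", "yields_on_interrupt": True}

print(simulate_interruption_test(stub_system))
```

In the real pipeline the scripted turns would come from an LLM, be rendered to audio by TTS, and the system's spoken replies would be transcribed by ASR before scoring; this sketch keeps only the interruption-injection and scoring skeleton.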
Related papers
- Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty [37.15356899831919]
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. We propose a novel neuro-inspired non-blocking inference paradigm that employs adaptive temporal windows of integration. Our framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff.
arXiv Detail & Related papers (2025-11-20T10:48:54Z) - MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models [48.34642579013783]
MTR-DuplexBench is a novel benchmark for evaluating FDSLMs in multi-round settings. We show that MTR-DuplexBench provides comprehensive, turn-by-turn evaluation of FDSLMs across dialogue quality, conversational dynamics, instruction following, and safety.
arXiv Detail & Related papers (2025-11-13T12:50:04Z) - One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z) - Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems [82.70507055599093]
We propose a Streaming Chain-of-Thought (CoT) framework for duplex SDS. We create intermediate targets, aligned user transcripts and system responses, for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods.
arXiv Detail & Related papers (2025-10-02T14:33:05Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction enables real-time spoken dialogue systems. Modeling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM speech interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z) - SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation [17.56310064245171]
SALMONN-omni is the first single, standalone full-duplex speech LLM that operates without codec injection. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening. SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in.
arXiv Detail & Related papers (2025-05-17T08:13:59Z) - DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z) - Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models [16.920823078873095]
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword.
We show on the real-world dataset of follow-up conversations that this approach yields large gains due to the joint modeling of the previous speech context and ASR uncertainty.
arXiv Detail & Related papers (2024-10-28T19:43:43Z) - Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and to provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses, covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z) - DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation [12.734839065028547]
This paper proposes a real-time cross-attention deep model named DeepVQE, based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We conduct ablation studies to analyze the contributions of different components of our model to the overall performance.
DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle multiple tasks with excellent performance.
arXiv Detail & Related papers (2023-06-05T18:37:05Z) - Diffusion Recommender Model [85.9640416600725]
We propose a novel Diffusion Recommender Model (named DiffRec) to learn the generative process in a denoising manner. To retain personalized information in user interactions, DiffRec reduces the added noises and avoids corrupting users' interactions into pure noises as in image synthesis.
arXiv Detail & Related papers (2023-04-11T04:31:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.