Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
- URL: http://arxiv.org/abs/2503.01174v1
- Date: Mon, 03 Mar 2025 04:46:04 GMT
- Title: Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
- Authors: Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe
- Abstract summary: We propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities. We present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.
- Score: 54.03209351287654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this, we propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: the systems sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events, and identify significant room for improvement. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.
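The core of the proposed protocol, a supervised judge trained on human-human conversations to label turn-taking events, can be illustrated with a minimal sketch. This is a hedged illustration, not the authors' released code: the label set in TURN_EVENTS, the GRU encoder, and the feature dimensions are all assumed choices for exposition.

```python
# Hedged sketch of a judge-style turn-taking evaluator (the label set,
# feature dims, and architecture are illustrative assumptions, not the
# paper's actual model).
import torch
import torch.nn as nn

TURN_EVENTS = ["hold", "turn_shift", "backchannel", "interruption"]  # hypothetical labels

class TurnTakingJudge(nn.Module):
    """Frame-level turn-taking event classifier over acoustic features."""
    def __init__(self, feat_dim: int = 80, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(TURN_EVENTS))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim), e.g. log-mel frames of a dialogue
        out, _ = self.encoder(feats)
        return self.head(out)  # (batch, frames, num_events)

def event_rates(logits: torch.Tensor) -> torch.Tensor:
    """Per-event frame counts; a system's rates can be compared against
    human-human reference statistics to score its turn-taking behavior."""
    preds = logits.argmax(dim=-1).flatten()
    return preds.bincount(minlength=len(TURN_EVENTS)).float()

judge = TurnTakingJudge()         # would be trained on human-human dialogs
dialog = torch.randn(1, 500, 80)  # placeholder for real system-user audio features
print(event_rates(judge(dialog)))
```

Under this reading of the protocol, the judge is trained on labeled human-human conversations (e.g., Switchboard) and then applied to recordings of system-user interactions; a system that rarely triggers the backchannel class, or over-triggers the interruption class, would score poorly against human statistics.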
Related papers
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key conversational behaviors.
We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.
This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms before the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion [38.78341787348164]
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM); a minimal illustrative sketch of this fusion idea appears after this list.
Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models.
arXiv Detail & Related papers (2024-01-26T08:59:07Z)
- AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities.
The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
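As referenced in the acoustic-LLM fusion entry above, one plausible layout for combining per-frame acoustic embeddings with LLM hidden states for turn-taking and backchannel prediction is a simple late-fusion classifier. This is a hedged sketch under assumed dimensions and an assumed three-class label set; the paper's actual fusion mechanism may differ.

```python
# Illustrative acoustic + LLM late fusion for frame-level turn-taking
# prediction (all dimensions and the label set are assumptions).
import torch
import torch.nn as nn

class AcousticLLMFusion(nn.Module):
    def __init__(self, acoustic_dim=256, llm_dim=768, hidden=256, n_classes=3):
        super().__init__()
        # n_classes: e.g. {continue, turn-shift, backchannel} (assumed)
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)
        self.llm_proj = nn.Linear(llm_dim, hidden)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, n_classes))

    def forward(self, acoustic_emb: torch.Tensor, llm_emb: torch.Tensor) -> torch.Tensor:
        # Project each modality, concatenate per frame, then classify.
        fused = torch.cat(
            [self.acoustic_proj(acoustic_emb), self.llm_proj(llm_emb)], dim=-1
        )
        return self.classifier(fused)  # (batch, frames, n_classes) event logits

model = AcousticLLMFusion()
logits = model(torch.randn(1, 100, 256), torch.randn(1, 100, 768))
print(logits.shape)  # torch.Size([1, 100, 3])
```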