Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
- URL: http://arxiv.org/abs/2409.04927v3
- Date: Wed, 2 Oct 2024 07:58:56 GMT
- Title: Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
- Authors: Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf
- Abstract summary: SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao.
We show that SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound.
We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.
- Score: 41.10328851671422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.
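The evaluation described in the abstract reduces to a simple comparison: score the model separately on context-based questions (answerable from the transcript alone) and identity-critical questions (requiring correct speaker attribution), then compare the two accuracies. The sketch below illustrates that split-accuracy computation under assumed data and model interfaces; the `Example` fields and the `speechllm_answer` callable are hypothetical placeholders, not the paper's actual code or dataset format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Example:
    audio_path: str          # spoken dialogue clip (hypothetical field)
    question: str            # multiple-choice question text
    choices: list[str]       # answer options
    answer: str              # gold answer label, e.g. "B"
    identity_critical: bool  # True if correct speaker attribution is required

def split_accuracy(examples: Iterable[Example],
                   speechllm_answer: Callable[[Example], str]) -> dict[str, float]:
    """Score a SpeechLLM separately on context-based vs. identity-critical questions."""
    correct = {"context_based": 0, "identity_critical": 0}
    total = {"context_based": 0, "identity_critical": 0}
    for ex in examples:
        bucket = "identity_critical" if ex.identity_critical else "context_based"
        total[bucket] += 1
        if speechllm_answer(ex).strip() == ex.answer:
            correct[bucket] += 1
    return {k: correct[k] / total[k] for k in total if total[k] > 0}

# A large gap between the two accuracies suggests the model answers from
# transcript-level context rather than from genuine speaker awareness.
```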
Related papers
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response: a sequence of facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Question-Interlocutor Scope Realized Graph Modeling over Key Utterances for Dialogue Reading Comprehension [61.55950233402972]
We propose a new key-utterance extraction method for dialogue reading comprehension.
It makes predictions over units formed by several contiguous utterances, which allows it to capture more answer-containing utterances.
We then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling, a graph constructed over the text of the utterances.
arXiv Detail & Related papers (2022-10-26T04:00:42Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions from audio recordings, and to explore whether cues from additional modalities can help systems gather information.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Self-supervised Dialogue Learning for Spoken Conversational Question Answering [29.545937716796082]
In spoken conversational question answering (SCQA), the answer to the corresponding question is generated by retrieving and then analyzing a fixed spoken document, including multi-part conversations.
We introduce a self-supervised learning approach, including incoherence discrimination, insertion detection, and question prediction, to explicitly capture the coreference resolution and dialogue coherence.
Our proposed method provides more coherent, meaningful, and appropriate responses, yielding superior performance gains compared to the original pre-trained language models.
arXiv Detail & Related papers (2021-06-04T00:09:38Z)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
- Towards Data Distillation for End-to-end Spoken Conversational Question Answering [65.124088336738]
We propose a new Spoken Conversational Question Answering task (SCQA).
SCQA aims at enabling QA systems to model complex dialogue flows given speech utterances and text corpora.
Our main objective is to build a QA system to deal with conversational questions both in spoken and text forms.
arXiv Detail & Related papers (2020-10-18T05:53:39Z)