Related papers: SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

URL: http://arxiv.org/abs/2601.04029v1
Date: Wed, 07 Jan 2026 15:45:41 GMT
Title: SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency
Authors: Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo,
Abstract summary: We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues.<n>We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech.<n>We find that models struggle to reliably detect acoustic inconsistencies.
Score: 12.420484491347073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.

Related papers

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception [142.4692205981783]
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception.<n>Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance.<n>To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable primitive proves effective in preventing the model from being derailed by flawed external candidates.
arXiv Detail & Related papers (2026-01-14T12:06:50Z)
Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction [12.216811577733125]
We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns.<n>We introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking.<n>We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-12-16T19:26:44Z)
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges.<n>We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking.<n>We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.<n>These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
Speaker Verification in Agent-Generated Conversations [47.6291644653831]
The recent success of large language models (LLMs) has attracted widespread interest to develop role-playing conversational agents personalized to the characteristics and styles of different speakers to enhance their abilities to perform both general and special purpose dialogue tasks. This study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aimed to verify whether two sets of utterances originate from the same speaker.
arXiv Detail & Related papers (2024-05-16T14:46:18Z)
Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting EEs for speaker diarisation. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and significant correlation up to 0.78 Pearson score at the utterance level.
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that adopts the adversarial training method to a non-autoregressive multi-speaker TTS model. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and proposed to integrate pretrained and learnable speaker representations. The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.