Related papers: Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation

URL: http://arxiv.org/abs/2601.13742v2
Date: Sat, 24 Jan 2026 16:56:49 GMT
Title: Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
Authors: Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama
Abstract summary: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. We propose TRACE, a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
Score: 19.92868268408954
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.

Related papers

SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement [74.51476422119457]
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs.
arXiv Detail & Related papers (2025-08-28T15:47:37Z)
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols [46.82669096251444]
MTalk-Bench is a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. Results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
arXiv Detail & Related papers (2025-08-22T12:14:17Z)
SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new form of human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z)
Audio-Aware Large Language Models as Judges for Speaking Styles [123.36224336701237]
We explore using audio-aware large language models (ALLMs) as an automatic judge to assess the speaking styles of speeches. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. Our results show that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
arXiv Detail & Related papers (2025-06-06T11:05:48Z)
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z)
S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models [14.060679420379516]
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. This often leads to a decline in reasoning and generation performance compared to text input. We propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs.
arXiv Detail & Related papers (2025-05-20T14:42:20Z)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format. Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.