Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
- URL: http://arxiv.org/abs/2601.13742v2
- Date: Sat, 24 Jan 2026 16:56:49 GMT
- Title: Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
- Authors: Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama
- Abstract summary: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. We propose TRACE, a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
- Score: 19.92868268408954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
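The abstract says dimension-wise judgments are fused into an overall rating via a deterministic policy, but it does not state the policy itself. The following is a minimal sketch under loud assumptions: each dimension (C, VQ, P) is scored on a hypothetical 1-5 scale, and the hypothetical rule caps the overall rating at the content score. The names `DimensionScores` and `fuse` are illustrative and are not taken from the TRACE framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScores:
    """Per-dimension judgments; the 1-5 scale is an assumption."""
    content: int          # C
    voice_quality: int    # VQ
    paralinguistics: int  # P


def fuse(scores: DimensionScores) -> int:
    """Hypothetical deterministic fusion rule (not the paper's policy).

    The overall rating is the floored mean of the three dimensions,
    capped by the content score, so good delivery cannot outweigh
    poor content.
    """
    values = (scores.content, scores.voice_quality, scores.paralinguistics)
    if any(v < 1 or v > 5 for v in values):
        raise ValueError("each dimension score must be in 1..5")
    floored_mean = sum(values) // len(values)
    return min(scores.content, floored_mean)


if __name__ == "__main__":
    print(fuse(DimensionScores(content=5, voice_quality=3, paralinguistics=4)))
    print(fuse(DimensionScores(content=2, voice_quality=5, paralinguistics=5)))
```

Because the rule is a pure function of the three scores, the same dimension-wise judgments always yield the same overall rating, which is the property the abstract attributes to a deterministic policy.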
Related papers
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement [74.51476422119457]
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs.
arXiv Detail & Related papers (2025-08-28T15:47:37Z)
- MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols [46.82669096251444]
MTalk-Bench is a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. Results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
arXiv Detail & Related papers (2025-08-22T12:14:17Z)
- SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new form of human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z)
- Audio-Aware Large Language Models as Judges for Speaking Styles [123.36224336701237]
We explore using audio-aware large language models (ALLMs) as an automatic judge to assess the speaking styles of speeches. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. Our results show that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
arXiv Detail & Related papers (2025-06-06T11:05:48Z)
- Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z)
- S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models [14.060679420379516]
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. This often leads to a decline in reasoning and generation performance compared to text input. We propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs.
arXiv Detail & Related papers (2025-05-20T14:42:20Z)
- Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.