Pardon? Evaluating Conversational Repair in Large Audio-Language Models
- URL: http://arxiv.org/abs/2601.12973v1
- Date: Mon, 19 Jan 2026 11:36:27 GMT
- Title: Pardon? Evaluating Conversational Repair in Large Audio-Language Models
- Authors: Shuanghong Huang, Jinlei Xu, Youchao Zhou, Yanghao Zhou, Xuan Zhao, Chong Feng, Wenxuan Zhang
- Abstract summary: We introduce a repair-aware evaluation setting that distinguishes between answerable and unanswerable audio inputs. We propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions.
- Score: 15.682992943165994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
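The abstract characterizes EAR as non-compensatory: high accuracy on answerable inputs cannot offset a failure to initiate repair on unanswerable ones. The paper's exact formula is not given here, but a minimal sketch of one such aggregation (a hypothetical min-based combination; the function name, component names, and thresholds are illustrative, not the authors' definition) shows why this property matters:

```python
def ear_score(task_competence: float, repair_rate: float) -> float:
    """Hypothetical non-compensatory aggregation of two EAR-style components.

    task_competence: accuracy on answerable audio inputs, in [0, 1].
    repair_rate: fraction of unanswerable inputs on which the model
        initiates conversational repair (e.g. asks for clarification),
        in [0, 1].

    A min-based combination is non-compensatory: a high value on one
    axis cannot compensate for a low value on the other. The actual
    EAR formula in the paper may differ.
    """
    for v in (task_competence, repair_rate):
        if not 0.0 <= v <= 1.0:
            raise ValueError("components must lie in [0, 1]")
    return min(task_competence, repair_rate)

# A model with 0.9 accuracy but only a 0.2 repair rate scores 0.2,
# exposing the reliability gap that a plain average (0.55) would hide.
print(ear_score(0.9, 0.2))
```

This illustrates the paper's central finding: under an accuracy-centric (compensatory) metric, a model that never repairs can still look strong, whereas a non-compensatory score is bounded by its weakest behavior.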
Related papers
- When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [8.391356566325054]
Large language models (LLMs) often respond even when prompts omit critical details or include misleading information. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints.
arXiv Detail & Related papers (2026-02-04T02:21:01Z) - AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering [58.04745279785462]
AQUA-Bench is a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection. By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability.
arXiv Detail & Related papers (2026-01-18T03:55:28Z) - AEQ-Bench: Measuring Empathy of Omni-Modal Large Models [55.722881748046895]
We introduce AEQ-Bench, a novel benchmark to assess two core empathetic capabilities of omni-modal large models (OLMs). AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that OLMs trained with audio output capabilities generally outperform models with text-only outputs.
arXiv Detail & Related papers (2026-01-15T15:39:50Z) - CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering [9.50840225852638]
Conditional Ambiguous Question-Answering (CondAmbigQA) is a benchmark comprising 2,000 ambiguous queries and condition-aware evaluation metrics. Experiments demonstrate that models considering conditions before answering improve answer accuracy by 11.75%, with an additional 7.15% gain when conditions are explicitly provided.
arXiv Detail & Related papers (2025-02-03T17:01:51Z) - VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark 1) maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z) - NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries [16.283468528293568]
We introduce the NoisyEQA benchmark, designed to evaluate an agent's ability to recognize and correct noisy questions. The benchmark covers four common types of noise found in real-world applications: Latent Hallucination Noise, Memory Noise, Perception Noise, and Semantic Noise. We also propose a 'Self-Correction' prompting mechanism and a new evaluation metric to enhance and measure both noise detection capability and answer quality.
arXiv Detail & Related papers (2024-12-14T07:52:24Z) - Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
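The entailment-based evaluation above assigns bonus or partial marks from the inference gap between a system answer and the gold answer. A minimal sketch of one such marking rule, assuming (hypothetically) that an external NLI model has already judged entailment in each direction; the function name and credit values are illustrative, not the paper's actual scheme:

```python
def entailment_mark(gold_entails_sys: bool, sys_entails_gold: bool) -> float:
    """Hypothetical marking scheme driven by entailment direction.

    gold_entails_sys: the gold answer entails the system answer
        (system answer is at least as general as the gold).
    sys_entails_gold: the system answer entails the gold answer
        (system answer is at least as specific as the gold).

    Credit values are illustrative; the paper's scheme may differ.
    """
    if gold_entails_sys and sys_entails_gold:
        return 1.0   # bidirectional entailment: semantically equivalent
    if sys_entails_gold:
        return 1.0   # more informative answer that still entails gold
    if gold_entails_sys:
        return 0.5   # more general answer: partial mark
    return 0.0       # no entailment in either direction: no credit
```

Compared with exact-match scoring, this kind of rule avoids penalizing answers that are correct but phrased more generally or more specifically than the reference.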
arXiv Detail & Related papers (2024-05-26T21:33:27Z) - Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z) - Improving Factual Consistency Between a Response and Persona Facts [64.30785349238619]
Neural models for response generation produce responses that are semantically plausible but not necessarily factually consistent with facts describing the speaker's persona.
We propose to fine-tune these models by reinforcement learning and an efficient reward function that explicitly captures the consistency between a response and persona facts as well as semantic plausibility.
arXiv Detail & Related papers (2020-04-30T18:08:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.