AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
- URL: http://arxiv.org/abs/2601.12248v1
- Date: Sun, 18 Jan 2026 03:55:28 GMT
- Title: AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
- Authors: Chun-Yi Kuan, Hung-yi Lee
- Abstract summary: AQUA-Bench is a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection. By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability.
- Score: 58.04745279785462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information in the audio. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.
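To make the three scenarios concrete, the sketch below shows one way such an evaluation could be scored: each multiple-choice item is extended with an explicit abstention option, and for the three unanswerable scenarios the only correct behavior is to select it. The item fields, the scenario names, and the `model.answer` interface are hypothetical placeholders, not the AQUA-Bench API.

```python
# Hedged sketch of an unanswerability-aware evaluation loop.
# `model.answer(audio, question, options)` stands in for any
# audio-language model interface; it is NOT the AQUA-Bench API.
from collections import defaultdict

UNANSWERABLE = "None of the above can be inferred from the audio."

def evaluate(model, items):
    """items: dicts with 'audio', 'question', 'options', 'label', and
    'scenario' in {'answerable', 'absent_answer',
    'incompatible_answer_set', 'incompatible_audio_question'}."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        options = item["options"] + [UNANSWERABLE]
        pred = model.answer(item["audio"], item["question"], options)
        # Unanswerable items count as correct only if the model picks
        # the abstention option instead of guessing among the choices.
        gold = item["label"] if item["scenario"] == "answerable" else UNANSWERABLE
        correct[item["scenario"]] += int(pred == gold)
        total[item["scenario"]] += 1
    return {s: correct[s] / total[s] for s in total}
```

Reporting per-scenario accuracy rather than one pooled number keeps the answerable and unanswerable cases from masking each other, which is exactly the blind spot the benchmark is designed to expose.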
Related papers
- A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models [3.770636357625305]
The present benchmarks for testing the audio modality of multimodal large language models concentrate on testing various audio tasks, such as speaker diarization or gender identification, in isolation. We propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over the audio signal.
arXiv Detail & Related papers (2026-01-27T14:54:10Z) - Pardon? Evaluating Conversational Repair in Large Audio-Language Models [15.682992943165994]
We introduce a repair-aware evaluation setting that distinguishes between answerable and unanswerable audio inputs. We propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions.
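The abstract does not spell out the EAR formula, but "non-compensatory" has a standard reading: strength on one component cannot offset weakness on the other. A minimal sketch under that assumption follows; it illustrates the property, not the authors' definition.

```python
def ear_like_score(acc_answerable: float, repair_rate: float) -> float:
    """Hypothetical non-compensatory aggregate: the weaker of task
    accuracy (answerable items) and repair rate (unanswerable items)
    bounds the overall score, so neither can mask the other."""
    return min(acc_answerable, repair_rate)
```

Under this reading, a model with 0.95 accuracy but a 0.10 repair rate scores 0.10, whereas a simple average would report a misleading 0.525.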
arXiv Detail & Related papers (2026-01-19T11:36:27Z) - MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark [64.89810922949984]
We introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks.
arXiv Detail & Related papers (2025-09-26T15:12:46Z) - Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z) - Audiopedia: Audio QA with Knowledge [0.0]
We introduce Audiopedia, a novel task called Audio Question Answering with Knowledge. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We benchmark large audio language models (LALMs) on its sub-tasks and observe suboptimal performance. We propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities.
arXiv Detail & Related papers (2024-12-29T23:48:35Z) - Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
Our self-alignment method is capable not only of refusing to answer but also of explaining why unknown questions are unanswerable.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself, aligning its responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z) - Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer? [43.03399918557937]
In real-world applications, users often ask questions that don't have a definitive answer.
We introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers.
With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions.
We show that even SOTA models, including GPT-3 and Flan T5, do not fare well on these tasks and lag considerably behind the human performance baseline.
arXiv Detail & Related papers (2023-09-08T23:12:03Z) - Selectively Answering Ambiguous Questions [38.83930394700588]
We find that the most reliable approach for deciding when to abstain involves quantifying repetition within sampled model outputs.
Our results suggest that sampling-based confidence scores help calibrate answers to relatively unambiguous questions.
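The abstract describes the mechanism only at a high level; a minimal sketch of repetition-based abstention along those lines follows, where `sample_answer` is a hypothetical stand-in for one stochastic decoding call and the 0.6 threshold is an arbitrary illustrative value.

```python
from collections import Counter

def answer_or_abstain(sample_answer, question, n_samples=10, threshold=0.6):
    """Sample several answers and commit only if the most frequent one
    accounts for at least `threshold` of the samples; otherwise
    abstain (return None)."""
    samples = [sample_answer(question) for _ in range(n_samples)]
    top_answer, top_count = Counter(samples).most_common(1)[0]
    return top_answer if top_count / n_samples >= threshold else None
```

The fraction `top_count / n_samples` serves as the repetition-based confidence score; ambiguous questions tend to scatter the samples across answers and drive it down.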
arXiv Detail & Related papers (2023-05-24T01:25:38Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent audio-only, visual-only, and audio-visual QA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - NoiseQA: Challenge Set Evaluation for User-Centric Question Answering [68.67783808426292]
We show that components in the pipeline that precede an answering engine can introduce varied and considerable sources of error.
We conclude that there is substantial room for progress before QA systems can be effectively deployed.
arXiv Detail & Related papers (2021-02-16T18:35:29Z)