AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
- URL: http://arxiv.org/abs/2601.14728v1
- Date: Wed, 21 Jan 2026 07:35:36 GMT
- Authors: Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee
- Abstract summary: We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of underlying ALLMs.
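The scoring step the abstract describes can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' released implementation: the HuggingFace-style `processor`/`model` interface, the single-token treatment of "Yes", and the averaging over queries are all assumptions made for exposition.

```python
# Minimal sketch of AQAScore-style scoring: estimate semantic alignment
# as the probability of a "Yes" answer to a targeted yes/no query.
# Assumptions (not from the paper's code): a HuggingFace-style audio-aware
# LLM whose forward pass returns next-token logits, and a tokenizer in
# which "Yes" maps to a single vocabulary token.
import math
import torch
import torch.nn.functional as F

def yes_logprob(model, processor, audio, sr, question):
    """Return log P("Yes" | audio, question) under the ALLM."""
    # A targeted semantic query, e.g.
    # "Is a dog barking in this audio? Answer Yes or No."
    inputs = processor(text=question, audios=audio,
                       sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits[0, -1], dim=-1)  # next-token distribution
    # NOTE: some tokenizers prefix answer tokens (e.g. "▁Yes"); adjust per model.
    yes_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    return log_probs[yes_id].item()

def aqa_score(model, processor, audio, sr, questions):
    """Aggregate over queries derived from the text prompt; averaging the
    Yes-probabilities is one plausible choice, used here for illustration."""
    return sum(math.exp(yes_logprob(model, processor, audio, sr, q))
               for q in questions) / len(questions)
```

Reading the exact log-probability of a single verification token, instead of parsing free-form generated text, is what makes the score deterministic and directly comparable across backbones; the abstract does not specify how queries are derived from the caption or how they are aggregated, so the averaging above is only illustrative.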
Related papers
- SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
We introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and exact keyword matching. It is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
arXiv Detail & Related papers (2025-11-21T17:30:18Z)
- AURA Score: A Metric For Holistic Audio Question Answering Evaluation
First, we introduce AQEval to enable systematic benchmarking of AQA metrics; it is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses.
arXiv Detail & Related papers (2025-10-06T15:41:34Z)
- Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens
We propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on the input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility- and prosody-focused metrics.
arXiv Detail & Related papers (2025-09-24T18:55:18Z)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
- Localizing Factual Inconsistencies in Attributable Text Generation
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
- Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates. We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
- STAB: Speech Tokenizer Assessment Benchmark
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output of the divide-conquer evaluator (DCE) into an interpretable numeric score.
Our approach also reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)