BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
- URL: http://arxiv.org/abs/2512.10403v1
- Date: Thu, 11 Dec 2025 08:09:24 GMT
- Title: BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
- Authors: Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, Wentao Zhang
- Abstract summary: BRACE is a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs.
- Score: 23.704921982469063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
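As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores two candidate captions against one audio clip by cosine similarity of their embeddings (the usual form of a CLAP-style score) and then computes an F1 over pairwise preferences. Everything here is an assumption made for illustration: the vectors are toy values rather than outputs of a CLAP model, the helper names `cosine`, `prefer`, and `macro_f1` are hypothetical, and the abstract does not state how BRACE computes its F1, so macro-averaging over the two labels is only one plausible choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefer(audio_emb, cap_a_emb, cap_b_emb):
    """Return 'a' or 'b' for the caption whose embedding is closer to the audio.

    No reference caption is used, which is what makes the comparison
    reference-free. Ties go to 'a' (an arbitrary choice for this sketch).
    """
    score_a = cosine(audio_emb, cap_a_emb)
    score_b = cosine(audio_emb, cap_b_emb)
    return "a" if score_a >= score_b else "b"


def macro_f1(gold, pred, labels=("a", "b")):
    """Macro-averaged F1 over the preference labels (assumed protocol)."""
    per_label = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            per_label.append(2 * precision * recall / (precision + recall))
        else:
            per_label.append(0.0)
    return sum(per_label) / len(per_label)


# Toy embeddings standing in for audio and caption encoders.
audio = [1.0, 0.0]
caption_a = [0.9, 0.1]  # points roughly the same way as the audio
caption_b = [0.1, 0.9]  # points mostly away from the audio
choice = prefer(audio, caption_a, caption_b)

# Hypothetical human preferences versus metric preferences on four pairs.
human = ["a", "b", "a", "b"]
metric = ["a", "b", "b", "b"]
score = macro_f1(human, metric)
```

With real data, the toy vectors would be replaced by embeddings from whichever audio and text encoders are under test, and the human labels by the benchmark's annotations.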
Related papers
- Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation [79.13636675697096]
Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS) is a task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations. We propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information.
arXiv Detail & Related papers (2026-02-03T07:47:59Z)
- Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation [19.92868268408954]
Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. We propose TRACE, a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
arXiv Detail & Related papers (2026-01-20T08:57:02Z)
- AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration [28.117814524373667]
AXIOM is a novel perturbation-based framework for synthesizing code evaluation benchmarks at scale. It reframes program scores as the refinement effort needed for deployment.
arXiv Detail & Related papers (2025-12-23T08:39:22Z)
- AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation [16.047087043580053]
Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. This paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework.
arXiv Detail & Related papers (2025-09-02T14:15:22Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews [0.9314555897827079]
Systematic literature reviews aim to identify and evaluate all relevant papers on a topic. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings. We propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM based graded relevance scorer and a dense re-ranker.
arXiv Detail & Related papers (2025-05-30T16:18:50Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models [63.53530525014976]
ZIQI-Eval is a benchmark specifically designed to evaluate the music-related capabilities of large language models (LLMs). ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities.
arXiv Detail & Related papers (2024-06-22T16:24:42Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.