AHELM: A Holistic Evaluation of Audio-Language Models
- URL: http://arxiv.org/abs/2508.21376v2
- Date: Tue, 02 Sep 2025 17:58:21 GMT
- Title: AHELM: A Holistic Evaluation of Audio-Language Models
- Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang,
- Abstract summary: Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
- Score: 78.20477815156484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
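The abstract mentions two concrete mechanisms: baseline systems built by chaining an automatic speech recognizer with a text-only language model, and a statistical check for group unfairness on ASR tasks. The paper's abstract does not name the specific components or test, so the sketch below is purely illustrative: Whisper, an OpenAI chat model, the `jiwer` WER implementation, and a Mann-Whitney U test are stand-in assumptions, not AHELM's actual implementation.

```python
"""
Hypothetical sketch of (a) an ASR + LM cascade baseline of the kind AHELM
compares against, and (b) a per-group significance check on ASR error rates.
All component choices here are assumptions for illustration only.
"""
import whisper                       # openai-whisper ASR (assumed stand-in)
from openai import OpenAI            # text-only LM for the second stage (assumed)
from jiwer import wer                # word error rate
from scipy.stats import mannwhitneyu # assumed stand-in for the fairness test

asr = whisper.load_model("base")
llm = OpenAI()

def cascade_baseline(audio_path: str, question: str) -> str:
    """Transcribe the audio, then answer the question from the transcript alone."""
    transcript = asr.transcribe(audio_path)["text"]
    response = llm.chat.completions.create(
        model="gpt-4o-mini",         # placeholder model name
        temperature=0.0,             # AHELM standardizes inference parameters
        messages=[{
            "role": "user",
            "content": f"Transcript: {transcript}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content

def group_unfairness_pvalue(samples: list[dict]) -> float:
    """
    Compare per-utterance WER between two speaker groups.
    Each sample is {"audio": path, "reference": str, "group": "A" or "B"}.
    A small p-value suggests ASR accuracy differs between the groups.
    """
    errors = {"A": [], "B": []}
    for s in samples:
        hypothesis = asr.transcribe(s["audio"])["text"]
        errors[s["group"]].append(wer(s["reference"], hypothesis))
    return mannwhitneyu(errors["A"], errors["B"]).pvalue
```

A cascade like this discards paralinguistic information (tone, speaker identity, background sound), which is consistent with the finding that such a baseline can still rank 6th overall while being limited to speech-to-text content.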
Related papers
- SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation [52.468945848774844]
This paper addresses the need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four different dimensions of evaluation.
arXiv Detail & Related papers (2026-01-27T15:29:02Z) - AEQ-Bench: Measuring Empathy of Omni-Modal Large Models [55.722881748046895]
We introduce AEQ-Bench, a novel benchmark to assess two core empathetic capabilities of omni-modal large models (OLMs). AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that OLMs trained with audio output capabilities generally outperformed models with text-only outputs.
arXiv Detail & Related papers (2026-01-15T15:39:50Z) - SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality [2.1178416840822027]
Speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. We introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs. Our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics.
arXiv Detail & Related papers (2025-12-09T04:39:50Z) - VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark 1) maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities. Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial. We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Salmon: A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)