Related papers: AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

URL: http://arxiv.org/abs/2509.08031v2
Date: Thu, 11 Sep 2025 16:27:59 GMT
Title: AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
Authors: Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan
Abstract summary: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution.
Score: 8.918587474371321
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

Related papers

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models [28.350243803500504]
Large Audio Language Models (LALMs) extract and understand diverse information from audio. LALMs are highly sensitive to how instructions are phrased, affecting instruction-following rates and task performance. We introduce ISA-Bench, a benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition.
arXiv Detail & Related papers (2025-10-27T17:31:25Z)
A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling [1.219841051166348]
In this paper, we explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs.
arXiv Detail & Related papers (2025-05-28T12:28:18Z)
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z)
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning [30.308743810639758]
Large audio language models (LALMs) have to be evaluated on reasoning-related tasks, which differ from traditional classification or generation tasks. We benchmark open-source LALMs and observe that they are consistently behind human capabilities on the tasks in the TREA dataset. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated and thus points to a need for wholesome evaluation of LALMs for high-stakes applications.
arXiv Detail & Related papers (2025-05-19T13:46:35Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical and challenging to evaluate. Manual evaluation, while comprehensive, is often costly and resource-intensive. Automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z)
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models [0.0]
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models. This paper introduces THaMES, an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs.
arXiv Detail & Related papers (2024-09-17T16:55:25Z)
CoMMIT: Coordinated Multimodal Instruction Tuning [90.1532838391285]
Multimodal large language models (MLLMs) generally involve cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives. We propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding the LLM decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language model (LLM) has excellent performance and wide practical uses. Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios. We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety. Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems. Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.