Robustness assessment of large audio language models in multiple-choice evaluation
- URL: http://arxiv.org/abs/2510.04584v1
- Date: Mon, 06 Oct 2025 08:36:17 GMT
- Title: Robustness assessment of large audio language models in multiple-choice evaluation
- Authors: Fernando López, Santosh Kesiraju, Jordi Luque
- Abstract summary: We conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices.
- Score: 43.42989171223751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially different results. Existing MCQA frameworks do not account for this variability and report a single accuracy number per benchmark or category. We dive into the MCQA evaluation framework and conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models: Audio Flamingo 2, Audio Flamingo 3, Qwen2.5-Omni-7B-Instruct, and Kimi-Audio-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.
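The abstract does not spell out the proposed protocol or metric, so the following is only a minimal sketch of one way to measure sensitivity to choice ordering: score a single item under every permutation of its choices and report the spread instead of a single accuracy number. The function `order_sensitivity`, the `predict` callback, and the toy `first_choice_model` are hypothetical names introduced for illustration; they are not taken from the paper.

```python
import itertools
import statistics
from typing import Callable, Sequence


def order_sensitivity(
    question: str,
    choices: Sequence[str],
    answer: str,
    predict: Callable[[str, Sequence[str]], str],
) -> dict:
    """Score one MCQA item under every ordering of its choices.

    `predict` is an assumed interface: it receives the question and the
    choices in presentation order and returns the text of the chosen option.
    """
    hits = [
        1.0 if predict(question, list(perm)) == answer else 0.0
        for perm in itertools.permutations(choices)
    ]
    return {
        "mean_accuracy": statistics.mean(hits),       # accuracy averaged over orderings
        "std": statistics.pstdev(hits),               # spread across orderings
        "consistent": all(h == 1.0 for h in hits),    # correct under every ordering
    }


# Hypothetical stand-in for a model with a pure position bias:
# it always picks whichever choice is listed first.
def first_choice_model(question: str, choices: Sequence[str]) -> str:
    return choices[0]


report = order_sensitivity(
    "Which instrument is playing?",
    ["piano", "violin", "flute", "drums"],
    "piano",
    first_choice_model,
)
print(report)
```

With four choices there are 24 orderings, and the position-biased toy model is correct only in the 6 that list the right answer first, so its mean accuracy is 0.25 and it is flagged as inconsistent. A single fixed ordering would have reported either 0 or 1 for the same item. Enumerating all permutations is feasible only for small choice sets; sampling a fixed number of orderings is a common alternative.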
Related papers
- SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality [2.1178416840822027]
Speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. We introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs. Our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics.
arXiv Detail & Related papers (2025-12-09T04:39:50Z)
- MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark [64.89810922949984]
We introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks.
arXiv Detail & Related papers (2025-09-26T15:12:46Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- Metric assessment protocol in the context of answer fluctuation on MCQ tasks [4.453107218424601]
Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. Previous research has not conducted a thorough assessment of them. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates.
arXiv Detail & Related papers (2025-07-21T13:01:46Z)
- Reasoning Models are Test Exploiters: Rethinking Multiple-Choice [12.317748510370238]
Large Language Models (LLMs) are asked to choose among a fixed set of choices. Multiple-choice question-answering (MCQA) is a good proxy for the downstream performance of models. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-07-21T07:49:32Z)
- MMMOS: Multi-domain Multi-axis Audio Quality Assessment [49.48516314472825]
Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness. MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's tau versus baseline.
arXiv Detail & Related papers (2025-07-05T16:42:09Z)
- SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction [1.8862680628828246]
Evaluation of voice synthesis can be done using objective metrics or subjective metrics. Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS) is a small-sized, end-to-end, highly generalized and scalable model for predicting MOS score on a scale of 5. We use the sequences of convolutions and stack them to get the latent features of the audio samples to get the best results based on mean squared error (MSE), Linear Concordance Correlation coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC) and Kendall Rank Correlation Coefficient (KTAU).
arXiv Detail & Related papers (2025-06-02T10:45:40Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)