Related papers: ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

URL: http://arxiv.org/abs/2510.23558v1
Date: Mon, 27 Oct 2025 17:31:25 GMT
Title: ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
Authors: Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu,
Abstract summary: Large Audio Language Models (LALMs) extract and understand diverse information from audio.<n>LALMs are highly sensitive to how instructions are phrased, affecting instruction-following rates and task performance.<n>We introduce ISA-Bench, a benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition.
Score: 28.350243803500504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.

Related papers

Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance [29.349544663659938]
We show that perturbations in instruction-tuning data can enhance large language models' resistance against noisy instructions.<n>Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance.
arXiv Detail & Related papers (2025-10-03T21:54:33Z)
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs [8.918587474371321]
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging.<n>We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs.<n>Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution.
arXiv Detail & Related papers (2025-09-09T15:30:40Z)
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities.<n>Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training.<n>Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
Benchmarking Prompt Sensitivity in Large Language Models [13.986971540998258]
Large language Models (LLMs) are highly sensitive to variations in prompt formulation.<n>This paper introduces a new task, Prompt Sensitivity Prediction, and a dataset designed to investigate the effects of slight prompt variations on LLM performance.
arXiv Detail & Related papers (2025-02-09T23:01:03Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.<n>We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.<n>We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [12.950770409452035]
We introduce two metrics for classification tasks, namely sensitivity and consistency.<n> sensitivity measures changes of predictions across rephrasings of the prompt.<n>Instead, consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [28.691691883519542]
We introduce a technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants. Based on DeMoRecon, we developed the FGIV dataset which contains fine-grained instruction variants of 1,773 seed instructions. Our findings show that LLMs fine-tuned with FGIV will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
arXiv Detail & Related papers (2024-06-17T08:08:11Z)
Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning to maximize the similarity between semantically equivalent instruction-instance pairs. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
arXiv Detail & Related papers (2024-02-17T00:09:32Z)
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [59.07490387145391]
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories.
arXiv Detail & Related papers (2024-01-12T12:10:28Z)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.