Related papers: When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

URL: http://arxiv.org/abs/2509.21051v1
Date: Thu, 25 Sep 2025 12:01:32 GMT
Title: When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Authors: Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Edison Marrese-Taylor, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo,
Abstract summary: We introduce two specialized benchmarks for fundamental domains where multiple instructions following is important.<n>We show that performance consistently degrades as the number of instructions increases.<n>We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error.
Score: 42.08242599538887
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.

Related papers

How Many Instructions Can LLMs Follow at Once? [0.16874375111244325]
We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases.<n>We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions.<n>Our insights can help inform design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs.
arXiv Detail & Related papers (2025-07-15T17:59:42Z)
Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.<n>Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.<n>Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z)
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules) More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [28.691691883519542]
We introduce a technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants. Based on DeMoRecon, we developed the FGIV dataset which contains fine-grained instruction variants of 1,773 seed instructions. Our findings show that LLMs fine-tuned with FGIV will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
arXiv Detail & Related papers (2024-06-17T08:08:11Z)
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm. Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting. We introduce MoELoRA to MLLMs which is effective to retain the previous instruction alignment.
arXiv Detail & Related papers (2024-03-13T08:54:31Z)
Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks. Results on two popular multimodal instructionfollowing benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
arXiv Detail & Related papers (2024-02-22T12:35:50Z)
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [24.741736629886564]
Instruction tuning is a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions. We introduce MUL-TIINSTRUCT, the first multimodal instruction tuning benchmark dataset. We show strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset.
arXiv Detail & Related papers (2022-12-21T05:17:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.