Related papers: How Many Instructions Can LLMs Follow at Once?

How Many Instructions Can LLMs Follow at Once?

URL: http://arxiv.org/abs/2507.11538v1
Date: Tue, 15 Jul 2025 17:59:42 GMT
Title: How Many Instructions Can LLMs Follow at Once?
Authors: Daniel Jaroslawicz, Brendan Whiting, Parth Shah, Karime Maamari,
Abstract summary: We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases.<n>We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions.<n>Our insights can help inform design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs.
Score: 0.16874375111244325
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities have not yet been characterized, as existing benchmarks only evaluate models on tasks with a single or few instructions. We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions. Our analysis reveals model size and reasoning capability to correlate with 3 distinct performance degradation patterns, bias towards earlier instructions, and distinct categories of instruction-following errors. Our insights can help inform design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs. We open-source the benchmark and all results for further analysis at https://distylai.github.io/IFScale.

Related papers

Training with Pseudo-Code for Instruction Following [4.7188893422904]
We take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code.<n>We propose fine-tuning Large Language Models with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code.<n>We conduct rigorous experiments with $5$ different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common sense reasoning.
arXiv Detail & Related papers (2025-05-23T15:14:29Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
KCIF: Knowledge-Conditioned Instruction Following [4.945902994386117]
We study the interaction between knowledge and instruction following, and observe that LLMs struggle to follow simple answer modifying instructions.<n>Our results highlight a limitation in the traditional separation of knowledge/reasoning and instruction following, and suggest that joint-study of these capabilities are important.
arXiv Detail & Related papers (2024-10-16T19:07:37Z)
Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.<n>Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.<n>Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z)
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules) More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
Instruction-Following Evaluation for Large Language Models [52.90926820437014]
We introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. We show evaluation results of two widely available LLMs on the market.
arXiv Detail & Related papers (2023-11-14T05:13:55Z)
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [111.01953096869947]
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs)<n>We develop a systematic approach to automatically create high-quality complex visual reasoning instructions.<n> Experimental results consistently demonstrate the enhanced performance of all compared MLLMs.
arXiv Detail & Related papers (2023-11-02T15:36:12Z)
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models [91.02730155418699]
Large language models (LLMs) can perform a wide range of tasks by following natural language instructions. We introduce Auto-Instruct, a novel method to automatically improve the quality of instructions provided to LLMs. In experiments on 118 out-of-domain tasks, Auto-Instruct surpasses both human-written instructions and existing baselines of LLM-generated instructions.
arXiv Detail & Related papers (2023-10-19T19:52:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.