Related papers: Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

Related papers

IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs [27.291294878333765]
We propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks.<n>We turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
arXiv Detail & Related papers (2025-07-27T10:11:16Z)
How Many Instructions Can LLMs Follow at Once? [0.16874375111244325]
We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases.<n>We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions.<n>Our insights can help inform design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs.
arXiv Detail & Related papers (2025-07-15T17:59:42Z)
Minerva: A Programmable Memory Test Benchmark for Language Models [18.474144165594225]
We present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively.<n>We evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory.<n>We also design composite tests to investigate the models' ability to perform more complex, integrated tasks.
arXiv Detail & Related papers (2025-02-05T16:53:45Z)
LLMs can be easily Confused by Instructional Distractions [16.060402139507644]
Large language models show exceptional skill in instruction following tasks. This strength can turn into a vulnerability when the models are required to disregard certain instructions. We introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction.
arXiv Detail & Related papers (2025-02-05T04:52:57Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples.<n>We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts.<n>Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models [8.020688053947547]
One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields. We have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills.
arXiv Detail & Related papers (2024-12-27T04:37:39Z)
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs.<n>Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules) More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
Deconstructing In-Context Learning: Understanding Prompts via Corruption [13.37109575313212]
We decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models are more sensitive to the semantics of the prompt.
arXiv Detail & Related papers (2024-04-02T15:50:55Z)
Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning [13.535110749767451]
We propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG) Our method computes the information gain on masked parts to dynamically replay data and refine the training objective. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
arXiv Detail & Related papers (2024-03-15T06:54:20Z)
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm. Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting. We introduce MoELoRA to MLLMs which is effective to retain the previous instruction alignment.
arXiv Detail & Related papers (2024-03-13T08:54:31Z)
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions [63.307317584926146]
Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. We construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain.
arXiv Detail & Related papers (2024-03-06T17:16:44Z)
Large Language Models can Learn Rules [106.40747309894236]
We present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with large language models (LLMs) Experiments on relational reasoning, numerical reasoning and concept learning problems show that HtT improves existing prompting methods. The learned rules are also transferable to different models and to different forms of the same problem.
arXiv Detail & Related papers (2023-10-10T23:07:01Z)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs [54.22416829200613]
Eva-KELLM is a new benchmark for evaluating knowledge editing of large language models. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning [74.70157466822612]
We systematically study the role of task definitions in instruction learning. We find that model performance drops substantially when removing contents describing the task output. We propose two strategies to help models better leverage task instructions.
arXiv Detail & Related papers (2023-06-01T21:11:24Z)
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following [44.701091969256055]
We present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference. We observe that both base LLMs (i.e. not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average.
arXiv Detail & Related papers (2023-02-28T16:06:35Z)
Large Language Models Are Human-Level Prompt Engineers [31.98042013940282]
We propose Automatic Prompt Engineer for automatic instruction generation and selection. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness.
arXiv Detail & Related papers (2022-11-03T15:43:03Z)
Few-shot Learning with Retrieval Augmented Language Models [75.63572749426473]
Large language models have shown impressive few-shot results on a wide range of tasks. When knowledge is key to such results, massive parameter counts to store knowledge seem to be needed. We present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples.
arXiv Detail & Related papers (2022-08-05T17:39:22Z)
CLUES: Few-Shot Learning Evaluation in Natural Language Understanding [81.63968985419982]
We introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
arXiv Detail & Related papers (2021-11-04T00:43:15Z)
Finetuned Language Models Are Zero-Shot Learners [67.70352207685558]
We show that instruction tuning boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
arXiv Detail & Related papers (2021-09-03T17:55:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.