WildIFEval: Instruction Following in the Wild
- URL: http://arxiv.org/abs/2503.06573v1
- Date: Sun, 09 Mar 2025 12:06:29 GMT
- Title: WildIFEval: Instruction Following in the Wild
- Authors: Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor,
- Abstract summary: We introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios.
- Score: 4.5214954812238295
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation as the number of constraints increases; thus, all models have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
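The evaluation protocol the abstract describes - scoring a response against each of its constraints and tracking how performance degrades as constraints accumulate - can be sketched as follows. This is a minimal illustration, not the authors' released code; `check` is a hypothetical stand-in for any per-constraint judge (rule-based or LLM-based).

```python
# Hedged sketch of multi-constraint evaluation: score a response by the
# fraction of constraints it satisfies, then aggregate scores by the
# number of constraints to expose a degradation trend.
from collections import defaultdict

def score_response(response, constraints, check):
    """Fraction of the constraints that the response satisfies."""
    if not constraints:
        return 1.0
    return sum(check(response, c) for c in constraints) / len(constraints)

def aggregate_by_count(examples, check):
    """Mean score, grouped by number of constraints per instruction."""
    buckets = defaultdict(list)
    for response, constraints in examples:
        buckets[len(constraints)].append(
            score_response(response, constraints, check))
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

With a toy substring check, `aggregate_by_count` yields one mean score per constraint count; plotting these means against the count is one way to visualize the degradation the paper reports.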
Related papers
- Federated Continual Instruction Tuning [39.344583304181135]
Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training.
We introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge.
Our proposed method significantly enhances model performance across varying levels of data and catastrophic forgetting.
arXiv Detail & Related papers (2025-03-17T07:58:06Z) - Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following [39.114513139453756]
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). We quantitatively measure the difficulty distribution of the constraints by a novel Constraint Difficulty Distribution Index (CDDI). We find that LLMs are more performant when presented with the constraints in a "hard-to-easy" order.
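The "hard-to-easy" ordering that this paper finds favorable can be sketched as a simple reordering step. This is an illustrative sketch, not the paper's implementation: the per-constraint difficulty scores are assumed inputs (e.g., estimated from model failure rates on a calibration set).

```python
# Hedged sketch: reorder a multi-constraint prompt so that constraints
# appear hardest first, given assumed per-constraint difficulty scores.
def order_hard_to_easy(constraints, difficulty):
    """Sort constraints by descending difficulty score."""
    return sorted(constraints, key=lambda c: difficulty[c], reverse=True)

def build_prompt(task, constraints, difficulty):
    """Render the task followed by its constraints, hardest first."""
    ordered = order_hard_to_easy(constraints, difficulty)
    return "\n".join([task] + [f"- {c}" for c in ordered])
```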
arXiv Detail & Related papers (2025-02-24T14:39:28Z) - The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. We extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z) - WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models [67.15146980023621]
We propose WarriorCoder, a novel paradigm that learns from expert battles to address the limitations of current approaches. We create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges. This competitive framework generates novel training data from scratch, leveraging the strengths of all participants.
arXiv Detail & Related papers (2024-12-23T08:47:42Z) - Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning [55.265138447400744]
Statement-Tuning is a technique that models discriminative tasks as a set of finite statements and trains an encoder model to discriminate between the potential statements to determine the label.
Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters.
The study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve strong performance with modest training data.
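The core Statement-Tuning recipe described above - recasting a discriminative task as one natural-language statement per candidate label, so an encoder can score each statement as true or false - can be sketched as follows. The template and `score` function here are hypothetical stand-ins, not the paper's actual prompts or model.

```python
# Hedged sketch of the Statement-Tuning idea: one statement per label,
# with the label of the highest-scoring statement taken as the prediction.
def make_statements(text, labels, template="{text} This text is about {label}."):
    """One natural-language statement per candidate label."""
    return [(label, template.format(text=text, label=label)) for label in labels]

def predict(text, labels, score):
    """Pick the label whose statement the encoder scores as most plausible."""
    return max(make_statements(text, labels), key=lambda pair: score(pair[1]))[0]
```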
arXiv Detail & Related papers (2024-04-19T14:05:03Z) - Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models [23.17547206140014]
We introduce Conifer, an instruction tuning dataset for large language models.
We train models with Conifer to follow instructions with complex constraints.
On several instruction-following benchmarks, our 7B model outperforms the state-of-the-art open-source 7B models.
arXiv Detail & Related papers (2024-04-03T15:55:39Z) - CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm.
Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting.
We introduce MoELoRA to MLLMs, which is effective at retaining previous instruction alignment.
arXiv Detail & Related papers (2024-03-13T08:54:31Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
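The multi-level mechanism described above - adding a single constraint to the initial instruction at each level - can be sketched as a simple prompt builder. This is an illustrative sketch of the idea, not FollowBench's actual prompt format.

```python
# Hedged sketch of the multi-level mechanism: level k combines the
# initial instruction with the first k constraints, so each level adds
# exactly one constraint over the previous one.
def build_levels(instruction, constraints):
    """Return one prompt per level, each adding one more constraint."""
    return [
        f"{instruction} Constraints: {'; '.join(constraints[:k])}."
        for k in range(1, len(constraints) + 1)
    ]
```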
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - A Simple and Effective Framework for Strict Zero-Shot Hierarchical Classification [23.109264015761873]
Large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero or few-shot settings.
We propose a more indicative long-tail prediction task for hierarchical datasets.
Our method does not require any parameter updates, a resource-intensive process, and achieves strong performance across multiple datasets.
arXiv Detail & Related papers (2023-05-24T16:04:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.