A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
- URL: http://arxiv.org/abs/2505.07591v1
- Date: Mon, 12 May 2025 14:16:55 GMT
- Title: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
- Authors: Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
- Abstract summary: We develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting. We evaluate 19 large language models and uncover substantial variation in performance across constraint forms. In-depth analysis indicates that these gains stem primarily from modifications to the parameters of the model's attention modules.
- Score: 48.361839372110246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications to the parameters of the model's attention modules, which enhance constraint recognition and adherence. Code and data are available at https://github.com/Junjie-Ye/MulDimIF.
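The authors' released verifiers live in the linked repository; the snippet below is only a minimal sketch of what a "code-verifiable" instruction-following check could look like, where each constraint attached to an instruction carries a small programmatic verifier so adherence can be scored without an LLM judge. The constraint types, field names, and scoring rule here are assumptions for illustration, not the paper's actual implementation.

```python
import json

# Illustrative verifiers for three common constraint categories.
def verify_length(response: str, max_words: int) -> bool:
    """Length constraint: response must not exceed max_words words."""
    return len(response.split()) <= max_words

def verify_keywords(response: str, required: list[str]) -> bool:
    """Content constraint: every required keyword must appear."""
    return all(kw.lower() in response.lower() for kw in required)

def verify_format_json(response: str) -> bool:
    """Format constraint: response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical registry mapping constraint types to their verifiers.
VERIFIERS = {
    "length": lambda resp, args: verify_length(resp, **args),
    "keywords": lambda resp, args: verify_keywords(resp, **args),
    "json_format": lambda resp, args: verify_format_json(resp),
}

def score_sample(response: str, constraints: list[dict]) -> float:
    """Fraction of constraints the response satisfies (1.0 = full adherence)."""
    passed = sum(VERIFIERS[c["type"]](response, c.get("args", {})) for c in constraints)
    return passed / len(constraints)

if __name__ == "__main__":
    constraints = [
        {"type": "json_format"},
        {"type": "keywords", "args": {"required": ["constraint"]}},
        {"type": "length", "args": {"max_words": 60}},
    ]
    response = '{"title": "MulDimIF", "tldr": "A multi-dimensional constraint framework."}'
    print(score_sample(response, constraints))  # prints 1.0
```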
Related papers
- Generalizing Verifiable Instruction Following [44.02178200187706]
A crucial factor for successful human and AI interaction is the ability of language models to follow human instructions precisely. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints.
arXiv Detail & Related papers (2025-07-03T17:44:33Z) - A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [30.446511584123492]
Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions. We synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants.
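As a rough illustration of the feedback-driven, multi-turn evaluation described in this entry, the sketch below loops generation and constraint checking, feeding violated constraints back to the model as a follow-up turn. The `generate` and `check_constraint` callables are placeholder interfaces, not the benchmark's API.

```python
def multi_turn_eval(task, generate, check_constraint, max_turns=3):
    """Return (turns_used, satisfied_fraction) for one code task."""
    messages = [{"role": "user", "content": task["instruction"]}]
    for turn in range(1, max_turns + 1):
        code = generate(messages)
        failures = [c for c in task["constraints"] if not check_constraint(code, c)]
        if not failures:
            return turn, 1.0
        # Turn verifier failures into natural-language feedback for the next turn.
        feedback = "Your solution violates these constraints:\n" + "\n".join(
            f"- {c['description']}" for c in failures
        )
        messages += [{"role": "assistant", "content": code},
                     {"role": "user", "content": feedback}]
    return max_turns, 1.0 - len(failures) / len(task["constraints"])
```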
arXiv Detail & Related papers (2025-07-01T11:51:40Z) - RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data [37.631782007066214]
RECAST is a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks. We construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions.
arXiv Detail & Related papers (2025-05-25T08:31:08Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - WildIFEval: Instruction Following in the Wild [4.5214954812238295]
We introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios.
arXiv Detail & Related papers (2025-03-09T12:06:29Z) - Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models [39.114513139453756]
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. We design a pipeline to construct datasets with high-quality outputs automatically. To fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability.
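Since this entry reports training with Direct Preference Optimization on positive and negative samples from its data-construction pipeline, the snippet below restates the standard DPO objective (Rafailov et al., 2023) for reference. It is a generic sketch, not the authors' training code; inputs are summed log-probabilities of each response under the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss over a batch of (chosen, rejected) response pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen vs. rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```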
arXiv Detail & Related papers (2025-01-09T03:34:07Z) - Multi-Attribute Constraint Satisfaction via Language Model Rewriting [67.5778646504987]
Multi-Attribute Constraint Satisfaction (MACS) is a method capable of finetuning language models to satisfy user-specified constraints on multiple external real-valued attributes. Our work opens new avenues for generalized and real-valued multi-attribute control, with implications for diverse applications spanning NLP and bioinformatics.
arXiv Detail & Related papers (2024-12-26T12:36:39Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
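A minimal sketch of difficulty categorization from training dynamics, assuming cartography-style statistics (mean confidence and variability of the gold-label probability across epochs); the exact statistics and thresholds used by the paper may differ.

```python
import numpy as np

def categorize_by_dynamics(gold_probs_per_epoch: np.ndarray) -> list[str]:
    """gold_probs_per_epoch: array of shape (num_epochs, num_examples)."""
    confidence = gold_probs_per_epoch.mean(axis=0)   # mean p(gold) per example
    variability = gold_probs_per_epoch.std(axis=0)   # how much it fluctuates
    labels = []
    for conf, var in zip(confidence, variability):
        if conf > 0.8 and var < 0.1:
            labels.append("easy")        # consistently learned early
        elif conf < 0.4:
            labels.append("hard")        # rarely predicted correctly
        else:
            labels.append("ambiguous")   # learned late or unstably
    return labels
```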
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
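The kind of worst-case aggregation this entry motivates can be illustrated as follows: score each query on several semantically equivalent paraphrases, then report both the average and the per-query minimum (worst prompt). The interface and field names below are illustrative.

```python
import numpy as np

def worst_prompt_report(scores_per_query: list[list[float]]) -> dict:
    """scores_per_query[i] holds scores for all paraphrases of query i."""
    per_query = np.array([(np.mean(s), np.min(s)) for s in scores_per_query])
    return {
        "avg_performance": float(per_query[:, 0].mean()),
        "worst_prompt_performance": float(per_query[:, 1].mean()),
        "robustness_gap": float((per_query[:, 0] - per_query[:, 1]).mean()),
    }

# Example: three queries, each scored on four paraphrases in [0, 1].
print(worst_prompt_report([[0.9, 0.7, 0.8, 0.6],
                           [1.0, 0.4, 0.9, 0.8],
                           [0.5, 0.5, 0.6, 0.3]]))
```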
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a Multi-level Fine-grained Constraints Following Benchmark for Large Language Models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each successive level (a construction sketched after this entry).
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
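A toy sketch of the Multi-level construction mentioned in the FollowBench entry, assuming level k simply appends the first k constraints to the initial instruction; the constraints shown are invented for illustration, not taken from the benchmark.

```python
def build_levels(initial_instruction: str, constraints: list[str]) -> list[str]:
    """Level k = initial instruction + the first k constraints (level 0 = none)."""
    levels = [initial_instruction]
    for k in range(1, len(constraints) + 1):
        levels.append(f"{initial_instruction} {' '.join(constraints[:k])}")
    return levels

levels = build_levels(
    "Write a short introduction to large language models.",
    ["Keep it under 100 words.",
     "Use exactly three bullet points.",
     "Mention the term 'instruction following'.",
     "End with a question."],
)
for k, instr in enumerate(levels):
    print(f"Level {k}: {instr}")
```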
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.