Related papers: Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Benchmarking Complex Instruction-Following with Multiple Constraints Composition

URL: http://arxiv.org/abs/2407.03978v2
Date: Thu, 11 Jul 2024 06:44:47 GMT
Title: Benchmarking Complex Instruction-Following with Multiple Constraints Composition
Authors: Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang,
Abstract summary: How to evaluate the ability of complex instruction-following of large language models (LLMs) has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints. We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
Score: 72.82640456309821
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.

Related papers

EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs)<n>EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently.<n>We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
We introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST) Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels.
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
Constraint Back-translation Improves Complex Instruction Following of Large Language Models [55.60192044049083]
Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. We propose a novel data generation technique, constraint back-translation.
arXiv Detail & Related papers (2024-10-31T17:42:26Z)
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models [43.869374263102934]
We study what training data is effective in enhancing complex constraints following abilities. We find that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions. Our methods improve models' ability to follow instructions generally and generalize effectively across out-of-domain, in-domain, and adversarial settings.
arXiv Detail & Related papers (2024-04-24T12:51:14Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions [34.89012022437519]
Large language models (LLMs) have exhibited impressive instruction-following capabilities. It is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. We propose a new benchmark CoDI-Eval to evaluate LLMs' responses to instructions with various constraints.
arXiv Detail & Related papers (2024-01-01T07:35:31Z)
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a benchmark for Fine-grained Constraints Following Benchmark for Large Language Models. We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.