Benchmarking Large Language Models on Controllable Generation under
Diversified Instructions
- URL: http://arxiv.org/abs/2401.00690v1
- Date: Mon, 1 Jan 2024 07:35:31 GMT
- Title: Benchmarking Large Language Models on Controllable Generation under
Diversified Instructions
- Authors: Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
- Abstract summary: Large language models (LLMs) have exhibited impressive instruction-following capabilities.
It is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions.
We propose a new benchmark CoDI-Eval to evaluate LLMs' responses to instructions with various constraints.
- Score: 34.89012022437519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) have exhibited impressive
instruction-following capabilities, it is still unclear whether and to what
extent they can respond to explicit constraints that might be entailed in
various instructions. As a significant aspect of LLM alignment, it is thus
important to formulate such a specialized set of instructions as well as
investigate the resulting behavior of LLMs. To address this gap, we propose
a new benchmark CoDI-Eval to systematically and comprehensively evaluate LLMs'
responses to instructions with various constraints. We construct a large
collection of constraint-attributed instructions as a test suite focused on
both generalization and coverage. Specifically, we advocate an instruction
diversification process to synthesize diverse forms of constraint expression
and carefully design the candidate task taxonomy with even finer-grained
sub-categories. Finally, we automate the entire evaluation process to
facilitate further developments. Different from existing studies on
controllable text generation, CoDI-Eval extends the scope to the prevalent
instruction-following paradigm for the first time. We provide extensive
evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval,
revealing their limitations in following instructions with specific constraints
and showing that a significant gap remains between open-source and commercial
closed-source LLMs. We believe this benchmark will facilitate research into
improving the controllability of LLMs' responses to instructions. Our data and
code are available at https://github.com/Xt-cyh/CoDI-Eval.
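For constraints that can be verified by simple rules (e.g., length limits or keyword inclusion), the automated evaluation step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than CoDI-Eval's actual pipeline: the `generate` callable standing in for the LLM under test, the test-case format, and the two rule-based checkers are all assumptions, and constraints such as sentiment or topic would additionally require classifier-based checking.

```python
import re

# Hypothetical constraint-attributed test cases in the spirit of CoDI-Eval
# (this instruction/constraint format is an assumption, not the benchmark's
# actual data schema).
TEST_CASES = [
    {"instruction": "Write a product blurb for a coffee maker in at most 50 words.",
     "constraint": {"type": "max_words", "value": 50}},
    {"instruction": "Describe a rainy day and be sure to include the word 'umbrella'.",
     "constraint": {"type": "include_keyword", "value": "umbrella"}},
]

def satisfies(response: str, constraint: dict) -> bool:
    """Rule-based check of a single constraint on a model response."""
    if constraint["type"] == "max_words":
        return len(response.split()) <= constraint["value"]
    if constraint["type"] == "include_keyword":
        return re.search(re.escape(constraint["value"]), response, re.IGNORECASE) is not None
    raise ValueError(f"Unknown constraint type: {constraint['type']}")

def evaluate(generate) -> float:
    """Return constraint-satisfaction accuracy for a model exposed as
    generate(instruction) -> str."""
    passed = sum(
        satisfies(generate(case["instruction"]), case["constraint"])
        for case in TEST_CASES
    )
    return passed / len(TEST_CASES)

# Usage sketch (my_llm is a hypothetical model wrapper):
# accuracy = evaluate(lambda instruction: my_llm(instruction))
```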
Related papers
- Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions [33.18076221854853]
Recent studies show that LLMs, particularly open-source models, struggle to follow complex instructions with multiple constraints.
We propose the Divide-Verify-Refine (DVR) framework with three steps.
We show that the framework significantly improves performance, doubling Llama3.1-8B's constraint adherence on instructions with 6 constraints.
arXiv Detail & Related papers (2024-10-16T04:01:55Z)
- Enhancing LLM's Cognition via Structurization [41.13997892843677]
Large language models (LLMs) process input contexts through a causal and sequential perspective.
This paper presents a novel concept of context structurization.
Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements.
arXiv Detail & Related papers (2024-07-23T12:33:58Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the ability of large language models (LLMs) to follow complex instructions has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z)
- Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have come in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
- Chain-of-Specificity: An Iteratively Refining Method for Eliciting Knowledge from Large Language Models [27.615355663475984]
Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information.
However, they can struggle to follow the specific constraints in an instruction. Existing approaches have attempted to address this by decomposing or rewriting input instructions.
This paper proposes a simple yet effective method named Chain-of-Specificity (CoS).
arXiv Detail & Related papers (2024-02-20T08:03:05Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level (an illustrative sketch of this idea appears after this list).
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
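As a rough illustration of the Multi-level mechanism mentioned in the FollowBench entry above, the sketch below builds one prompt per level by appending a single additional constraint at each step. The example instruction, the constraint strings, and the `build_levels` helper are hypothetical and do not reflect FollowBench's actual data or templates.

```python
# Hypothetical illustration of a multi-level constraint mechanism: the prompt
# at level k consists of the initial instruction plus the first k constraints.
INITIAL_INSTRUCTION = "Write a short story about a lighthouse keeper."
CONSTRAINTS = [
    "Keep it under 100 words.",
    "Write it in the first person.",
    "Mention a storm.",
    "End with a question.",
    "Do not use the word 'sea'.",
]

def build_levels(instruction: str, constraints: list[str]) -> list[str]:
    """Level k prompt = instruction plus the first k constraints, k = 1..len(constraints)."""
    return [" ".join([instruction] + constraints[:k])
            for k in range(1, len(constraints) + 1)]

if __name__ == "__main__":
    for level, prompt in enumerate(build_levels(INITIAL_INSTRUCTION, CONSTRAINTS), start=1):
        print(f"Level {level}: {prompt}")
```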