CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
- URL: http://arxiv.org/abs/2408.01122v1
- Date: Fri, 2 Aug 2024 09:03:48 GMT
- Title: CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
- Authors: Tao Zhang, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou
- Abstract summary: CFBench is a large-scale Comprehensive Constraints Following Benchmark for Large Language Models.
It features 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks.
CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types.
- Score: 33.19756888719116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user's perspective. To bridge this gap, we propose CFBench, a large-scale Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code are publicly available at https://github.com/PKU-Baichuan-MLSystemLab/CFBench
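The abstract outlines the evaluation recipe (per-constraint checks combined with requirement prioritization) but not the released data format or scoring API. The sketch below is a minimal, hypothetical illustration of that recipe only; the sample schema, field names, metric names, and the trivial `judge` callable are assumptions, not the actual CFBench code.

```python
# Minimal sketch of a CFBench-style scoring pass (hypothetical sample schema
# and metric names; not the actual PKU-Baichuan-MLSystemLab/CFBench code).
from typing import Callable, Dict, List

def evaluate_sample(sample: Dict, model_output: str,
                    judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Score one benchmark sample against its annotated constraints.

    Assumed (hypothetical) sample layout:
      {"instruction": "...",
       "constraints": [{"type": "format", "text": "...", "priority": 1}, ...]}
    `judge` is any checker -- rule-based or LLM-based -- deciding whether
    `model_output` satisfies one constraint description.
    """
    verdicts: List[bool] = [judge(model_output, c["text"])
                            for c in sample["constraints"]]
    # Constraint level: fraction of individual constraints satisfied.
    constraint_rate = sum(verdicts) / len(verdicts)
    # Instruction level: every constraint must hold simultaneously.
    instruction_ok = all(verdicts)
    # Requirement prioritization: treat the user's need as met only when all
    # priority-1 ("must have") constraints are satisfied.
    requirement_ok = all(ok for ok, c in zip(verdicts, sample["constraints"])
                         if c["priority"] == 1)
    return {"constraint_rate": constraint_rate,
            "instruction_ok": float(instruction_ok),
            "requirement_ok": float(requirement_ok)}

if __name__ == "__main__":
    sample = {
        "instruction": "Summarize the article in three bullets, in French.",
        "constraints": [
            {"type": "format", "text": "exactly three bullet points", "priority": 1},
            {"type": "language", "text": "response is in French", "priority": 1},
        ],
    }
    # Trivial always-pass judge, just to show the call shape.
    print(evaluate_sample(sample, "• ... • ... • ...", lambda out, c: True))
```

Separating the constraint-level rate from the prioritized requirement-level verdict mirrors the paper's point that user-perceived success hinges on the must-have constraints, not on the raw fraction satisfied.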
Related papers
- Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
- SysBench: Can Large Language Models Follow System Messages? [30.701602680394686]
Large Language Models (LLMs) have become instrumental across various applications, with the customization of these models to specific scenarios becoming increasingly critical.
Despite the recognized potential of system messages to optimize AI-driven solutions, there is a notable absence of a benchmark for evaluating how well LLMs follow system messages.
We introduce SysBench, a benchmark that systematically analyzes system-message-following ability in terms of three limitations of existing LLMs.
arXiv Detail & Related papers (2024-08-20T15:33:16Z)
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the complex-instruction-following ability of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is, to our knowledge, the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, yielding a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Benchmarking Large Language Models on Controllable Generation under Diversified Instructions [34.89012022437519]
Large language models (LLMs) have exhibited impressive instruction-following capabilities.
It remains unclear whether, and to what extent, they can respond to explicit constraints entailed in various instructions.
We propose CoDI-Eval, a new benchmark for evaluating LLMs' responses to instructions with various constraints.
arXiv Detail & Related papers (2024-01-01T07:35:31Z)
- ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content, with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z)
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for Large Language Models.
We introduce a multi-level mechanism that incrementally adds a single constraint to the initial instruction at each level (see the sketch after this list).
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
- FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and then introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
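As flagged in the FollowBench entry above, its multi-level construction reads as: the level-k prompt is the initial instruction plus k cumulatively added constraints. The sketch below illustrates that reading only; it is not FollowBench's released code, and the instruction and constraint strings are invented.

```python
# Illustrative sketch of a FollowBench-style multi-level construction:
# level k carries the initial instruction plus the first k constraints
# (one more than level k-1). Example strings are invented.
from typing import List

def build_levels(instruction: str, constraints: List[str]) -> List[str]:
    """Return prompts for levels 1..len(constraints)."""
    prompts = []
    for k in range(1, len(constraints) + 1):
        prompts.append(instruction + " " + " ".join(constraints[:k]))
    return prompts

levels = build_levels(
    "Write a short product description for a travel mug.",
    [
        "Keep it under 50 words.",          # level 1 adds one constraint
        "Mention that it is leak-proof.",   # level 2 adds a second
        "Use a playful tone.",              # level 3 adds a third
    ],
)
for k, prompt in enumerate(levels, start=1):
    print(f"Level {k}: {prompt}")
```

Comparing a model's pass rate across such levels isolates how performance degrades as constraints accumulate, which is the failure mode the FollowBench entry highlights.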