FollowEval: A Multi-Dimensional Benchmark for Assessing the
  Instruction-Following Capability of Large Language Models
- URL: http://arxiv.org/abs/2311.09829v1
- Date: Thu, 16 Nov 2023 11:53:31 GMT
- Title: FollowEval: A Multi-Dimensional Benchmark for Assessing the
  Instruction-Following Capability of Large Language Models
- Authors: Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng
  Wang, Deyi Xiong
- Abstract summary: The FollowEval benchmark is composed of instances in both English and Chinese.
Each test example is designed to evaluate more than one dimension.
We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans.
- Score: 42.72420855478716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The effective assessment of the instruction-following ability of large
language models (LLMs) is of paramount importance. A model that cannot adhere
to human instructions might not be able to provide reliable and helpful
responses. In pursuit of this goal, various benchmarks have been constructed to
evaluate the instruction-following capacity of these models. However, these
benchmarks are limited to a single language and are constructed using automated
approaches, which restricts their applicability and the quality of the test
examples they contain. To bridge this gap, we introduce the FollowEval
benchmark in this paper. This benchmark is composed of instances in both
English and Chinese, and all test examples are crafted by human experts.
Furthermore, the FollowEval benchmark is designed to assess LLMs across five
critical dimensions of instruction following: string manipulation, commonsense
reasoning, logical reasoning, spatial reasoning, and response constraints. To
enhance the complexity and present a sufficient challenge, each test example is
designed to evaluate more than one dimension. We have evaluated various LLMs
using the FollowEval benchmark and found that their performance significantly
lags behind that of humans. This highlights the considerable room for
improvement in the instruction-following ability of these models.
 
      
Related papers
- Evaluating Large Language Models for the Generation of Unit Tests with Equivalence Partitions and Boundary Values [42.88667535189424]
 This research evaluates the potential of Large Language Models (LLMs) in automatically generating test cases.
An optimized prompt was developed that integrates code and requirements, covering critical cases such as equivalence partitions and boundary values.
The results show that the effectiveness of LLMs depends on well-designed prompts, robust implementation, and precise requirements.
 arXiv  Detail & Related papers  (2025-05-14T22:22:15Z)
- Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
 Large language models (LLMs) have demonstrated significant utilities in real-world applications.
 Benchmark evaluations are crucial for assessing the capabilities of LLMs.
 arXiv  Detail & Related papers  (2025-02-13T03:43:33Z)
- M-IFEval: Multilingual Instruction-Following Evaluation [2.624902795082451]
 The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria.
It only includes English instructions, limiting its ability to assess LLMs in other languages.
We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions.
 arXiv  Detail & Related papers  (2025-02-07T06:27:04Z)
- Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models [8.020688053947547]
 One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions.
This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields.
We have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills.
 arXiv  Detail & Related papers  (2024-12-27T04:37:39Z)
- A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks [0.0]
 We evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation.
Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase.
These initial insights highlight the need for more comprehensive empirical studies.
 arXiv  Detail & Related papers  (2024-11-23T06:40:47Z)
- The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
 We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks.
Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules).
More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
 arXiv  Detail & Related papers  (2024-06-28T15:34:26Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
 BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
 arXiv  Detail & Related papers  (2024-06-09T12:30:30Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
  Large Language Models [50.03163753638256]
 Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
 arXiv  Detail & Related papers  (2023-11-20T07:06:31Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
 We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
 arXiv  Detail & Related papers  (2023-11-15T18:25:26Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
 Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
 arXiv  Detail & Related papers  (2023-11-03T14:59:54Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
 This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
 arXiv  Detail & Related papers  (2023-05-21T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     