FollowEval: A Multi-Dimensional Benchmark for Assessing the
Instruction-Following Capability of Large Language Models
- URL: http://arxiv.org/abs/2311.09829v1
- Date: Thu, 16 Nov 2023 11:53:31 GMT
- Title: FollowEval: A Multi-Dimensional Benchmark for Assessing the
Instruction-Following Capability of Large Language Models
- Authors: Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng
Wang, Deyi Xiong
- Abstract summary: FollowEval benchmark is composed of instances in both English and Chinese.
Each test example is designed to evaluate more than one dimension.
We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans.
- Score: 42.72420855478716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The effective assessment of the instruction-following ability of large
language models (LLMs) is of paramount importance. A model that cannot adhere
to human instructions might not be able to provide reliable and helpful
responses. In pursuit of this goal, various benchmarks have been constructed to
evaluate the instruction-following capacity of these models. However, these
benchmarks are limited to a single language and are constructed using automated
approaches, which restricts their applicability and the quality of the test
examples they contain. To bridge this gap, we introduce the FollowEval
benchmark in this paper. This benchmark is composed of instances in both
English and Chinese, and all test examples are crafted by human experts.
Furthermore, the FollowEval benchmark is designed to assess LLMs across five
critical dimensions of instruction following: string manipulation, commonsense
reasoning, logical reasoning, spatial reasoning, and response constraints. To
enhance the complexity and present a sufficient challenge, each test example is
designed to evaluate more than one dimension. We have evaluated various LLMs
using the FollowEval benchmark and found that their performance significantly
lags behind that of humans. This highlights the considerable room for
improvement in the instruction-following ability of these models.
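As a rough illustration of what a multi-dimensional, rule-checkable test item of this kind might look like, here is a minimal sketch in Python. The instance schema, field names, and checking rule below are illustrative assumptions only; the abstract does not describe FollowEval's actual data format or evaluation code.

```python
# Hypothetical sketch of a FollowEval-style test instance and a rule-based check.
# Field names and the checking logic are assumptions for illustration, not the
# benchmark's published schema or evaluation harness.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FollowEvalInstance:
    instruction: str              # human-crafted prompt (English or Chinese)
    language: str                 # "en" or "zh"
    dimensions: List[str]         # e.g. ["string manipulation", "response constraints"]
    check: Callable[[str], bool]  # rule deciding whether a response follows the instruction


# Example instance touching two of the five dimensions, mirroring the design
# choice that each test example evaluates more than one dimension.
example = FollowEvalInstance(
    instruction="Reverse the word 'benchmark' and reply with only the reversed word, in lowercase.",
    language="en",
    dimensions=["string manipulation", "response constraints"],
    check=lambda response: response.strip() == "benchmark"[::-1],
)


def accuracy(instances: List[FollowEvalInstance], responses: List[str]) -> float:
    """Fraction of responses that satisfy their instance's rule-based check."""
    assert len(instances) == len(responses)
    passed = sum(inst.check(resp) for inst, resp in zip(instances, responses))
    return passed / len(instances)


if __name__ == "__main__":
    # A response that follows both constraints scores 1.0 on this single-item set.
    print(accuracy([example], ["kramhcneb"]))
```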
Related papers
- A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks [0.0]
We evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation.
Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase.
These initial insights highlight the need for more comprehensive empirical studies.
arXiv Detail & Related papers (2024-11-23T06:40:47Z)
- The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks.
Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules).
More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed benchmark for open-ended, multi-step reasoning.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)