InFoBench: Evaluating Instruction Following Ability in Large Language
Models
- URL: http://arxiv.org/abs/2401.03601v1
- Date: Sun, 7 Jan 2024 23:01:56 GMT
- Title: InFoBench: Evaluating Instruction Following Ability in Large Language
Models
- Authors: Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho,
Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu
- Abstract summary: Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
- Score: 57.27152890085759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a
new metric for evaluating Large Language Models' (LLMs) ability to follow
instructions. Addressing a gap in current methodologies, DRFR breaks down
complex instructions into simpler criteria, facilitating a detailed analysis of
LLMs' compliance with various aspects of tasks. Alongside this metric, we
present InFoBench, a benchmark comprising 500 diverse instructions and 2,250
decomposed questions across multiple constraint categories. Our experiments
compare DRFR with traditional scoring methods and explore annotation sources,
including human experts, crowd-sourced workers, and GPT-4. The findings
demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a
cost-efficient annotator. The evaluation of several advanced LLMs using this
framework reveals their strengths and areas needing improvement, particularly
in complex instruction-following. This study contributes a novel metric and
benchmark, offering insights for future LLM development and evaluation.
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data [20.451720017247066]
This paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset.
It is based on a manual annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests.
arXiv Detail & Related papers (2024-07-04T13:54:41Z) - Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [28.691691883519542]
We introduce a technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants.
Based on DeMoRecon, we developed the FGIV dataset which contains fine-grained instruction variants of 1,773 seed instructions.
Our findings show that LLMs fine-tuned with FGIV will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
arXiv Detail & Related papers (2024-06-17T08:08:11Z) - Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z) - FollowEval: A Multi-Dimensional Benchmark for Assessing the
Instruction-Following Capability of Large Language Models [42.72420855478716]
FollowEval benchmark is composed of instances in both English and Chinese.
Each test example is designed to evaluate more than one dimension.
We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans.
arXiv Detail & Related papers (2023-11-16T11:53:31Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Which is better? Exploring Prompting Strategy For LLM-based Metrics [6.681126871165601]
This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task.
Traditional similarity-based metrics such as BLEU and ROUGE have shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
arXiv Detail & Related papers (2023-11-07T06:36:39Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs)
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - Towards Building the Federated GPT: Federated Instruction Tuning [66.7900343035733]
This paper introduces Federated Instruction Tuning (FedIT) as the learning framework for the instruction tuning of large language models (LLMs)
We demonstrate that by exploiting the heterogeneous and diverse sets of instructions on the client's end with FedIT, we improved the performance of LLMs compared to centralized training with only limited local instructions.
arXiv Detail & Related papers (2023-05-09T17:42:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.