InFoBench: Evaluating Instruction Following Ability in Large Language Models
- URL: http://arxiv.org/abs/2401.03601v1
- Date: Sun, 7 Jan 2024 23:01:56 GMT
- Title: InFoBench: Evaluating Instruction Following Ability in Large Language Models
- Authors: Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu
- Abstract summary: Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
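The abstract describes DRFR as the fraction of decomposed yes/no criteria that a model's response satisfies, with the criteria judged by an annotator such as a human expert, a crowd-sourced worker, or GPT-4. The sketch below is a minimal illustration of that computation under those assumptions; the example instruction, criteria, and `toy_judge` are hypothetical stand-ins, not drawn from the InFoBench dataset or the paper's released prompts.

```python
from typing import Callable, Dict, List

# Hypothetical item in the style of InFoBench: one instruction decomposed
# into simple yes/no criteria (illustrative; not from the actual dataset).
EXAMPLE: Dict = {
    "instruction": "Summarize the paper in exactly three sentences, in formal English.",
    "response": "The paper proposes DRFR. It decomposes instructions. Experiments support it.",
    "criteria": [
        "Is the response a summary of the paper?",
        "Does the response contain exactly three sentences?",
        "Is the response written in formal English?",
    ],
}

def drfr(items: List[Dict], judge: Callable[[str, str, str], bool]) -> float:
    """Decomposed Requirements Following Ratio: the fraction of decomposed
    criteria satisfied across all items. `judge` answers one yes/no question
    per criterion; in the paper this role is played by human experts,
    crowd-sourced workers, or GPT-4."""
    satisfied = total = 0
    for item in items:
        for criterion in item["criteria"]:
            satisfied += bool(judge(item["instruction"], item["response"], criterion))
            total += 1
    return satisfied / total if total else 0.0

# Toy judge: deterministically checks the sentence-count criterion and
# accepts the rest (a real setup would ask an annotator or an LLM each question).
def toy_judge(instruction: str, response: str, criterion: str) -> bool:
    if "three sentences" in criterion:
        return response.count(".") == 3
    return True

print(f"DRFR = {drfr([EXAMPLE], toy_judge):.2f}")  # -> DRFR = 1.00
```

Scoring each decomposed criterion independently, rather than assigning one holistic rating per response, is what the abstract credits for DRFR's higher reliability relative to traditional scoring methods.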
Related papers
- Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data
This paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset.
It is based on a manually annotated, fine-grained, multi-level category tree with 130 nodes derived from real-world user requests.
arXiv Detail & Related papers (2024-07-04T13:54:41Z)
- Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants
We introduce DeMoRecon, a technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants.
Based on DeMoRecon, we developed the FGIV dataset, which contains fine-grained instruction variants of 1,773 seed instructions.
Our findings show that LLMs fine-tuned with FGIV gain a significant performance boost on both our own and commonly used instruction-following benchmarks.
arXiv Detail & Related papers (2024-06-17T08:08:11Z)
- Large Language Models as Evaluators for Recommendation Explanations
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a three-level meta-evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible, and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z)
- FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models
The FollowEval benchmark is composed of instances in both English and Chinese.
Each test example is designed to evaluate more than one dimension.
We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans.
arXiv Detail & Related papers (2023-11-16T11:53:31Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Which is better? Exploring Prompting Strategy For LLM-based Metrics
This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task.
Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
arXiv Detail & Related papers (2023-11-07T06:36:39Z)
- Evaluating Large Language Models at Evaluating Instruction Following
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- A Survey on Large Language Models for Recommendation
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP).
This survey presents a taxonomy that categorizes these models into two major paradigms: Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec).
arXiv Detail & Related papers (2023-05-31T13:51:26Z)
- Towards Building the Federated GPT: Federated Instruction Tuning
This paper introduces Federated Instruction Tuning (FedIT) as the learning framework for the instruction tuning of large language models (LLMs).
We demonstrate that by exploiting the heterogeneous and diverse sets of instructions on clients' ends with FedIT, we improved the performance of LLMs compared to centralized training with only limited local instructions.
arXiv Detail & Related papers (2023-05-09T17:42:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.