Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization
- URL: http://arxiv.org/abs/2310.08394v2
- Date: Fri, 20 Oct 2023 10:42:59 GMT
- Title: Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization
- Authors: Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune
- Abstract summary: We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
- Score: 9.686937153317809
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite recent advances, evaluating how well large language models (LLMs)
follow user instructions remains an open problem. While prompt-based approaches
to evaluating language models have become increasingly common, little work has
examined the correctness of these methods. In this work, we perform a
meta-evaluation of a variety of metrics to quantify how accurately they measure
the instruction-following abilities of LLMs. Our investigation is performed on
grounded query-based summarization by collecting a new short-form, real-world
dataset riSum, containing 300 document-instruction pairs with 3 answers each.
All 900 answers are rated by 3 human annotators. Using riSum, we analyze the
agreement between evaluation methods and human judgment. Finally, we propose
new LLM-based reference-free evaluation methods that improve upon established
baselines and perform on par with costly reference-based metrics that require
high-quality summaries.
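To make the agreement analysis concrete, below is a minimal sketch of correlating one automatic metric's scores with averaged human ratings. It is an illustration only, not the paper's protocol: the data shapes, toy numbers, and the choice of Kendall's tau are assumptions.

```python
# Illustrative sketch: agreement between an automatic metric and human ratings.
# The toy data and the choice of Kendall's tau are assumptions, not taken from the paper.
from scipy.stats import kendalltau


def metric_human_agreement(metric_scores, human_ratings):
    """Rank correlation between one automatic metric and averaged human ratings.

    metric_scores: one float per answer.
    human_ratings: per-answer lists of ratings from several annotators.
    """
    mean_human = [sum(r) / len(r) for r in human_ratings]
    tau, p_value = kendalltau(metric_scores, mean_human)
    return tau, p_value


# Toy example: three answers, each rated by three annotators.
scores = [0.71, 0.35, 0.90]
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4]]
print(metric_human_agreement(scores, ratings))
```

A reference-based metric and a reference-free one can be compared by running the same correlation for each and checking which tracks the human ratings more closely.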
Related papers
- ReIFE: Re-evaluating Instruction-Following Evaluation [105.75525154888655]
We present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 proposed evaluation protocols.
Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness.
arXiv Detail & Related papers (2024-10-09T17:14:50Z)
- Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation [2.4889060833127665]
In this paper, we focus on evaluating the instruction-following ability of Large Language Models (LLMs) in the context of story-ending generation.
We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story-ending reflects the instruction.
arXiv Detail & Related papers (2024-06-24T06:53:36Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z)
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity [3.3162484539136416]
We propose a simple but remarkably effective evaluation metric called SemScore.
We compare model outputs to gold target responses using semantic textual similarity (STS); a minimal sketch of this idea appears after this list.
We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation with human evaluation.
arXiv Detail & Related papers (2024-01-30T14:52:50Z)
- LLMEval: A Preliminary Study on How to Evaluate Large Language Models [47.12588320134504]
We analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, drawing on onsite, crowd-sourced, and public annotators as well as GPT-4.
A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results.
arXiv Detail & Related papers (2023-12-12T16:14:43Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions, together with answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
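As a companion to the SemScore entry above, here is a minimal sketch of scoring a model output against a gold response with semantic textual similarity. The embedding model name and the use of cosine similarity are assumptions for illustration; the paper's exact setup may differ.

```python
# Minimal STS sketch in the spirit of SemScore (not the authors' implementation).
# The embedding model "all-MiniLM-L6-v2" is an assumed choice for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def sts_score(model_output: str, gold_response: str) -> float:
    """Cosine similarity between the embeddings of an output and its gold response."""
    embeddings = model.encode([model_output, gold_response], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()


print(sts_score("The memo summarizes Q3 revenue growth.",
                "Q3 revenue increased, as the memo explains."))
```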
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.