Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
- URL: http://arxiv.org/abs/2305.13091v2
- Date: Fri, 20 Oct 2023 03:47:27 GMT
- Title: Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
- Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
- Abstract summary: We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
- Score: 66.08074487429477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent undeniable advancement in reasoning abilities in large
language models (LLMs) like ChatGPT and GPT-4, there is a growing trend for
using LLMs on various tasks. One area where LLMs can be employed is as an
alternative evaluation metric for complex generative tasks, which generally
demands expensive human judges to complement the traditional automatic metrics
for various evaluation dimensions such as fluency and consistency. In this
work, we conduct extensive analysis to investigate the stability and
reliability of LLMs as automatic evaluators for abstractive summarization. We
found that while ChatGPT and GPT-4 outperform the commonly used automatic
metrics, they are not ready as human replacements due to significant
limitations. Specifically, LLM evaluators rate each candidate system inconsistently, and their reliability varies across evaluation dimensions. They also struggle to compare candidates with close performance and become less reliable on higher-quality summaries, where their correlation with human judgments drops. In other words, as better abstractive summarization systems are introduced at a fast pace, LLM evaluators may produce misleading and unreliable assessments.
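Below is a minimal sketch of the meta-evaluation setup the abstract refers to: comparing how an LLM evaluator and human annotators rank candidate summarization systems via rank correlation. All scores are illustrative placeholders, not data from the paper.

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical mean scores per candidate summarization system,
# for one evaluation dimension (e.g. consistency).
human_scores = [4.2, 3.8, 3.9, 4.5, 3.1]  # human annotator averages
llm_scores   = [4.0, 4.1, 3.7, 4.4, 3.5]  # LLM-evaluator averages

# System-level rank correlation between the LLM evaluator and humans.
tau, _ = kendalltau(human_scores, llm_scores)
rho, _ = spearmanr(human_scores, llm_scores)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")

# The failure mode described above: when systems have close human scores,
# the LLM evaluator's ordering diverges and these correlations drop.
```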
Related papers
- Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and at generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
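As a rough illustration of what projecting LLM representations for evaluation can look like in practice (a generic sketch, not necessarily the RepEval formulation), one can fit a scoring direction on a small calibration set of representations with human quality labels and score new texts by projecting onto it; all arrays below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical LLM representations (n_texts x hidden_dim) and human
# quality labels for a small calibration set.
calib_reps = rng.normal(size=(50, 768))
calib_quality = rng.uniform(1, 5, size=50)

# Fit a projection direction w by least squares: calib_reps @ w ~ quality.
w, *_ = np.linalg.lstsq(calib_reps, calib_quality, rcond=None)

# Score new candidate outputs by projecting their representations onto w.
new_reps = rng.normal(size=(3, 768))
print(new_reps @ w)
```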
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system.
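A minimal sketch of that last aggregation step, turning pairwise comparison outcomes into a global ranking with Elo updates; model names and match outcomes are illustrative placeholders.

```python
K = 32  # standard Elo update step

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, score_a: float) -> None:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] -= K * (score_a - e_a)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Each tuple: (model_1, model_2, outcome for model_1) from one pairwise judgment.
matches = [("model_a", "model_b", 1.0),
           ("model_b", "model_c", 0.5),
           ("model_a", "model_c", 1.0)]
for a, b, s in matches:
    update(ratings, a, b, s)

print(sorted(ratings, key=ratings.get, reverse=True))  # global ranking
```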
arXiv Detail & Related papers (2024-04-10T01:26:24Z) - An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment [10.816677544269782]
We design an error-based human annotation framework to assess GPT-4's simplification capabilities.
Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art.
arXiv Detail & Related papers (2024-03-08T00:19:24Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Text Style Transfer Evaluation Using Large Language Models [24.64611983641699]
Large Language Models (LLMs) have shown their capacity to match and even exceed average human performance.
We compare the results of different LLMs in TST using multiple input prompts.
Our findings highlight a strong correlation between (even zero-shot) prompted LLM evaluation and human evaluation, showing that LLMs often outperform traditional automated metrics.
arXiv Detail & Related papers (2023-08-25T13:07:33Z) - Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to mainstream large language models (LLMs), including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level.
We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
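A minimal sketch of a G-Eval-style evaluation call consistent with the description above: a criterion definition and chain-of-thought evaluation steps are filled into a prompt template, and a 1-5 rating is parsed from the model's reply. The template wording is paraphrased, and `query_llm` is a hypothetical stand-in for whatever chat-completion API is used.

```python
import re

TEMPLATE = """You will be given one summary written for a source article.
Your task is to rate the summary on one metric.

Evaluation criterion ({metric}):
{criterion}

Evaluation steps:
{steps}

Source article:
{article}

Summary:
{summary}

Rate the summary for {metric} on a scale of 1 to 5. Answer with the number only."""

def g_eval_score(article: str, summary: str, metric: str,
                 criterion: str, steps: str, query_llm) -> int:
    """Return the LLM's 1-5 rating for one summary on one dimension."""
    prompt = TEMPLATE.format(metric=metric, criterion=criterion, steps=steps,
                             article=article, summary=summary)
    reply = query_llm(prompt)            # e.g. a single GPT-4 chat completion
    match = re.search(r"[1-5]", reply)   # form-filling: extract the rating
    return int(match.group()) if match else 0
```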
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, and not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.