An In-depth Evaluation of GPT-4 in Sentence Simplification with
Error-based Human Assessment
- URL: http://arxiv.org/abs/2403.04963v1
- Date: Fri, 8 Mar 2024 00:19:24 GMT
- Title: An In-depth Evaluation of GPT-4 in Sentence Simplification with
Error-based Human Assessment
- Authors: Xuanxin Wu and Yuki Arase
- Abstract summary: We design an error-based human annotation framework to assess GPT-4's simplification capabilities.
Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art.
- Score: 10.816677544269782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentence simplification, which rewrites a sentence to be easier to read and
understand, is a promising technique to help people with various reading
difficulties. With the rise of advanced large language models (LLMs),
evaluating their performance in sentence simplification has become imperative.
Recent studies have used both automatic metrics and human evaluations to assess
the simplification abilities of LLMs. However, the suitability of existing
evaluation methodologies for LLMs remains in question. First, it is still
uncertain whether current automatic metrics are suitable for evaluating LLMs'
simplifications. Second, current human evaluation approaches in sentence
simplification often fall into two extremes: they are either too superficial,
failing to offer a clear understanding of the models' performance, or overly
detailed, making the annotation process complex and prone to inconsistency,
which in turn affects the evaluation's reliability. To address these problems,
this study provides in-depth insights into LLMs' performance while ensuring the
reliability of the evaluation. We design an error-based human annotation
framework to assess GPT-4's simplification capabilities. Results show that
GPT-4 generally generates fewer erroneous simplification outputs compared to
the current state-of-the-art. However, LLMs have their limitations, as seen in
GPT-4's struggles with lexical paraphrasing. Furthermore, we conduct
meta-evaluations on widely used automatic metrics using our human annotations.
We find that while these metrics are effective at capturing large quality
differences, they lack the sensitivity needed to assess the overall
high-quality simplifications produced by GPT-4.
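The abstract does not spell out how the meta-evaluation of automatic metrics against the human annotations is performed; a common approach is to correlate each metric's scores with a quality signal derived from the annotated errors. The sketch below is a minimal, hypothetical illustration of such a correlation-based meta-evaluation in Python; the metric names, scores, and error counts are made-up placeholders, not data from the paper.

```python
# Hypothetical meta-evaluation sketch: correlate automatic metric scores
# with human judgments derived from error-based annotations.
# All values below are illustrative; the paper's metrics, annotation
# scheme, and aggregation may differ.
from scipy.stats import spearmanr

# Automatic metric scores for a set of simplification outputs (made up).
metric_scores = {
    "metric_A": [0.72, 0.65, 0.80, 0.58, 0.77],
    "metric_B": [34.1, 29.8, 40.2, 27.5, 38.6],
}

# Number of errors annotated per output by the human framework (made up);
# fewer errors means higher quality.
human_error_counts = [1, 3, 0, 4, 1]

# Turn error counts into a quality signal so that "more errors" maps to
# "lower quality".
human_quality = [-count for count in human_error_counts]

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(scores, human_quality)
    print(f"{name}: Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A metric that tracks the human annotations well should show a strong positive correlation; the abstract's finding suggests such correlations weaken when all outputs are of similarly high quality, since the remaining differences are too small for the metrics to resolve.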
Related papers
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - InFoBench: Evaluating Instruction Following Ability in Large Language
Models [57.27152890085759]
Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
arXiv Detail & Related papers (2024-01-07T23:01:56Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large
Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z) - Simplicity Level Estimate (SLE): A Learned Reference-Less Metric for
Sentence Simplification [8.479659578608233]
We propose a new learned evaluation metric (SLE) for sentence simplification.
SLE focuses on simplicity, outperforming almost all existing metrics in terms of correlation with human judgements.
arXiv Detail & Related papers (2023-10-12T09:49:10Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Dancing Between Success and Failure: Edit-level Simplification
Evaluation using SALSA [21.147261039292026]
We introduce SALSA, an edit-based human annotation framework.
We develop twenty-one linguistically grounded edit types, covering the full spectrum of success and failure.
We develop LENS-SALSA, a reference-free automatic simplification metric, trained to predict sentence- and word-level quality simultaneously.
arXiv Detail & Related papers (2023-05-23T18:30:49Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)