Exploring the Use of Large Language Models for Reference-Free Text
Quality Evaluation: An Empirical Study
- URL: http://arxiv.org/abs/2304.00723v3
- Date: Mon, 18 Sep 2023 03:52:18 GMT
- Title: Exploring the Use of Large Language Models for Reference-Free Text
Quality Evaluation: An Empirical Study
- Authors: Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu
- Abstract summary: ChatGPT is capable of evaluating text quality effectively from various perspectives without references.
The Explicit Score, which prompts ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored.
- Score: 63.27346930921658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the quality of generated text is a challenging task in NLP, due to
the inherent complexity and diversity of text. Recently, large language models
(LLMs) have garnered significant attention due to their impressive performance
in various tasks. In this paper, we investigate the effectiveness of LLMs,
especially ChatGPT, and explore ways to optimize their use in assessing text
quality. We compare three kinds of reference-free evaluation methods. The
experimental results show that ChatGPT can evaluate text quality effectively
from various perspectives without references and outperforms most existing
automatic metrics. In particular, the Explicit Score, which prompts ChatGPT to
generate a numeric score measuring text quality, is the most effective and
reliable of the three approaches, whereas directly comparing the quality of two
texts can lead to suboptimal results. We believe this paper offers valuable
insights for evaluating text quality with LLMs, and we have released the data
used in our experiments.
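To make the Explicit Score concrete, here is a minimal sketch of a prompt-based scorer. The prompt wording, model name, and 1-10 scale are illustrative assumptions rather than the paper's exact protocol, and it assumes the official OpenAI Python client.

    # Sketch of Explicit Score evaluation: ask a chat model for a numeric
    # quality score. Prompt wording and the 1-10 scale are assumptions.
    import re
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def explicit_score(text: str, aspect: str = "overall quality") -> float:
        prompt = (
            f"Rate the {aspect} of the following text on a scale from 1 to 10. "
            "Respond with only the number.\n\n"
            f"Text: {text}"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic scoring helps reliability
        )
        reply = resp.choices[0].message.content
        match = re.search(r"\d+(?:\.\d+)?", reply)
        if match is None:
            raise ValueError(f"No numeric score in reply: {reply!r}")
        return float(match.group())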
Related papers
- Multi-Facet Counterfactual Learning for Content Quality Evaluation [48.73583736357489]
We propose a framework for efficiently constructing evaluators that perceive multiple facets of content quality.
We leverage a joint training strategy based on contrastive learning and supervised learning to enable the evaluator to distinguish between different quality facets.
arXiv Detail & Related papers (2024-10-10T08:04:10Z)
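The joint contrastive-plus-supervised training mentioned above can be sketched as a weighted sum of two losses; this is an illustrative PyTorch sketch with assumed inputs, not the paper's actual training code.

    # Sketch: joint contrastive (InfoNCE) + supervised objective.
    # Shapes and the 0.5 mixing weight are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def joint_loss(emb_a, emb_b, logits, labels, temperature=0.1, alpha=0.5):
        # emb_a, emb_b: (batch, dim) paired views of the same content,
        # assumed L2-normalized; logits/labels: a supervised quality head.
        sims = emb_a @ emb_b.T / temperature
        targets = torch.arange(sims.size(0), device=sims.device)
        contrastive = F.cross_entropy(sims, targets)  # match each pair
        supervised = F.cross_entropy(logits, labels)
        return alpha * contrastive + (1 - alpha) * supervised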
- Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract; average ChatGPT scores based on these correlate at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z)
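Meta-evaluation of this kind reduces to correlating the model's scores with human ratings; a minimal sketch, with made-up numbers standing in for real data:

    # Sketch: correlate LLM judge scores with human ratings.
    # The score lists are fabricated purely for illustration.
    from scipy.stats import pearsonr, spearmanr

    human_scores = [3.0, 4.5, 2.0, 5.0, 3.5]  # illustrative human ratings
    llm_scores = [3.2, 4.0, 2.5, 4.8, 3.0]    # illustrative averaged LLM scores

    r, _ = pearsonr(human_scores, llm_scores)     # linear agreement
    rho, _ = spearmanr(human_scores, llm_scores)  # rank agreement
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")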
- Check-Eval: A Checklist-based Approach for Evaluating Text Quality [3.031375888004876]
Check-Eval can be employed as both a reference-free and reference-dependent evaluation method.
Check-Eval achieves higher correlations with human judgments compared to existing metrics.
arXiv Detail & Related papers (2024-07-19T17:14:16Z)
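A checklist-style evaluator in the spirit of Check-Eval can be sketched as below; the checklist items and the ask_llm helper are invented for illustration, and the paper's actual checklist construction is more involved.

    # Sketch: score a text as the fraction of checklist items it passes.
    # CHECKLIST items are invented; ask_llm is a placeholder for any
    # chat-completion call (e.g. the client from the earlier sketch).
    CHECKLIST = [
        "Does the text cover the main points of the source?",
        "Is the text free of factual errors?",
        "Is the text fluent and grammatical?",
    ]

    def checklist_score(text: str, ask_llm) -> float:
        passes = 0
        for item in CHECKLIST:
            answer = ask_llm(f"{item}\n\nText: {text}\n\nAnswer yes or no.")
            passes += answer.strip().lower().startswith("yes")
        return passes / len(CHECKLIST)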
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
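The precision-and-recall framing treats quality and diversity as mutual coverage between the distributions of real and generated text. Below is a numpy sketch of support-based precision/recall over text embeddings; the k-nearest-neighbour ball construction is a common recipe assumed here, not necessarily the paper's exact estimator.

    # Sketch: distribution-level precision/recall over text embeddings.
    # Precision: generated points inside the real data's support;
    # recall: real points inside the generated support.
    import numpy as np

    def knn_radii(X, k=3):
        # Distance from each point to its k-th nearest neighbour in X.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        d.sort(axis=1)
        return d[:, k]  # column 0 is the zero self-distance

    def coverage(A, B, radii_B):
        # Fraction of points in A inside at least one ball around B.
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return float(np.mean((d <= radii_B[None, :]).any(axis=1)))

    real = np.random.randn(200, 64)       # stand-in real-text embeddings
    generated = np.random.randn(200, 64)  # stand-in generated embeddings

    precision = coverage(generated, real, knn_radii(real))
    recall = coverage(real, generated, knn_radii(generated))
    print(f"precision={precision:.2f}, recall={recall:.2f}")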
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [17.166794984161964]
We show that ChatGPT can evaluate factual inconsistency under a zero-shot setting.
It generally outperforms previous evaluation metrics on binary entailment inference, summary ranking, and consistency rating.
However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
arXiv Detail & Related papers (2023-03-27T22:30:39Z)
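Zero-shot factual-consistency checking of this kind reduces to a binary entailment prompt; a minimal sketch, where ask_llm again stands in for any chat-completion call and the prompt wording is an assumption:

    # Sketch: zero-shot binary entailment check for summary consistency.
    # ask_llm is a placeholder; the prompt wording is illustrative.
    def is_consistent(document: str, summary: str, ask_llm) -> bool:
        prompt = (
            "Decide whether the summary is factually consistent with the "
            "document. Answer yes or no.\n\n"
            f"Document: {document}\n\nSummary: {summary}"
        )
        return ask_llm(prompt).strip().lower().startswith("yes")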
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that comprehensively evaluates generated text against reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- TextGAIL: Generative Adversarial Imitation Learning for Text Generation [68.3579946817937]
We propose a generative adversarial imitation learning framework for text generation that uses large pre-trained language models to provide more reliable reward guidance.
Our approach uses a contrastive discriminator and proximal policy optimization (PPO) to stabilize and improve text generation performance.
arXiv Detail & Related papers (2020-04-07T00:24:35Z)