Check-Eval: A Checklist-based Approach for Evaluating Text Quality
- URL: http://arxiv.org/abs/2407.14467v1
- Date: Fri, 19 Jul 2024 17:14:16 GMT
- Title: Check-Eval: A Checklist-based Approach for Evaluating Text Quality
- Authors: Jayr Pereira, Roberto Lotufo
- Abstract summary: Check-Eval is a checklist-based evaluation framework that leverages large language models (LLMs) to assess the quality of generated text.
Check-Eval can be employed as both a reference-free and reference-dependent evaluation method.
We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval.
- Score: 3.4069627091757178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose Check-Eval, a novel evaluation framework leveraging LLMs to assess the quality of generated text through a checklist-based approach. Check-Eval can be employed as both a reference-free and reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval. Our results demonstrate that Check-Eval achieves higher correlations with human judgments compared to existing metrics, such as G-Eval and GPTScore, underscoring its potential as a more reliable and effective evaluation framework for natural language generation tasks. The code for our experiments is available at https://anonymous.4open.science/r/check-eval-0DB4.
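The two stages named in the abstract (checklist generation, then checklist evaluation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the prompts, the `ChecklistItem` structure, the `call_llm` helper, and the pass-rate scoring are placeholders (the actual code is in the repository linked above).

```python
# Minimal sketch of a checklist-style evaluation loop in the spirit of Check-Eval.
# The prompts, the ChecklistItem structure, and the call_llm helper are illustrative
# assumptions, not the authors' implementation (see the repository linked above).

from dataclasses import dataclass
from typing import Optional

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's text reply."""
    raise NotImplementedError

@dataclass
class ChecklistItem:
    question: str                   # one yes/no question about a quality aspect
    passed: Optional[bool] = None   # filled in during evaluation

def generate_checklist(source_text: str, criterion: str) -> list[ChecklistItem]:
    """Stage 1 (checklist generation): derive key points the candidate should satisfy."""
    prompt = (
        f"Read the text below and list, one per line, the key points a good output "
        f"must satisfy with respect to {criterion}.\n\n{source_text}"
    )
    lines = [ln.strip("- ").strip() for ln in call_llm(prompt).splitlines() if ln.strip()]
    return [ChecklistItem(question=ln) for ln in lines]

def evaluate_with_checklist(candidate_text: str, checklist: list[ChecklistItem]) -> float:
    """Stage 2 (checklist evaluation): check each item and return the pass rate."""
    for item in checklist:
        prompt = (
            f"Candidate text:\n{candidate_text}\n\n"
            f"Does the candidate satisfy the following point? Answer yes or no.\n"
            f"{item.question}"
        )
        item.passed = call_llm(prompt).strip().lower().startswith("yes")
    return sum(1 for item in checklist if item.passed) / max(len(checklist), 1)
```

In a reference-dependent setup the checklist would be generated from a reference text rather than the source document; the evaluation stage stays the same.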
Related papers
- MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
Recent advances in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues.
We propose MATEval, a Multi-Agent Text Evaluation framework.
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z)
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
The proposed Eval-Instruct method acquires pointwise grading critiques with pseudo references and revises these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated and reference texts from both objective and subjective aspects, providing a comprehensive assessment.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
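Most of the papers above, like Check-Eval itself, are meta-evaluated by correlating metric scores with human judgments on the same outputs. The snippet below is a rough sketch of that protocol only; the numbers are placeholders, not results from any of these papers.

```python
# Sketch of the usual meta-evaluation step: correlate automatic metric scores
# with human ratings of the same outputs. All numbers are placeholders.
from scipy.stats import kendalltau, spearmanr

human_scores  = [4.0, 2.5, 3.0, 5.0, 1.5]       # annotator ratings per output
metric_scores = [0.82, 0.40, 0.55, 0.91, 0.30]  # metric scores for the same outputs

tau, _ = kendalltau(human_scores, metric_scores)
rho, _ = spearmanr(human_scores, metric_scores)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```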
This list is automatically generated from the titles and abstracts of the papers in this site.