Related papers: Check-Eval: A Checklist-based Approach for Evaluating Text Quality

Check-Eval: A Checklist-based Approach for Evaluating Text Quality

URL: http://arxiv.org/abs/2407.14467v2
Date: Tue, 10 Sep 2024 14:08:29 GMT
Title: Check-Eval: A Checklist-based Approach for Evaluating Text Quality
Authors: Jayr Pereira, Andre Assumpcao, Roberto Lotufo,
Abstract summary: textscCheck-Eval can be employed as both a reference-free and reference-dependent evaluation method. textscCheck-Eval achieves higher correlations with human judgments compared to existing metrics.
Score: 3.031375888004876
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose \textsc{Check-Eval}, a novel evaluation framework leveraging LLMs to assess the quality of generated text through a checklist-based approach. \textsc{Check-Eval} can be employed as both a reference-free and reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate \textsc{Check-Eval} on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and \textsc{SummEval}. Our results demonstrate that \textsc{Check-Eval} achieves higher correlations with human judgments compared to existing metrics, such as \textsc{G-Eval} and \textsc{GPTScore}, underscoring its potential as a more reliable and effective evaluation framework for natural language generation tasks. The code for our experiments is available at \url{https://anonymous.4open.science/r/check-eval-0DB4}

Related papers

CEval: A Benchmark for Evaluating Counterfactual Text Generation [2.899704155417792]
We propose CEval, a benchmark for comparing counterfactual text generation methods. Our experiments found no perfect method for generating counterfactual text. By making CEval available as an open-source Python library, we encourage the community to contribute more methods.
arXiv Detail & Related papers (2024-04-26T15:23:47Z)
Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models [1.565361244756411]
This paper explores how large language models (LLMs) can be used to generate and evaluate reading comprehension items. We developed a protocol for human and automatic evaluation, including a metric we call text informativity. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2.
arXiv Detail & Related papers (2024-04-11T13:11:21Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting. CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task. We decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind. A final factuality score is computed by an adjustable scoring mechanism. Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference. The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three exploited approaches.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose a unified multi-dimensional evaluator UniEval for Natural Language Generation (NLG) We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text. GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.