A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
- URL: http://arxiv.org/abs/2502.12052v1
- Date: Mon, 17 Feb 2025 17:22:49 GMT
- Title: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
- Authors: Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
- Abstract summary: We propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities.
We also introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations.
- Score: 36.83105355430611
- License:
- Abstract: In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
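The abstract refers to the traditional setup in which metrics are assessed by their correlation with human ratings. As a point of reference only, the following is a minimal sketch of that standard setup (my own illustration using SciPy, not code from this paper), computing sample-level and system-level Pearson, Spearman, and Kendall correlations between a metric's scores and human scores.

```python
# Minimal sketch (illustrative assumption, not from the paper) of correlation-based
# NLG meta-evaluation: compare a metric's scores with human ratings using
# Pearson, Spearman, and Kendall correlations at the sample and system levels.
from scipy.stats import pearsonr, spearmanr, kendalltau


def sample_level_correlations(metric_scores, human_scores):
    """Correlate metric and human scores across individual outputs."""
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores)[0],
        "kendall": kendalltau(metric_scores, human_scores)[0],
    }


def system_level_correlations(metric_by_system, human_by_system):
    """Correlate per-system mean scores; both lists share the same system order."""
    metric_means = [sum(scores) / len(scores) for scores in metric_by_system]
    human_means = [sum(scores) / len(scores) for scores in human_by_system]
    return sample_level_correlations(metric_means, human_means)


# Toy data: three systems, four outputs each (hypothetical values).
metric_by_system = [[0.7, 0.8, 0.6, 0.9], [0.4, 0.5, 0.3, 0.6], [0.8, 0.9, 0.7, 0.95]]
human_by_system = [[4, 5, 3, 5], [2, 3, 2, 3], [5, 5, 4, 5]]

print(system_level_correlations(metric_by_system, human_by_system))
```

The choice between sample-level and system-level aggregation, and among the three correlation coefficients, is exactly the kind of ambiguity in traditional meta-evaluation that the paper's dual-perspective framework aims to address.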
Related papers
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus, NLG-Eval, with annotations from both human and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z) - DEBATE: Devil's Advocate-Based Assessment and Text Evaluation [6.2689399557794525]
We propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system.
Within the framework, one agent is instructed to criticize other agents' arguments.
We show that the extensiveness of debates among agents and the persona of an agent can influence the performance of evaluators.
arXiv Detail & Related papers (2024-05-16T09:41:12Z) - Evaluation in Neural Style Transfer: A Review [0.7614628596146599]
We provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices.
We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons but will also enhance the comprehension and interpretation of research findings in the field.
arXiv Detail & Related papers (2024-01-30T15:45:30Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate text along both seen and unseen aspects customized by end users.
X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality.
arXiv Detail & Related papers (2023-11-15T09:01:55Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Hierarchical Evaluation Framework: Best Practices for Human Evaluation [17.91641890651225]
The absence of widely accepted human evaluation metrics in NLP hampers fair comparisons among different systems and the establishment of universal assessment standards.
We develop our own hierarchical evaluation framework to provide a more comprehensive representation of the NLP system's performance.
In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
arXiv Detail & Related papers (2023-10-03T09:46:02Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.