Large Language Models Are Active Critics in NLG Evaluation
- URL: http://arxiv.org/abs/2410.10724v1
- Date: Mon, 14 Oct 2024 17:04:41 GMT
- Title: Large Language Models Are Active Critics in NLG Evaluation
- Authors: Shuying Xu, Junjie Hu, Ming Jiang,
- Abstract summary: We introduce Active-Critic, a novel method for evaluating natural language generation (NLG) systems.
The protocol enables large language models (LLMs) to function as ''active critics''
Experiments show that our approach achieves stronger alignment with human judgments than state-of-the-art evaluation methods.
- Score: 9.932334723464129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conventional paradigm of using large language models (LLMs) for evaluating natural language generation (NLG) systems typically relies on two key inputs: (1) a clear definition of the NLG task to be evaluated and (2) a list of pre-defined evaluation criteria. This process treats LLMs as ''passive critics,'' strictly following human-defined criteria for evaluation. However, as new NLG tasks emerge, the criteria for assessing text quality can vary greatly. Consequently, these rigid evaluation methods struggle to adapt to diverse NLG tasks without extensive prompt engineering customized for each specific task. To address this limitation, we introduce Active-Critic, a novel LLM-based NLG evaluation protocol that enables LLMs to function as ''active critics.'' Specifically, our protocol comprises two key stages. In the first stage, the LLM is instructed to infer the target NLG task and establish relevant evaluation criteria from the data. Building on this self-inferred information, the second stage dynamically optimizes the prompt to guide the LLM toward more human-aligned scoring decisions, while also generating detailed explanations to justify its evaluations. Experiments across four NLG evaluation tasks show that our approach achieves stronger alignment with human judgments than state-of-the-art evaluation methods. Our comprehensive analysis further highlights the effectiveness and explainability of Active-Critic with only a small amount of labeled data. We will share our code and data on GitHub.
Related papers
- Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance.
Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks.
Our method also exhibits a improvement in LC Win-Rate in AlpacaEval2 leaderboard up to a 12% when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z) - Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus NLG-Eval with annotations from both human and GPT-4.
We also propose an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z) - Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences [11.23629471911503]
EvalGen provides automated assistance to users in generating evaluation criteria and implementing assertions.
A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment.
We identify a phenomenon we dub emphcriteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria.
arXiv Detail & Related papers (2024-04-18T15:45:27Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - LLM-based NLG Evaluation: Current Status and Challenges [41.69249290537395]
evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
arXiv Detail & Related papers (2024-02-02T13:06:35Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.