Large Language Models Are Active Critics in NLG Evaluation
- URL: http://arxiv.org/abs/2410.10724v2
- Date: Mon, 17 Feb 2025 10:07:26 GMT
- Title: Large Language Models Are Active Critics in NLG Evaluation
- Authors: Shuying Xu, Junjie Hu, Ming Jiang,
- Abstract summary: Active-Critic is a novel evaluator that transforms large language models (LLMs) into "active critics"
Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria.
- Score: 9.932334723464129
- License:
- Abstract: The conventional paradigm of using large language models (LLMs) for natural language generation (NLG) evaluation relies on pre-defined task definitions and evaluation criteria, positioning LLMs as "passive critics" that strictly follow developer-provided guidelines. However, human evaluators often apply implicit criteria, and their expectations in practice can vary widely based on specific end-user needs. Consequently, these rigid evaluation methods struggle to adapt to diverse scenarios without extensive prompt customization. To address this, we introduce Active-Critic, a novel LLM-based evaluator that transforms LLMs into "active critics'' capable of adapting to diverse NLG tasks using limited example data. Active-Critic consists of two stages: (1) self-inferring the target NLG task and relevant evaluation criteria, and (2) dynamically optimizing prompts to produce human-aligned scores along with detailed justifications. Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria, enabling it to achieve superior alignment with human judgments across multiple tasks.
Related papers
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus NLG-Eval with annotations from both human and GPT-4.
We also propose an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z) - TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot [2.186726107112913]
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria, and uses in-context learning (ICL) to teach judge model these in-house criteria.
TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments.
arXiv Detail & Related papers (2024-06-25T10:02:42Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning)
We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - EvalLM: Interactive Evaluation of Large Language Model Prompts on
User-Defined Criteria [43.944632774725484]
We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria.
By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail.
A comparative study showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions.
arXiv Detail & Related papers (2023-09-24T13:19:38Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.