HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition
- URL: http://arxiv.org/abs/2402.15754v1
- Date: Sat, 24 Feb 2024 08:01:32 GMT
- Title: HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition
- Authors: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang,
Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
- Abstract summary: HD-Eval is a framework that iteratively aligns large language model evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
- Score: 92.17397504834825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have emerged as a promising alternative to
expensive human evaluations. However, the alignment and coverage of LLM-based
evaluations are often limited by the scope and potential bias of the evaluation
prompts and criteria. To address this challenge, we propose HD-Eval, a novel
framework that iteratively aligns LLM-based evaluators with human preference
via Hierarchical Criteria Decomposition. HD-Eval inherits the essence from the
evaluation mindset of human experts and enhances the alignment of LLM-based
evaluators by decomposing a given evaluation task into finer-grained criteria,
aggregating them according to estimated human preferences, pruning
insignificant criteria with attribution, and further decomposing significant
criteria. By integrating these steps within an iterative alignment training
process, we obtain a hierarchical decomposition of criteria that
comprehensively captures aspects of natural language at multiple levels of
granularity. Implemented as a white box, the human preference-guided aggregator
is efficient to train and more explainable than relying solely on prompting,
and its independence from model parameters makes it applicable to closed-source
LLMs. Extensive experiments on three evaluation domains demonstrate the
superiority of HD-Eval in further aligning state-of-the-art evaluators and
providing deeper insights into the explanation of evaluation results and the
task itself.
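The aggregate, attribute, and prune steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the aggregator is taken to be an ordinary least-squares linear model, attribution is taken to be the absolute weight times the spread of each criterion's scores, and the criterion names, toy scores, and `keep_ratio` threshold are hypothetical. The LLM calls that would produce per-criterion scores, and the step that further decomposes the kept criteria, are not shown.

```python
from statistics import pstdev


def fit_linear_aggregator(scores, human):
    """Least-squares fit of human ~ w . scores + b via the normal equations.

    Assumption: a linear model stands in for the white-box aggregator.
    """
    rows = [list(x) + [1.0] for x in scores]  # constant column for the bias
    d = len(rows[0])
    # Augmented system [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
         + [sum(r[i] * y for r, y in zip(rows, human))] for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(d):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    solution = [a[i][d] / a[i][i] for i in range(d)]
    return solution[:-1], solution[-1]


def attribute(scores, weights):
    """Assumed attribution: |weight| times the spread of that criterion's scores."""
    return [abs(w) * pstdev([row[j] for row in scores])
            for j, w in enumerate(weights)]


def prune(names, attributions, keep_ratio=0.1):
    """Keep criteria whose share of the total attribution reaches keep_ratio."""
    total = sum(attributions) or 1.0
    kept = [n for n, a in zip(names, attributions) if a / total >= keep_ratio]
    pruned = [n for n in names if n not in kept]
    return kept, pruned


# Hypothetical criteria and toy per-sample scores (one row per evaluated text).
names = ["fluency", "coherence", "formatting"]
scores = [
    [0.2, 0.9, 0.5],
    [0.8, 0.1, 0.4],
    [0.5, 0.5, 0.9],
    [0.9, 0.8, 0.1],
    [0.1, 0.3, 0.7],
    [0.6, 0.7, 0.2],
]
# Toy human ratings built as 0.3 * fluency + 0.7 * coherence.
human = [0.3 * f + 0.7 * c for f, c, _ in scores]

weights, bias = fit_linear_aggregator(scores, human)
kept, pruned = prune(names, attribute(scores, weights))
print("kept:", kept)
print("pruned:", pruned)
```

In this toy setup the human ratings do not depend on the third criterion, so its fitted weight is close to zero and it falls below the threshold. In an iterative scheme like the one the abstract outlines, the kept criteria would then be candidates for further decomposition into sub-criteria, and the fit would be repeated at the next level.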
Related papers
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Large Language Models Are Active Critics in NLG Evaluation [9.932334723464129]
Active-Critic is a novel evaluator that transforms large language models (LLMs) into "active critics".
Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria.
arXiv Detail & Related papers (2024-10-14T17:04:41Z)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences [11.23629471911503]
EvalGen provides automated assistance to users in generating evaluation criteria and implementing assertions.
A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment.
We identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria.
arXiv Detail & Related papers (2024-04-18T15:45:27Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning).
We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)