Related papers: Calibrating LLM-Based Evaluator

Calibrating LLM-Based Evaluator

URL: http://arxiv.org/abs/2309.13308v1
Date: Sat, 23 Sep 2023 08:46:11 GMT
Title: Calibrating LLM-Based Evaluator
Authors: Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Abstract summary: We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
Score: 92.17397504834825
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, hindered by the closed-source or high computational demand to host and tune, there is a lack of practice to further calibrate an off-the-shelf LLM-based evaluator towards better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.

Related papers

Learning to Judge: LLMs Designing and Applying Evaluation Rubrics [18.936553687978087]
Large language models (LLMs) are increasingly used as evaluators for natural language generation.<n>We introduce GER-Eval to investigate whether LLMs can design and apply their own evaluation rubrics.
arXiv Detail & Related papers (2026-02-09T13:56:06Z)
Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.<n>For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models [7.8905223445925055]
We propose SLMEval, a calibration method based on entropy over a small amount of human preference data.<n>It achieves strong correlation with human evaluations across two real-world production use cases and the public benchmark.<n> SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based evaluators such as G-eval.
arXiv Detail & Related papers (2025-05-21T20:40:30Z)
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z)
CritiQ: Mining Data Quality Criteria from Human Preferences [70.35346554179036]
We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality. CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We demonstrate the effectiveness of our method in the code, math, and logic domains.
arXiv Detail & Related papers (2025-02-26T16:33:41Z)
SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance. Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks. Our method also exhibits a improvement in LC Win-Rate in AlpacaEval2 leaderboard up to a 12% when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
Mitigating the Bias of Large Language Model Evaluation [30.67730115141905]
We propose systematic research about the bias of LLM-as-a-Judge. For closed-source judge models, we apply calibration to mitigate the significance of superficial quality. For open-source judge models, we propose to mitigate the bias by contrastive training, with curated negative samples that deviate from instruction but present better superficial quality.
arXiv Detail & Related papers (2024-09-25T09:52:44Z)
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Models evaluator preferences with human evaluations. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments [41.25558612970942]
We show that large language models (LLMs) exhibit preference biases and worrying sensitivity to prompt designs. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO.
arXiv Detail & Related papers (2024-06-17T09:48:53Z)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language. LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference. HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators. Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.