CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
- URL: http://arxiv.org/abs/2410.15393v1
- Date: Sun, 20 Oct 2024 13:47:39 GMT
- Title: CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
- Authors: Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, Yiqun Liu,
- Abstract summary: We introduce CalibraEval, a novel label-free method for mitigating selection bias during inference.
CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions.
We show that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods.
- Score: 21.580762639442913
- License:
- Abstract: The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as LLMs-as-Judges, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function modeling.Empirical evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments
Related papers
- Mitigating Selection Bias with Node Pruning and Auxiliary Options [11.835002896308545]
Large language models (LLMs) often show unwarranted preference for certain choice options when responding to multiple-choice questions.
Previous solutions utilized debiasing methods to adjust the model's input and/or output.
Our work, in contrast, investigates the model's internal representation of the selection bias.
arXiv Detail & Related papers (2024-09-27T15:53:54Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Inference-Time Selective Debiasing [27.578390085427156]
We propose selective debiasing -- an inference-time safety mechanism that aims to increase the overall quality of models.
We identify the potentially biased model predictions and, instead of discarding them, we debias them using LEACE -- a post-processing debiasing method.
Experiments with text classification datasets demonstrate that selective debiasing helps to close the performance gap between post-processing methods and at-training and pre-processing debiasing techniques.
arXiv Detail & Related papers (2024-07-27T21:56:23Z) - Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Models evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs)
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs)
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias"
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Optimization of Annealed Importance Sampling Hyperparameters [77.34726150561087]
Annealed Importance Sampling (AIS) is a popular algorithm used to estimates the intractable marginal likelihood of deep generative models.
We present a parameteric AIS process with flexible intermediary distributions and optimize the bridging distributions to use fewer number of steps for sampling.
We assess the performance of our optimized AIS for marginal likelihood estimation of deep generative models and compare it to other estimators.
arXiv Detail & Related papers (2022-09-27T07:58:25Z) - Calibrated Selective Classification [34.08454890436067]
We develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties.
We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model.
We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
arXiv Detail & Related papers (2022-08-25T13:31:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.