Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
- URL: http://arxiv.org/abs/2510.08145v1
- Date: Thu, 09 Oct 2025 12:32:31 GMT
- Title: Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
- Authors: Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu,
- Abstract summary: Large Language Models (LLMs) as automatic evaluators have attracted growing attention.<n>LLMs tend to favor responses generated by themselves, undermining the reliability of their judgments.<n>This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework.
- Score: 26.377421806098187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.
Related papers
- JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation [13.831735556002426]
Small language models (SLMs) have shown promise on various reasoning tasks.<n>Their ability to judge the correctness of answers remains unclear compared to large language models (LLMs)
arXiv Detail & Related papers (2025-11-20T01:14:39Z) - CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes limitations via a task-driven, multi-domain data curation strategy.<n> CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases.<n>We find that generator models can flip preferences by embedding distractor features.<n>We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs)<n>Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation.<n>This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z) - Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs [21.02718263476369]
Multimodal Large Language Models (MLLMs) can enhance trustworthiness by aligning with human preferences.
Recent works employ evaluation models for assessing MLLMs' responses, using the model-based assessments to automate preference dataset construction.
We propose DecompGen, a decomposable framework that uses an ensemble of open-sourced expert models.
arXiv Detail & Related papers (2024-11-20T20:28:04Z) - Direct Judgement Preference Optimization [79.54459973726405]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.<n>We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.<n>Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments [41.25558612970942]
We show that large language models (LLMs) exhibit preference biases and worrying sensitivity to prompt designs.
Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO.
arXiv Detail & Related papers (2024-06-17T09:48:53Z) - Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning.
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that the rejecting sampling by itself can improve performance further without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.