Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews
- URL: http://arxiv.org/abs/2509.13400v4
- Date: Fri, 26 Sep 2025 11:58:54 GMT
- Title: Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews
- Authors: Sai Suresh Macharla Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, Mario Fritz
- Abstract summary: We investigate bias in large language model (LLM)-generated peer reviews by conducting controlled experiments on sensitive metadata. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. We uncover implicit biases that become more evident with token-based soft ratings.
- Score: 38.50822587716282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing more detailed evaluations to generating entire reviews automatically. While these capabilities offer exciting opportunities, they also raise critical concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews by conducting controlled experiments on sensitive metadata, including author affiliation and gender. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. Additionally, we find some gender preferences, which, even though subtle in magnitude, have the potential to compound over time. Notably, we uncover implicit biases that become more evident with token-based soft ratings.
Related papers
- Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models [1.8349858105838042]
Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as pre-review agents.
While incredibly beneficial, automating academic peer-review, as a concept, raises concerns surrounding safety, research integrity, and the validity of the academic peer-review process.
arXiv Detail & Related papers (2025-12-14T09:56:07Z)
- When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review [34.067892820832405]
This paper presents a systematic evaluation of large language models (LLMs) as academic reviewers.
Using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, we evaluate GPT-5-mini against human reviewers across ratings, strengths, and weaknesses.
Our findings show that LLMs consistently inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions.
arXiv Detail & Related papers (2025-09-12T00:57:50Z)
- Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias.
We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases.
We find that generator models can flip preferences by embedding distractor features.
We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews [46.0003776499898]
Large Language Models (LLMs) can now draft reviews automatically.
Determining whether LLM-generated reviews are trustworthy requires systematic evaluation.
We introduce a focus-level evaluation framework that operationalizes the focus as a normalized distribution of attention.
arXiv Detail & Related papers (2025-02-24T12:05:27Z)
- Paper Quality Assessment based on Individual Wisdom Metrics from Open Peer Review [4.35783648216893]
Traditional closed peer review systems are slow, costly, non-transparent, and possibly subject to biases.
We propose and examine the efficacy and accuracy of an alternative form of scientific peer review: through an open, bottom-up process.
arXiv Detail & Related papers (2025-01-22T17:00:27Z)
- Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review.
The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.
We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons.
We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
- Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z)
- GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
- Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z)