Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers
- URL: http://arxiv.org/abs/2502.00070v2
- Date: Thu, 03 Apr 2025 02:12:13 GMT
- Title: Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers
- Authors: Pat Pataranutaporn, Nattavudh Powdthavee, Chayapatr Achiwaranguprok, Pattie Maes,
- Abstract summary: This study examines the potential of large language models (LLMs) to augment the academic peer review process by reliably evaluating the quality of economics research without introducing systematic bias.<n>We conduct one of the first large-scale experimental assessments of four LLMs across two complementary experiments.<n>We find that GPT, Gemma, and LLaMA assign significantly higher ratings to submissions from top male authors and elite institutions relative to the same papers presented anonymously.
- Score: 25.2441171957968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study examines the potential of large language models (LLMs) to augment the academic peer review process by reliably evaluating the quality of economics research without introducing systematic bias. We conduct one of the first large-scale experimental assessments of four LLMs (GPT-4o, Claude 3.5, Gemma 3, and LLaMA 3.3) across two complementary experiments. In the first, we use nonparametric binscatter and linear regression techniques to analyze over 29,000 evaluations of 1,220 anonymized papers drawn from 110 economics journals excluded from the training data of current LLMs, along with a set of AI-generated submissions. The results show that LLMs consistently distinguish between higher- and lower-quality research based solely on textual content, producing quality gradients that closely align with established journal prestige measures. Claude and Gemma perform exceptionally well in capturing these gradients, while GPT excels in detecting AI-generated content. The second experiment comprises 8,910 evaluations designed to assess whether LLMs replicate human like biases in single blind reviews. By systematically varying author gender, institutional affiliation, and academic prominence across 330 papers, we find that GPT, Gemma, and LLaMA assign significantly higher ratings to submissions from top male authors and elite institutions relative to the same papers presented anonymously. These results emphasize the importance of excluding author-identifying information when deploying LLMs in editorial screening. Overall, our findings provide compelling evidence and practical guidance for integrating LLMs into peer review to enhance efficiency, improve accuracy, and promote equity in the publication process of economics research.
Related papers
- LLM-REVal: Can We Trust LLM Reviewers Yet? [70.58742663985652]
Large language models (LLMs) have inspired researchers to integrate them extensively into the academic workflow.<n>This study focuses on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness.
arXiv Detail & Related papers (2025-10-14T10:30:20Z) - Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers [4.455306283717651]
The surge in scientific submissions has placed increasing strain on the traditional peer-review process.<n>We propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics.<n>We construct a benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five large language models.
arXiv Detail & Related papers (2025-09-13T19:15:22Z) - Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works.<n>We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems.<n>Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z) - Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation [0.552480439325792]
We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges.<n> o3 exhibited the best problem identification performance among all models at a modest cost.<n>This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications.
arXiv Detail & Related papers (2025-05-28T06:14:30Z) - Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values [13.798198972161657]
A number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, play a critical role in the desirability of outcomes.<n>This paper examines whether large language models (LLMs) adhere to fundamental fairness concepts and investigate their alignment with human preferences.
arXiv Detail & Related papers (2025-02-01T04:24:47Z) - Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.<n>The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.<n>We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z) - Evaluation of Attribution Bias in Retrieval-Augmented Large Language Models [47.694137341509304]
We evaluate the attribution sensitivity and bias with respect to authorship information in large language models.
Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%.
Our findings indicate that metadata of source documents can influence LLMs' trust, and how they attribute their answers.
arXiv Detail & Related papers (2024-10-16T08:55:49Z) - Gender Bias of LLM in Economics: An Existentialism Perspective [1.024113475677323]
This paper investigates gender bias in large language models (LLMs)
LLMs reinforce gender stereotypes even without explicit gender markers.
We argue that bias in LLMs is not an unintended flaw but a systematic result of their rational processing.
arXiv Detail & Related papers (2024-10-14T01:42:01Z) - AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons.
We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assist NLP Researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction [10.702148378522578]
This work presents the first attempt to investigate the degree of gender bias in machine learning models for depression detection.
From our quantitative evaluation, we found that ChatGPT performs the best across various performance metrics.
We have also identified several themes adopted by LLMs to qualitatively evaluate gender fairness.
arXiv Detail & Related papers (2024-06-12T13:14:19Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.