Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models
- URL: http://arxiv.org/abs/2512.22145v1
- Date: Sun, 14 Dec 2025 09:56:07 GMT
- Title: Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models
- Authors: Akhil Pandey Akella, Harish Varma Siravuri, Shaurya Rohatgi,
- Abstract summary: Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as \textit{pre-review} agents. While incredibly beneficial, automating academic peer review, as a concept, raises concerns surrounding safety, research integrity, and the validity of the academic peer-review process.
- Score: 1.8349858105838042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as \textit{pre-review} agents, if not as fully autonomous \textit{peer-review} agents. While incredibly beneficial, automating academic peer review raises concerns surrounding safety, research integrity, and the validity of the peer-review process. Most studies that systematically evaluate frontier LLMs as review generators across scientific disciplines fail to address the alignment (or misalignment) of those reviews, as well as the utility of LLM-generated reviews when compared against publication outcomes such as \textbf{Citations}, \textbf{Hit-papers}, \textbf{Novelty}, and \textbf{Disruption}. This paper presents an experimental study in which we gathered ground-truth reviewer ratings from OpenReview and used various frontier open-weight LLMs to generate reviews of the same papers, in order to gauge the safety and reliability of incorporating LLMs into the scientific review pipeline. Our findings demonstrate the utility of frontier open-weight LLMs as pre-review screening agents while highlighting fundamental misalignment risks when they are deployed as autonomous reviewers. Our results show that all models exhibit weak correlation with human peer reviewers (0.15), with a systematic overestimation bias of 3-5 points and uniformly high confidence scores (8.0-9.0/10) despite these prediction errors. However, we also observed that LLM reviews correlate more strongly with post-publication metrics than with human scores, suggesting potential utility as pre-review screening tools. Our findings highlight both the potential and the pitfalls of automating peer review with language models. We open-sourced our dataset $D_{LMRSD}$ to help the research community expand the safety framework for automating scientific reviews.
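The alignment statistics quoted in the abstract (rank correlation with human reviewers, overestimation bias, self-reported confidence) are straightforward to compute from paired scores. Below is a minimal, hypothetical sketch of that computation; the arrays are made-up illustrations, not values from $D_{LMRSD}$.

```python
# Hypothetical sketch: computing the alignment statistics described in the
# abstract (rank correlation, overestimation bias, mean confidence) from
# paired review scores. The data below is invented for illustration only.
import numpy as np
from scipy.stats import spearmanr

human_scores = np.array([4.0, 5.5, 3.0, 6.5, 5.0])  # OpenReview reviewer ratings (1-10)
llm_scores   = np.array([8.0, 9.0, 7.5, 9.5, 8.5])  # LLM-generated ratings (1-10)
llm_conf     = np.array([9.0, 8.5, 9.0, 8.0, 9.0])  # LLM self-reported confidence (1-10)

rho, p = spearmanr(human_scores, llm_scores)         # rank correlation with humans
bias = np.mean(llm_scores - human_scores)            # systematic overestimation

print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
print(f"Mean overestimation bias = {bias:.1f} points")
print(f"Mean self-reported confidence = {llm_conf.mean():.1f}/10")
```

The same recipe, swapping citation counts or disruption scores in for the human ratings, yields the correlations with post-publication metrics that the abstract contrasts against human agreement.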
Related papers
- BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? [21.78901120638025]
We investigate whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. Despite provably sound aggregation mathematics, integrity checking systematically fails.
arXiv Detail & Related papers (2025-10-20T18:37:11Z)
- ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation [3.9199635838637072]
ReviewGuard is an automated system for detecting and categorizing deficient reviews. It produces a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews. Deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment.
arXiv Detail & Related papers (2025-10-18T15:45:26Z)
- LLM-REVal: Can We Trust LLM Reviewers Yet? [70.58742663985652]
Large language models (LLMs) have inspired researchers to integrate them extensively into the academic workflow. This study focuses on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness.
arXiv Detail & Related papers (2025-10-14T10:30:20Z)
- LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation [66.09346158850308]
We present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. We evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation.
arXiv Detail & Related papers (2025-10-01T12:14:28Z)
- When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review [34.067892820832405]
This paper presents a systematic evaluation of large language models (LLMs) as academic reviewers. Using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, we evaluate GPT-5-mini against human reviewers across ratings, strengths, and weaknesses. Our findings show that LLMs consistently inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions.
arXiv Detail & Related papers (2025-09-12T00:57:50Z)
- Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z)
- LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
- Automatic Evaluation Metrics for Artificially Generated Scientific Research [3.9845810840390743]
We investigate two automatic evaluation metrics, specifically citation count prediction and review score prediction. Our findings reveal that citation count prediction is more viable than review score prediction, and that predicting scores is more difficult purely from the research hypothesis than from the full paper.
arXiv Detail & Related papers (2025-02-14T14:56:14Z)
- Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review. The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of pairwise human-preference comparisons.
We fine-tune an LLM to predict which review humans will prefer in a head-to-head battle between LLMs (a minimal pairwise-preference sketch appears after this list).
We make reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service that helps authors review, revise, and improve their research papers.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers in critiquing and (meta-)reviewing papers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
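For the arena-style evaluation described in the AI-Driven Review Systems entry above, pairwise human preferences between reviewers' outputs are typically converted into scalar quality scores. The sketch below fits a Bradley-Terry model by gradient ascent; the model names and battle outcomes are hypothetical, and this illustrates the general technique rather than that paper's code.

```python
# Hypothetical sketch: Bradley-Terry scoring of review generators from
# pairwise human preferences (arena-style head-to-head comparisons).
# All names and outcomes below are invented for illustration.
import numpy as np

models = ["llm_a", "llm_b", "llm_c"]
# (winner_index, loser_index) outcomes from head-to-head review battles
battles = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]

scores = np.zeros(len(models))        # one latent quality score per model
lr = 0.1
for _ in range(500):                  # gradient ascent on the BT log-likelihood
    grad = np.zeros_like(scores)
    for w, l in battles:
        # P(winner beats loser) under the current scores (logistic in the gap)
        p_win = 1.0 / (1.0 + np.exp(scores[l] - scores[w]))
        grad[w] += 1.0 - p_win        # d log P(w beats l) / d scores[w]
        grad[l] -= 1.0 - p_win
    scores += lr * grad
    scores -= scores.mean()           # fix the gauge: scores are shift-invariant

for m, s in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{m}: {s:+.2f}")
```

Because Bradley-Terry scores are only identified up to an additive constant, the mean-centering step pins them down; only score differences between models carry meaning.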