ReviewScore: Misinformed Peer Review Detection with Large Language Models
- URL: http://arxiv.org/abs/2509.21679v1
- Date: Thu, 25 Sep 2025 22:55:05 GMT
- Title: ReviewScore: Misinformed Peer Review Detection with Large Language Models
- Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
- Abstract summary: We show that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We also prove that evaluating premise-level factuality shows significantly higher agreement than evaluating weakness-level factuality.
- Score: 38.92827930465428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that are already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also prove that evaluating premise-level factuality shows significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
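The premise-level evaluation described in the abstract could be approximated with a two-step LLM pipeline: first decompose a weakness into its explicit and implicit premises, then judge each premise against the paper. The sketch below is an illustration only; the function names (`call_llm`, `decompose_weakness`, `judge_premises`, `is_misinformed`) and prompts are hypothetical and do not reproduce the authors' actual engine.

```python
# Hypothetical sketch of premise-level, ReviewScore-style evaluation.
# All prompts and names are illustrative assumptions, not the paper's engine.
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hosted API or local model)."""
    raise NotImplementedError("wire this to your LLM provider")


@dataclass
class PremiseJudgment:
    premise: str
    factual: bool


def decompose_weakness(weakness: str) -> list[str]:
    """Ask the model to list every explicit and implicit premise behind a weakness."""
    prompt = (
        "List each explicit or implicit premise (one per line) that the following "
        f"review weakness relies on.\n\nWeakness: {weakness}"
    )
    return [p.strip("- ").strip() for p in call_llm(prompt).splitlines() if p.strip()]


def judge_premises(premises: list[str], paper_text: str) -> list[PremiseJudgment]:
    """Check each premise against the paper text with a yes/no factuality prompt."""
    judgments = []
    for premise in premises:
        prompt = (
            "Given the paper below, answer YES if the premise is supported by the "
            f"paper and NO otherwise.\n\nPaper:\n{paper_text}\n\nPremise: {premise}"
        )
        answer = call_llm(prompt).strip().upper()
        judgments.append(PremiseJudgment(premise, factual=answer.startswith("YES")))
    return judgments


def is_misinformed(weakness: str, paper_text: str) -> bool:
    """A weakness counts as misinformed if any of its premises is judged false."""
    premises = decompose_weakness(weakness)
    return any(not j.factual for j in judge_premises(premises, paper_text))
```

Under this sketch, aggregating `is_misinformed` over all weaknesses (and an analogous check for whether questions are already answered by the paper) would yield a ReviewScore-like signal per review; the paper's reported human-model agreement suggests premise-level checks of this kind are more reliable than judging an entire weakness at once.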
Related papers
- Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time [55.756345497678204]
We introduce a new framework for evidence-based comparative study of review quality. We apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We study the relationships between measurements of review quality, and its evolution over time.
arXiv Detail & Related papers (2026-01-21T16:48:29Z) - ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation [3.9199635838637072]
ReviewGuard is an automated system for detecting and categorizing deficient reviews. It produces a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews. Deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment.
arXiv Detail & Related papers (2025-10-18T15:45:26Z) - The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors [45.98233565214142]
We identify four key aspects of review comments that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. We benchmark fine-tuned models for assessing review comments on these aspects and generating rationales.
arXiv Detail & Related papers (2025-08-31T14:19:07Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025 [115.86204862475864]
Review Feedback Agent provides automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. It was implemented at ICLR 2025 as a large randomized control study. 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers.
arXiv Detail & Related papers (2025-04-13T22:01:25Z) - ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews [24.566487721847597]
Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. We propose ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews.
arXiv Detail & Related papers (2025-03-11T14:56:58Z) - Paper Quality Assessment based on Individual Wisdom Metrics from Open Peer Review [4.35783648216893]
Traditional closed peer review systems are slow, costly, non-transparent, and possibly subject to biases. We propose and examine the efficacy and accuracy of an alternative form of scientific peer review: an open, bottom-up process.
arXiv Detail & Related papers (2025-01-22T17:00:27Z) - Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review. The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z) - When Reviewers Lock Horn: Finding Disagreement in Scientific Peer Reviews [24.875901048855077]
We introduce a novel task of automatically identifying contradictions among reviewers on a given article.
To the best of our knowledge, we make the first attempt to identify disagreements among peer reviewers automatically.
arXiv Detail & Related papers (2023-10-28T11:57:51Z)