Related papers: Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

URL: http://arxiv.org/abs/2505.23824v2
Date: Mon, 07 Jul 2025 17:28:31 GMT
Title: Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Authors: Tianmai M. Zhang, Neil F. Abernethy,
Abstract summary: We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges.<n> o3 exhibited the best problem identification performance among all models at a modest cost.<n>This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications.
Score: 0.552480439325792
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from multiple vendors and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.

Related papers

Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models [1.8349858105838042]
Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as textitpre-review agents.<n>While incredibly beneficial, automating academic peer-review, as a concept, raises concerns surrounding safety, research integrity, and the validity of the academic peer-review process.
arXiv Detail & Related papers (2025-12-14T09:56:07Z)
LLM-REVal: Can We Trust LLM Reviewers Yet? [70.58742663985652]
Large language models (LLMs) have inspired researchers to integrate them extensively into the academic workflow.<n>This study focuses on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness.
arXiv Detail & Related papers (2025-10-14T10:30:20Z)
Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.<n>For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study [24.387202495452886]
This work explores the feasibility of using Large Language Models (LLMs) for academic paper review.<n>The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi-agent system, and Chain-of-Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring.<n>Experiments conducted on 290 submissions from the WASA 2024 conference using GPT-4o show that LLM-based review significantly reduces review time (average 2.48 hours) and cost (average $104.28 USD)
arXiv Detail & Related papers (2025-06-18T10:19:18Z)
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review [6.20631177269082]
A new risk to the peer review process is that negligent reviewers will rely on large language models (LLMs) to review a paper.<n>We introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews.<n>We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs.
arXiv Detail & Related papers (2025-02-26T23:04:05Z)
ReviewEval: An Evaluation Framework for AI-Generated Reviews [9.35023998408983]
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review.<n>We propose ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, identifies degree of constructiveness and adherence to reviewer guidelines.<n>This paper establishes essential metrics for AIbased peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
arXiv Detail & Related papers (2025-02-17T12:22:11Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.<n>The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.<n>We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
Streamlining the review process: AI-generated annotations in research manuscripts [0.5735035463793009]
This study explores the potential of integrating Large Language Models (LLMs) into the peer-review process to enhance efficiency without compromising effectiveness.<n>We focus on manuscript annotations, particularly excerpt highlights, as a potential area for AI-human collaboration.<n>This paper introduces AnnotateGPT, a platform that utilizes GPT-4 for manuscript review, aiming to improve reviewers' comprehension and focus.
arXiv Detail & Related papers (2024-11-29T23:26:34Z)
AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons. We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks. This study focuses on the topic of LLMs assist NLP Researchers. To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [55.33653554387953]
Pattern Analysis and Machine Intelligence (PAMI) has led to numerous literature reviews aimed at collecting and fragmented information.<n>This paper presents a thorough analysis of these literature reviews within the PAMI field.<n>We try to address three core research questions: (1) What are the prevalent structural and statistical characteristics of PAMI literature reviews; (2) What strategies can researchers employ to efficiently navigate the growing corpus of reviews; and (3) What are the advantages and limitations of AI-generated reviews compared to human-authored ones.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs. We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks. A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output. This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z)
Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
Ranking Scientific Papers Using Preference Learning [48.78161994501516]
We cast it as a paper ranking problem based on peer review texts and reviewer scores. We introduce a novel, multi-faceted generic evaluation framework for making final decisions based on peer reviews.
arXiv Detail & Related papers (2021-09-02T19:41:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.