Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
- URL: http://arxiv.org/abs/2508.21422v1
- Date: Fri, 29 Aug 2025 08:48:00 GMT
- Title: Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
- Authors: Nils Dycke, Iryna Gurevych
- Abstract summary: We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
- Score: 55.078301794183496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
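To make the counterfactual setup concrete, below is a minimal sketch (not the authors' released framework) of how such an evaluation loop could be wired up: each paper is paired with a counterfactual version containing an injected logic flaw, an ARG reviews both, and the average difference in review severity is measured. The names `PaperPair`, `counterfactual_gap`, `generate_review`, and `score_severity` are hypothetical placeholders for an LLM-backed reviewer and a review-scoring metric, not APIs from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class PaperPair:
    original: str        # full text of the unmodified paper
    counterfactual: str  # the same paper with an injected research-logic flaw


def counterfactual_gap(
    pairs: List[PaperPair],
    generate_review: Callable[[str], str],   # hypothetical ARG call, e.g. a prompted LLM
    score_severity: Callable[[str], float],  # hypothetical metric: how critical is a review?
) -> float:
    """Average change in review severity when the logic flaw is injected.

    A reviewer that actually detects faulty research logic should criticise the
    counterfactual version more harshly, yielding a clearly positive gap; a gap
    near zero reproduces the failure mode reported in the paper.
    """
    gaps = []
    for pair in pairs:
        severity_original = score_severity(generate_review(pair.original))
        severity_flawed = score_severity(generate_review(pair.counterfactual))
        gaps.append(severity_flawed - severity_original)
    return mean(gaps)
```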
Related papers
- DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z) - The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and paves the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z) - FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review [54.141490756509306]
We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
arXiv Detail & Related papers (2026-01-07T04:26:12Z) - BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? [21.78901120638025]
We investigate whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. Despite provably sound aggregation mathematics, integrity checking systematically fails.
arXiv Detail & Related papers (2025-10-20T18:37:11Z) - AI and the Future of Academic Peer Review [0.1622854284766506]
Large language models (LLMs) are being piloted across the peer-review pipeline by journals, funders, and individual reviewers. Early studies suggest that AI assistance can produce reviews comparable in quality to those written by humans. We show that supervised LLM assistance can improve error detection and timeliness and reduce reviewer workload without displacing human judgment.
arXiv Detail & Related papers (2025-09-17T17:27:12Z) - Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review [36.05498398665352]
We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process (papers, reviews, and rebuttals) across several quality aspects.
arXiv Detail & Related papers (2025-02-18T03:50:06Z) - ReviewEval: An Evaluation Framework for AI-Generated Reviews [9.35023998408983]
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. We propose ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and identifies the degree of constructiveness and adherence to reviewer guidelines. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
arXiv Detail & Related papers (2025-02-17T12:22:11Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantics-preserving but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z) - Automated scholarly paper review: Concepts, technologies, and challenges [5.431798850623952]
Recent years have seen the application of artificial intelligence (AI) in assisting the peer review process.
With the involvement of humans, such limitations remain inevitable.
arXiv Detail & Related papers (2021-11-15T04:44:57Z) - Generating Summaries for Scientific Paper Review [29.12631698162247]
The increase in submissions to top venues in machine learning and NLP has placed an excessive burden on reviewers.
An automatic system for assisting with the reviewing process could help ameliorate the problem.
In this paper, we explore automatic review summary generation for scientific papers.
arXiv Detail & Related papers (2021-09-28T21:43:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.