PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review
- URL: http://arxiv.org/abs/2601.19916v1
- Date: Wed, 07 Jan 2026 04:26:12 GMT
- Title: PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review
- Authors: Songjun Tu, Yiwen Ma, Jiahao Lin, Qichao Zhang, Xiangyuan Lan, Junfeng Li, Nan Xu, Linjing Li, Dongbin Zhao
- Abstract summary: We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
- Score: 54.141490756509306
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
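To make the described two-stage workflow concrete, the following is a minimal, hypothetical sketch of how structured error detection could feed evidence-aware review generation. It is not the authors' implementation: `call_llm`, the prompt wording, and the `ErrorFinding` fields are illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch of a detect-then-review flow as described in the abstract.
# Names (call_llm, ErrorFinding, prompt wording) are illustrative assumptions,
# not the paper's actual implementation or API.
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ErrorFinding:
    section: str    # section where the issue was found
    span: str       # quoted evidence from the paper
    category: str   # e.g. "numerical", "methodological", "cross-section inconsistency"
    rationale: str  # why the span is considered erroneous

def detect_errors(sections: Dict[str, str], call_llm: Callable[[str], str]) -> List[ErrorFinding]:
    """Stage 1: structured error detection.
    Runs a per-section pass, then a cross-section pass over the whole paper."""
    findings: List[ErrorFinding] = []
    # Per-section pass: errors identifiable within a single section.
    for name, text in sections.items():
        prompt = (
            "Identify concrete errors in the following paper section. "
            "Return a JSON list of {span, category, rationale}.\n\n"
            f"Section ({name}):\n{text}"
        )
        for item in json.loads(call_llm(prompt)):
            findings.append(ErrorFinding(section=name, **item))
    # Cross-section pass: inconsistencies that require reasoning across sections.
    full_text = "\n\n".join(f"[{n}]\n{t}" for n, t in sections.items())
    prompt = (
        "Identify inconsistencies that only become visible across sections "
        "(claims vs. results, setup vs. analysis). Return a JSON list of "
        "{section, span, category, rationale}.\n\n" + full_text
    )
    findings += [ErrorFinding(**item) for item in json.loads(call_llm(prompt))]
    return findings

def generate_review(sections: Dict[str, str], findings: List[ErrorFinding],
                    call_llm: Callable[[str], str]) -> str:
    """Stage 2: evidence-aware review generation.
    The detected errors are passed as explicit evidence so the review stays critical."""
    evidence = "\n".join(
        f"- [{f.section}] {f.category}: {f.rationale} (evidence: \"{f.span}\")"
        for f in findings
    )
    prompt = (
        "Write a peer review of the paper below. Ground the weaknesses in the "
        "listed findings and cite their evidence spans; do not soften issues.\n\n"
        f"Detected issues:\n{evidence}\n\nPaper:\n" + "\n\n".join(sections.values())
    )
    return call_llm(prompt)
```

Under the same assumptions, the structured output of the detection stage is also the kind of supervision the abstract mentions for training lightweight detectors via SFT and RL at lower cost.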
Related papers
- AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context [10.769682566098695]
AACR-Bench is a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects.
arXiv Detail & Related papers (2026-01-27T11:28:44Z) - DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM [35.910677096654574]
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. Common practice often selects the top-performing model on standard benchmarks. We introduce DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis.
arXiv Detail & Related papers (2025-12-11T13:16:33Z) - FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers [10.04850395402571]
The identification and localization of errors is a core task in peer review. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks. Despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored.
arXiv Detail & Related papers (2025-11-26T19:19:44Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review [36.05498398665352]
We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process (papers, reviews, and rebuttals) across several quality aspects.
arXiv Detail & Related papers (2025-02-18T03:50:06Z) - Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability. Results show that all tested models struggle to simplify sentences due to models' limitations and characteristics of the source sentences. Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS.
arXiv Detail & Related papers (2024-09-30T12:36:25Z) - Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
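As a rough illustration of the retrieve-detect-substitute flow described in the RFEC abstract above, the sketch below shows one way such a post-editing pass could be organized. It is not RFEC's actual implementation: the token-overlap retriever and the injected `extract_entities` callable are simplified stand-ins for the paper's retriever and entity tagger.

```python
# Illustrative sketch of a retrieve -> detect -> substitute post-editing pass,
# loosely following the RFEC abstract. Retrieval and entity handling are
# deliberately simplified stand-ins, not the paper's actual components.
from typing import Callable, List

def retrieve_evidence(summary_sentence: str, document_sentences: List[str],
                      top_k: int = 3) -> List[str]:
    """Retrieve the document sentences most similar to the summary sentence,
    using simple token overlap as a stand-in for the real retriever."""
    summary_tokens = set(summary_sentence.lower().split())
    scored = sorted(
        document_sentences,
        key=lambda s: len(summary_tokens & set(s.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def correct_entities(summary_sentence: str, evidence_sentences: List[str],
                     extract_entities: Callable[[str], List[str]]) -> str:
    """Replace entities in the summary that are unsupported by the evidence with
    the most similar evidence entity (a very rough character-overlap heuristic)."""
    evidence_entities = {
        ent for sent in evidence_sentences for ent in extract_entities(sent)
    }
    corrected = summary_sentence
    for ent in extract_entities(summary_sentence):
        if ent not in evidence_entities and evidence_entities:
            # Pick the evidence entity sharing the most characters as the substitute.
            best = max(evidence_entities, key=lambda e: len(set(e) & set(ent)))
            corrected = corrected.replace(ent, best)
    return corrected
```

Passing the entity extractor in as a parameter keeps the sketch self-contained without assuming any particular NER library.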
This list is automatically generated from the titles and abstracts of the papers on this site.