When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
- URL: http://arxiv.org/abs/2505.11855v1
- Date: Sat, 17 May 2025 05:45:16 GMT
- Title: When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
- Authors: Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman
- Abstract summary: Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists.
- Score: 19.97666809905332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
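As a rough illustration of how the recall and precision figures above could be computed (not the authors' evaluation code), the sketch below scores one paper's predicted errors against its annotated errors; the judge function and error strings are hypothetical placeholders.

from typing import Callable, List

def score_paper(predicted: List[str], gold: List[str],
                matches: Callable[[str, str], bool]):
    """Return (recall, precision) for one paper's predicted error list."""
    # Recall: fraction of annotated errors matched by at least one prediction.
    found = [g for g in gold if any(matches(p, g) for p in predicted)]
    # Precision: fraction of predictions that match some annotated error.
    correct = [p for p in predicted if any(matches(p, g) for g in gold)]
    recall = len(found) / len(gold) if gold else 0.0
    precision = len(correct) / len(predicted) if predicted else 0.0
    return recall, precision

# Toy judge: substring match (a real setup would rely on expert or LLM judgment).
naive = lambda pred, g: g.lower() in pred.lower()
print(score_paper(["The paper uses inconsistent units in Eq. 3"],
                  ["inconsistent units"], naive))  # (1.0, 1.0)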
Related papers
- SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
- FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers [10.04850395402571]
The identification and localization of errors is a core task in peer review. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks. Despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored.
arXiv Detail & Related papers (2025-11-26T19:19:44Z)
- BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? [21.78901120638025]
We investigate whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. Despite provably sound aggregation mathematics, integrity checking systematically fails.
arXiv Detail & Related papers (2025-10-20T18:37:11Z)
- Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis [15.98124151893659]
Large language models (LLMs) offer potential for automating methodological assessments. We benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles.
arXiv Detail & Related papers (2025-10-12T19:04:22Z)
- ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding [49.67493845115009]
ELAIPBench is a benchmark curated by domain experts to evaluate large language models' comprehension of AI research papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance.
arXiv Detail & Related papers (2025-10-12T11:11:20Z)
- Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration [2.105564340986074]
Sci-SpanDet is a structure-aware framework for detecting AI-generated scholarly texts. It combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences. It achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36.
arXiv Detail & Related papers (2025-10-01T13:35:14Z)
- LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation [66.09346158850308]
We present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. We evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation.
arXiv Detail & Related papers (2025-10-01T12:14:28Z)
- The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems [11.543423308064275]
AI scientist systems are capable of executing the full research workflow from hypothesis generation to paper writing with little human oversight along the way. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. We identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias.
arXiv Detail & Related papers (2025-09-10T16:04:24Z)
- Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation [0.552480439325792]
We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications.
arXiv Detail & Related papers (2025-05-28T06:14:30Z)
- Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base [30.705524808195268]
Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge. We propose error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs. SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher.
arXiv Detail & Related papers (2025-03-30T08:33:56Z)
- Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review [6.20631177269082]
We introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews. We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs.
arXiv Detail & Related papers (2025-02-26T23:04:05Z)
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review. The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
- Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation-theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z)
- Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Enhancing Robustness of LLM-Synthetic Text Detectors for Academic Writing: A Comprehensive Analysis [35.351782110161025]
Large language models (LLMs) offer numerous advantages for revolutionizing work and study methods.
They have also garnered significant attention due to their potential negative consequences.
One example is generating academic reports or papers with little to no human contribution.
arXiv Detail & Related papers (2024-01-16T01:58:36Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
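As a rough, hypothetical sketch of this style of prompting (not the paper's actual template or scoring script), one might ask a model for MQM-style error annotations and then derive a score from the error counts.

# Hypothetical AutoMQM-style prompt builder and scoring helper (illustrative only).
def automqm_prompt(source: str, translation: str) -> str:
    return (
        "You are an expert translation annotator.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "List each translation error as: <span> | <category> | <severity: major/minor>.\n"
        "Write 'No errors' if the translation is error-free."
    )

def mqm_score(n_major: int, n_minor: int) -> int:
    # MQM-style weighting commonly used in MT evaluation: major = 5, minor = 1.
    return -(5 * n_major + n_minor)

print(automqm_prompt("Der Hund schläft.", "The dog is sleeping."))
print(mqm_score(n_major=0, n_minor=1))  # -1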
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- You Are the Best Reviewer of Your Own Papers: The Isotonic Mechanism [1.7741566627076264]
We introduce the Isotonic Mechanism to enhance the accuracy of noisy review scores. Authors with multiple submissions are required to rank their papers in descending order of perceived quality. The adjusted scores are shown to be more accurate than the raw scores.
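A minimal sketch of the underlying idea (not the authors' implementation): the adjustment can be viewed as an isotonic regression of the noisy scores onto the author-provided ranking, so that adjusted scores respect the declared order. The scores and ranking below are made-up examples.

# Toy Isotonic Mechanism sketch: project noisy review scores onto the monotone
# order implied by the author's own ranking (1 = best). Numbers are hypothetical.
from sklearn.isotonic import IsotonicRegression

raw_scores = [6.0, 7.5, 5.0, 6.5]           # review scores, one per submitted paper
author_rank = [2, 1, 4, 3]                  # the author's claimed quality ranking

iso = IsotonicRegression(increasing=False)  # scores must not increase as rank worsens
adjusted = iso.fit_transform(author_rank, raw_scores)
print(adjusted)  # [6.25, 7.5, 5.0, 6.25]: the out-of-order pair is pooled to its mean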
arXiv Detail & Related papers (2022-06-14T14:35:53Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)