The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems
- URL: http://arxiv.org/abs/2509.08713v1
- Date: Wed, 10 Sep 2025 16:04:24 GMT
- Title: The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems
- Authors: Ziming Luo, Atoosa Kasirzadeh, Nihar B. Shah
- Abstract summary: AI scientist systems are capable of executing the full research workflow from hypothesis generation to paper writing. However, their internal workflows have not been closely examined, and this lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. We identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias.
- Score: 11.543423308064275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflows of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend that journals and conferences evaluating AI-generated research mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research outputs, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and paves the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z) - FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z) - Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper [23.009743151474638]
Jr. AI Scientist is a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher. It generates new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods.
arXiv Detail & Related papers (2025-11-06T17:37:49Z) - BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? [21.78901120638025]
We investigate whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. Despite provably sound aggregation mathematics, integrity checking systematically fails.
arXiv Detail & Related papers (2025-10-20T18:37:11Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - Identity Theft in AI Conference Peer Review [50.18240135317708]
We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations.
arXiv Detail & Related papers (2025-08-06T02:36:52Z) - Towards Improved Research Methodologies for Industrial AI: A case study of false call reduction [0.0]
This work presents a case study on an industrial AI use case called false call reduction for automated optical inspection. We identify seven weaknesses prevalent in related peer-reviewed work and experimentally show their consequences.
arXiv Detail & Related papers (2025-06-17T13:48:38Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research [19.97666809905332]
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists.
arXiv Detail & Related papers (2025-05-17T05:45:16Z) - Autonomous LLM-driven research from data to human-verifiable research papers [0.0]
We build an automation platform that guides interacting LLM agents through a complete stepwise research process. Provided with annotated data alone, the platform raised hypotheses, designed research plans, wrote and interpreted analysis code, and generated and interpreted results. We demonstrate the potential for AI-driven acceleration of scientific discovery while enhancing traceability, transparency, and verifiability.
arXiv Detail & Related papers (2024-04-24T23:15:49Z) - An Exploratory Study of AI System Risk Assessment from the Lens of Data Distribution and Uncertainty [4.99372598361924]
Deep learning (DL) has become a driving force and has been widely adopted in many domains and applications.
This paper initiates an early exploratory study of AI system risk assessment from both the data distribution and uncertainty angles.
arXiv Detail & Related papers (2022-12-13T03:34:25Z) - Bias in Multimodal AI: Testbed for Fair Automatic Recruitment [73.85525896663371]
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases.
Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.
arXiv Detail & Related papers (2020-04-15T15:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.