Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease
- URL: http://arxiv.org/abs/2505.23852v1
- Date: Thu, 29 May 2025 01:31:55 GMT
- Title: Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease
- Authors: Nic Dobbins, Christelle Xiong, Kristine Lan, Meliha Yetisgen
- Abstract summary: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center. We identified highly cited published research manuscripts using NACC data. We created a simulated research team of LLM-based autonomous agents tasked with writing and executing code.
- Score: 1.9938547353667109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT-4o, we created a simulated research team of LLM-based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer's studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range-based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM-based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent-based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.
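The abstract does not include the underlying implementation, but the described setup (GPT-4o agents that write and execute analysis code given only an Abstract, a Methods section, and a data dictionary) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the prompt wording, the single-agent simplification, and the `run_sandboxed` executor are hypothetical, not the paper's code.

```python
# Minimal single-agent sketch of the reproduction loop described in the
# abstract (hypothetical; the paper's multi-agent implementation is not
# published here). Assumes the OpenAI Python client; `run_sandboxed` is a
# placeholder for an isolated code-execution environment.
from openai import OpenAI

client = OpenAI()

def write_analysis_code(abstract: str, methods: str, data_dictionary: str) -> str:
    """Ask the model to draft Python that recomputes the study's key findings."""
    prompt = (
        "You are a biostatistician reproducing a published study.\n\n"
        f"Abstract:\n{abstract}\n\nMethods:\n{methods}\n\n"
        f"Data dictionary for the NACC 'Quick Access' dataset:\n{data_dictionary}\n\n"
        "Write a complete Python script that loads the dataset and computes "
        "each key finding reported in the Abstract."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# code = write_analysis_code(abstract, methods, data_dictionary)
# results = run_sandboxed(code)  # execute, then compare against reported findings
```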
Related papers
- AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research [34.173947968362675]
AblationBench is a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section, and ReviewerAblation, which helps reviewers find missing ablations in a full paper. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework.
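The summary names LM-based judges as the evaluation mechanism. As a rough illustration only (the benchmark's actual judge prompts and scoring rubric are not given here), such a judge might look like:

```python
# Hypothetical LM-based judge for ablation plans, in the spirit of
# AblationBench; prompt and output format are assumptions.
from openai import OpenAI

client = OpenAI()

def judge_ablation_plan(method_section: str, proposed_ablations: list[str]) -> str:
    prompt = (
        "Given the method section below, rate each proposed ablation as "
        "RELEVANT or IRRELEVANT, and name any obviously missing ablation.\n\n"
        f"Method section:\n{method_section}\n\nProposed ablations:\n"
        + "\n".join(f"- {a}" for a in proposed_ablations)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```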
arXiv Detail & Related papers (2025-07-09T12:07:38Z)
- A Survey of AIOps in the Era of Large Language Models [60.59720351854515]
We analyzed 183 research papers published between January 2020 and December 2024 to answer four key research questions (RQs). We discuss the state-of-the-art advancements and trends, identify gaps in existing research, and propose promising directions for future exploration.
arXiv Detail & Related papers (2025-06-23T02:40:16Z)
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
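The component-extraction step could plausibly be prompted as below. This is an illustrative sketch, not ResearchBench's actual pipeline, and the JSON schema is an assumption:

```python
# Illustrative extraction of the components named in the summary; the
# benchmark's real prompts and schema are not shown in this abstract.
import json
from openai import OpenAI

client = OpenAI()

def extract_components(paper_text: str) -> dict:
    prompt = (
        "From the paper below, return a JSON object with keys "
        "'research_question', 'background_survey', 'inspirations', and "
        "'hypothesis'.\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```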
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, and it substantially improves performance on question-answering tasks compared to pre-trained models from the life sciences domain.
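One plausible reading of the generate-and-self-evaluate step is sketched below; the prompts, model, and accept/reject rule are assumptions rather than the paper's pipeline:

```python
# Hypothetical generate-then-self-evaluate distillation step; not the
# paper's actual prompts or filtering criteria.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def distill_qa(passage: str) -> str | None:
    qa = ask("Write one biomedical question-answer pair grounded in this "
             "passage:\n" + passage)
    verdict = ask("Answer YES or NO: is the following Q&A pair faithful to "
                  f"the passage and specific to biomedicine?\nPassage:\n{passage}"
                  f"\n\nQ&A:\n{qa}")
    return qa if verdict.strip().upper().startswith("YES") else None
```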
arXiv Detail & Related papers (2025-01-25T07:20:44Z)
- AD-LLM: Benchmarking Large Language Models for Anomaly Detection [50.57641458208208]
This paper introduces AD-LLM, the first benchmark that evaluates how large language models can help with anomaly detection. We examine three key tasks: zero-shot detection, using LLMs' pre-trained knowledge to perform AD without task-specific training; data augmentation, generating synthetic data and category descriptions to improve AD models; and model selection, using LLMs to suggest unsupervised AD models.
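Of the three tasks, zero-shot detection is the simplest to picture. A minimal sketch, assuming a plain prompt-and-parse protocol (AD-LLM's exact prompts are not reproduced in this summary):

```python
# Zero-shot anomaly detection via prompting; the prompt wording is an
# assumption, not AD-LLM's protocol.
from openai import OpenAI

client = OpenAI()

def zero_shot_anomaly(sample_text: str, normal_description: str) -> bool:
    prompt = (
        f"Normal samples in this dataset look like: {normal_description}\n"
        f"Sample: {sample_text}\n"
        "Answer with exactly one word: NORMAL or ANOMALY."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return "ANOMALY" in response.choices[0].message.content.upper()
```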
arXiv Detail & Related papers (2024-12-15T10:22:14Z)
- Enhancing Spectral Knowledge Interrogation: A Reliable Retrieval-Augmented Generative Framework on Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks within the general domain. We introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, the first open-source textual knowledge dataset for spectral analysis and detection. We also designed an automated Q&A framework based on the SDAAP dataset, which can retrieve relevant knowledge and generate high-quality responses.
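A retrieve-then-generate loop of the kind the summary describes can be sketched as follows; the embedding model, ranking, and prompt are assumptions, not SDAAP's published framework:

```python
# Minimal retrieval-augmented Q&A sketch; embedding model and prompts are
# assumptions. OpenAI embeddings are unit-normalized, so the dot product
# ranks documents by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data])

def answer(question: str, corpus: list[str], k: int = 3) -> str:
    doc_vecs = embed(corpus)
    q_vec = embed([question])[0]
    top = np.argsort(doc_vecs @ q_vec)[-k:]  # indices of the k best matches
    context = "\n\n".join(corpus[i] for i in top)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Using this spectral-analysis context:\n{context}\n\n"
                   f"Answer the question: {question}"}],
    )
    return response.choices[0].message.content
```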
arXiv Detail & Related papers (2024-08-21T12:09:37Z)
- When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from the last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
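The feature-extraction setup the summary describes (last hidden state of an instruction-tuned LLM over serialized numeric records) is concrete enough to sketch. The model choice and record serialization below are assumptions, not the paper's exact configuration:

```python
# Mean-pooled last-hidden-state embedding of a serialized numeric record;
# the model and "key=value" serialization are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # stand-in instruction-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed_record(record: dict) -> torch.Tensor:
    text = "; ".join(f"{k}={v}" for k, v in record.items())  # serialize vitals
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # one vector per record

# features = torch.stack([embed_record(r) for r in records])
# ...then fit a classical model (e.g. logistic regression) on the features,
# and compare against the same model trained on the raw numeric columns.
```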
arXiv Detail & Related papers (2024-08-15T03:56:40Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Overall, results indicated an accuracy of around 80%, with some variability between domains.
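As an illustration of what (semi)automated extraction could look like, one plausible setup is sketched below; the field names and prompt are assumptions, since the study's actual extraction domains are not listed in this summary:

```python
# Hypothetical structured extraction from a trial report; fields and prompt
# are illustrative, not the study's protocol.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["population", "intervention", "comparator", "outcomes", "sample_size"]

def extract(trial_report: str) -> dict:
    prompt = (
        "Extract these fields from the clinical trial report as JSON "
        f"({', '.join(FIELDS)}); use null for anything not reported.\n\n"
        + trial_report
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # JSON mode requires a turbo-era GPT-4 model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```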
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
- ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
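The iterative refinement the summary mentions is essentially a propose-critique-revise loop. A minimal sketch, with the prompts and stopping rule as assumptions:

```python
# Hypothetical propose-critique-revise loop; not ResearchAgent's actual
# agents, reviewer models, or stopping criteria.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def refine_idea(literature: str, rounds: int = 3) -> str:
    idea = ask("Based on this literature, propose a novel research problem, "
               "method, and experiment design:\n" + literature)
    for _ in range(rounds):
        critique = ask("Critique this research idea for novelty, clarity, and "
                       "feasibility:\n" + idea)
        idea = ask(f"Revise the idea to address the critique.\nIdea:\n{idea}"
                   f"\n\nCritique:\n{critique}")
    return idea
```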
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
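The code-generation half of such an agentic plotting pipeline can be sketched as below; the prompt is an assumption, and MatPlotAgent's visual-feedback step is only indicated in comments:

```python
# Illustrative code-generation step for an agentic plotting pipeline; not
# MatPlotAgent's actual prompts or feedback mechanism.
from openai import OpenAI

client = OpenAI()

def generate_plot_script(instruction: str, data_preview: str) -> str:
    prompt = (
        "Write a self-contained matplotlib script satisfying this request:\n"
        f"{instruction}\n\nFirst rows of the data:\n{data_preview}\n"
        "Save the figure to 'out.png'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# A full agent would run the script in a sandbox, render 'out.png' back to a
# vision-capable model, and ask for revisions (the visual feedback loop).
```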
arXiv Detail & Related papers (2024-02-18T04:28:28Z)