Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
- URL: http://arxiv.org/abs/2502.16069v2
- Date: Wed, 26 Feb 2025 02:33:28 GMT
- Title: Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
- Authors: Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen
- Abstract summary: We propose Curie, an AI framework designed to embed rigor into the experimentation process. Curie includes an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions.
- Score: 21.001278669360346
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
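The abstract only names Curie's three components; the actual implementation lives in the linked repository. The sketch below is a hypothetical Python illustration (class names, checks, and the driver loop are assumptions, not Curie's code) of how an intra-agent rigor check, an inter-agent controller, and an experiment knowledge store could be composed around a worker agent.

```python
# Hypothetical sketch of a three-module experimentation loop; names and
# structure are illustrative assumptions, not Curie's actual code.
from dataclasses import dataclass, field


@dataclass
class ExperimentKnowledge:
    """Experiment knowledge module: records every step for interpretability."""
    log: list = field(default_factory=list)

    def record(self, step: str, result: str) -> None:
        self.log.append({"step": step, "result": result})

    def report(self) -> str:
        return "\n".join(f"{e['step']}: {e['result']}" for e in self.log)


class IntraAgentRigor:
    """Intra-agent rigor module: validates a single agent's action before it runs."""
    def validate(self, plan: str) -> bool:
        # Placeholder check; a real module might verify setup, seeds, controls, etc.
        return bool(plan.strip())


class InterAgentRigor:
    """Inter-agent rigor module: enforces methodical control across agents."""
    def approve(self, proposed_step: str, history: ExperimentKnowledge) -> bool:
        # Placeholder: reject steps that verbatim repeat earlier ones.
        return proposed_step not in (e["step"] for e in history.log)


def run_experiment(question: str, worker, max_steps: int = 5) -> str:
    """Drive a worker agent (e.g., an LLM call) under both rigor modules."""
    knowledge = ExperimentKnowledge()
    intra, inter = IntraAgentRigor(), InterAgentRigor()
    for _ in range(max_steps):
        step = worker(question, knowledge.report())   # propose the next step
        if not intra.validate(step) or not inter.approve(step, knowledge):
            continue                                  # skip empty or duplicate steps
        knowledge.record(step, result=f"executed: {step}")
    return knowledge.report()


# Example usage with a trivial stand-in worker (an LLM would go here);
# the empty and duplicate steps are filtered out by the rigor modules.
steps = iter(["set up control", "vary cache size", "measure latency", "", "measure latency"])
print(run_experiment("Does caching reduce latency?", lambda q, hist: next(steps)))
```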
Related papers
- MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. We evaluate a number of frontier large language models (LLMs), such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro, on our benchmarks.
arXiv Detail & Related papers (2025-02-20T12:28:23Z)
- Autonomous Microscopy Experiments through Large Language Model Agents [4.241267255764773]
Large language models (LLMs) have accelerated the development of self-driving laboratories (SDLs) for materials research.
Here, we introduce AILA (Artificially Intelligent Lab Assistant), a framework that automates atomic force microscopy (AFM) through LLM-driven agents.
Our systematic assessment shows that state-of-the-art language models struggle even with basic tasks such as documentation retrieval.
arXiv Detail & Related papers (2024-12-18T09:35:28Z)
- AutoSciLab: A Self-Driving Laboratory For Interpretable Scientific Discovery [1.1740681158785793]
AutoSciLab is a machine learning framework for driving autonomous scientific experiments.
It forms a surrogate researcher purposed for scientific discovery in high-dimensional spaces.
Applying our framework to an open-ended nanophotonics challenge, AutoSciLab uncovers a fundamentally novel method for directing incoherent light emission.
arXiv Detail & Related papers (2024-12-16T20:41:46Z)
- Agents for self-driving laboratories applied to quantum computing [2.840384720502993]
This paper introduces the k-agents framework, designed to support experimentalists in organizing laboratory knowledge and automating experiments with agents. Our framework employs large language model-based agents to encapsulate laboratory knowledge, including available laboratory operations and methods for analyzing experiment results. To automate experiments, we introduce execution agents that break multi-step experimental procedures into state machines, interact with other agents to execute each step, and analyze the experiment results.
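The summary describes execution agents that compile multi-step procedures into state machines. As a rough, hypothetical sketch of that idea (state names, handlers, and the driver loop are assumptions, not the k-agents API), each state performs one experimental step and names the state that follows:

```python
# Minimal state-machine sketch for an execution agent; states, transitions,
# and handlers are illustrative assumptions, not the k-agents framework's API.
from typing import Callable, Dict


class ExperimentStateMachine:
    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[dict], str]] = {}
        self.state = "calibrate"

    def add_state(self, name: str, handler: Callable[[dict], str]) -> None:
        self.handlers[name] = handler

    def run(self, context: dict) -> dict:
        while self.state != "done":
            # Each handler performs one experimental step (possibly by calling
            # other agents) and returns the name of the next state.
            self.state = self.handlers[self.state](context)
        return context


def calibrate(ctx: dict) -> str:
    ctx["calibration"] = "ok"          # placeholder for a real calibration routine
    return "measure"


def measure(ctx: dict) -> str:
    ctx["data"] = [0.1, 0.2, 0.15]     # placeholder measurement results
    return "analyze"


def analyze(ctx: dict) -> str:
    ctx["mean"] = sum(ctx["data"]) / len(ctx["data"])
    return "done"


machine = ExperimentStateMachine()
for name, fn in [("calibrate", calibrate), ("measure", measure), ("analyze", analyze)]:
    machine.add_state(name, fn)
print(machine.run({}))
```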
arXiv Detail & Related papers (2024-12-10T23:30:44Z)
- Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research.
VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas.
We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z)
- ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [23.773528748933934]
We extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
We unify the target output for every task to a self-contained Python program file.
We propose two effective strategies to mitigate data contamination concerns.
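The summary above notes that every task's target output is unified to a self-contained Python program file. Purely as an illustration of that format (the dataset path, column name, and analysis are invented, not an actual benchmark task), such a program might load its input, run the analysis, and write a single output with no external orchestration:

```python
# Hypothetical example of a self-contained task program of the kind the
# benchmark standardizes on; the file paths and analysis are assumptions.
import csv
import statistics


def main() -> None:
    # Load the task's input data (path is illustrative).
    with open("input/measurements.csv", newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]

    # Perform the analysis the task asks for.
    summary = {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}

    # Write the result to a single, self-contained output file.
    with open("output/summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)


if __name__ == "__main__":
    main()
```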
arXiv Detail & Related papers (2024-10-07T14:33:50Z)
- DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- Uncertainty Quantification 360: A Holistic Toolkit for Quantifying and Communicating the Uncertainty of AI [49.64037266892634]
We describe an open source Python toolkit named Uncertainty Quantification 360 (UQ360) for the uncertainty quantification of AI models.
The goal of this toolkit is twofold: first, to provide a broad range of capabilities to streamline as well as foster the common practices of quantifying, evaluating, improving, and communicating uncertainty in the AI application development lifecycle; second, to encourage further exploration of UQ's connections to other pillars of trustworthy AI.
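UQ360's own API is not reproduced here; as a generic illustration of one capability such a toolkit covers, the sketch below derives prediction intervals from quantile gradient boosting in scikit-learn (a stand-in approach for illustration, not UQ360 code):

```python
# Generic uncertainty-quantification illustration (not the UQ360 API):
# prediction intervals from quantile gradient boosting in scikit-learn.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)   # noisy synthetic data

# Fit one model per quantile to get a 90% prediction interval plus a median.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.5, 0.95)
}

X_test = np.linspace(0, 10, 5).reshape(-1, 1)
lower, median, upper = (models[q].predict(X_test) for q in (0.05, 0.5, 0.95))
for x, lo, md, hi in zip(X_test[:, 0], lower, median, upper):
    print(f"x={x:4.1f}  predicted={md:+.2f}  90% interval=({lo:+.2f}, {hi:+.2f})")
```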
arXiv Detail & Related papers (2021-06-02T18:29:04Z)
- Integrated Benchmarking and Design for Reproducible and Accessible Evaluation of Robotic Agents [61.36681529571202]
We describe a new concept for reproducible robotics research that integrates development and benchmarking.
One of the central components of this setup is the Duckietown Autolab, a standardized setup that is itself relatively low-cost and reproducible.
We validate the system by analyzing the repeatability of experiments conducted using the infrastructure and show that there is low variance across different robot hardware and across different remote labs.
arXiv Detail & Related papers (2020-09-09T15:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.