Related papers: DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

URL: http://arxiv.org/abs/2406.06769v1
Date: Mon, 10 Jun 2024 20:08:44 GMT
Title: DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
Authors: Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark
Abstract summary: We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. It includes 120 different challenge tasks spanning eight topics each with three levels of difficulty and several parametric variations. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks.
Score: 49.74065769505137
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld
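The task count in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the 120 tasks are spread evenly over every topic and difficulty pair, which the abstract does not state outright, and the integer identifiers are hypothetical placeholders rather than names from the DISCOVERYWORLD code base.

```python
from itertools import product

# Figures taken from the abstract.
NUM_TOPICS = 8
NUM_DIFFICULTIES = 3
TOTAL_TASKS = 120

# Assumption: every (topic, difficulty) pair has the same number of
# parametric variations. The abstract only says "several".
variations_per_cell, remainder = divmod(TOTAL_TASKS, NUM_TOPICS * NUM_DIFFICULTIES)
assert remainder == 0, "tasks do not divide evenly over topic/difficulty pairs"

# Hypothetical integer identifiers: (topic, difficulty, variation).
task_grid = list(
    product(range(NUM_TOPICS), range(NUM_DIFFICULTIES), range(variations_per_cell))
)

print(variations_per_cell)  # variations per (topic, difficulty) pair
print(len(task_grid))       # total number of task instances
```

Under that even-split assumption, the figures are consistent with five parametric variations per topic and difficulty level.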

Related papers

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z)
Can AI Agents Design and Implement Drug Discovery Pipelines? [1.5848629658789695]
Current AI agent-based systems demonstrate proficiency in solving programming challenges and conducting research. This paper introduces DO Challenge, a benchmark designed to evaluate the decision-making abilities of AI agents. We present the Deep Thought multi-agent system, which demonstrated strong performance on the benchmark, outperforming most human teams.
arXiv Detail & Related papers (2025-04-28T15:41:28Z)
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents [11.74019905854637]
Large language models (LLMs) are evolving into scientific agents that automate critical tasks. Unlike general-purpose LLMs, specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields.
arXiv Detail & Related papers (2025-03-31T13:11:28Z)
Scaling Laws in Scientific Discovery with AI and Robot Scientists [72.3420699173245]
An autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle. AGS aims to significantly reduce the time and resources needed for scientific discovery. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws.
arXiv Detail & Related papers (2025-03-28T14:00:27Z)
HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version) [0.0]
HIVEX is an environment suite to benchmark multi-agent research focusing on ecological challenges. We provide environments, training examples, and baselines for the main and sub-tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face.
arXiv Detail & Related papers (2025-01-07T23:16:31Z)
ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks. This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o. The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z)
Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation [48.29699224989952]
VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. We show that this multi-agent approach outperforms the state-of-the-art method in producing novel and impactful scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z)
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark [11.794931453828974]
CORE-Bench is a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
arXiv Detail & Related papers (2024-09-17T17:13:19Z)
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [129.08019405056262]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI). MLMs and WMs have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI.
arXiv Detail & Related papers (2024-07-09T14:14:47Z)
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
"Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently. We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z)
WESE: Weak Exploration to Strong Exploitation for LLM Agents [95.6720931773781]
This paper proposes a novel approach, Weak Exploration to Strong Exploitation (WESE), to enhance LLM agents in solving open-world interactive tasks. WESE involves decoupling the exploration and exploitation process, employing a cost-effective weak agent to perform exploration tasks for global knowledge. A knowledge graph-based strategy is then introduced to store the acquired knowledge and extract task-relevant knowledge, enhancing the stronger agent in success rate and efficiency for the exploitation task.
arXiv Detail & Related papers (2024-04-11T03:31:54Z)
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind. This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z)