Related papers: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

URL: http://arxiv.org/abs/2505.19897v1
Date: Mon, 26 May 2025 12:27:27 GMT
Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Authors: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
Abstract summary: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
Score: 82.07367406991678
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Related papers

Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs). It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm. Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z)
SciSciGPT: Advancing Human-AI Collaboration in the Science of Science [7.592219145267612]
Recent advances in large language models (LLMs) and AI agents have opened new possibilities for human-AI collaboration. We introduce SciSciGPT, an open-source, prototype AI collaborator that uses the science of science as a testbed to explore the potential of LLM-powered research tools.
arXiv Detail & Related papers (2025-04-07T23:19:39Z)
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents [11.74019905854637]
Large language models (LLMs) are evolving into scientific agents that automate critical tasks. Unlike general-purpose LLMs, specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields.
arXiv Detail & Related papers (2025-03-31T13:11:28Z)
Large Language Model Agent: A Survey on Methodology, Applications and Challenges [88.3032929492409]
Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time.
arXiv Detail & Related papers (2025-03-27T12:50:17Z)
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z)
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations. We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
SciOps: Achieving Productivity and Reliability in Data-Intensive Research [0.8414742293641504]
Scientists are increasingly leveraging advances in instruments, automation, and collaborative tools to scale up their experiments and research goals. Various scientific disciplines, including neuroscience, have adopted key technologies to enhance collaboration, inspiration and automation. We introduce a five-level Capability Maturity Model describing the principles of rigorous scientific operations.
arXiv Detail & Related papers (2023-12-29T21:37:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.