SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
- URL: http://arxiv.org/abs/2602.12984v1
- Date: Fri, 13 Feb 2026 14:58:18 GMT
- Title: SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
- Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
- Abstract summary: We introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines. We also present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities.
- Score: 100.12367115920121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
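The abstract describes SciForge only at a high level: tools form a dependency graph, and training trajectories are generated so that they respect that graph. The following is a minimal sketch of that general idea, not the authors' implementation. The tool names, the `TOOL_DEPS` mapping, and both functions are hypothetical, and the sampling rule (a randomized topological order over the tools a target needs) is an assumption chosen for illustration.

```python
import random

# Hypothetical tool graph (an assumption for illustration): each tool maps
# to the tools whose outputs it needs before it can be called.
TOOL_DEPS = {
    "load_structure": [],
    "compute_mass": ["load_structure"],
    "optimize_geometry": ["load_structure"],
    "compute_energy": ["optimize_geometry"],
    "summarize": ["compute_mass", "compute_energy"],
}


def required_tools(target, deps):
    """Return the target plus every tool it transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])
    return needed


def sample_trajectory(target, deps, rng):
    """Sample one call order that respects dependencies and ends at target."""
    needed = required_tools(target, deps)
    done, order = set(), []
    while len(order) < len(needed):
        # A tool is ready once all of its prerequisites have been called.
        ready = sorted(
            t for t in needed - done if all(d in done for d in deps[t])
        )
        if not ready:
            raise ValueError("dependency cycle detected")
        choice = rng.choice(ready)
        done.add(choice)
        order.append(choice)
    return order


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_trajectory("summarize", TOOL_DEPS, rng))
```

Because every sampled order is a valid topological order of the target's prerequisites, each trajectory calls a tool only after its inputs exist, while the random choice among ready tools yields varied orderings of independent steps.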
Related papers
- Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - S1-NexusAgent: a Self-Evolving Agent Framework for Multidisciplinary Scientific Research [0.0]
We propose S1-NexusAgent, a self-evolving agent framework for scientific research. S1-NexusAgent adopts a hierarchical Plan-and-CodeAct execution paradigm, decoupling global scientific planning from subtask-level tool execution. S1-NexusAgent achieves state-of-the-art generalization performance, validating its effectiveness and capability in complex scientific tasks.
arXiv Detail & Related papers (2026-02-02T02:33:25Z) - A Cloud-based Multi-Agentic Workflow for Science [0.12314765641075438]
Large Language Models (LLMs) have become ubiquitous across various scientific domains. Their inability to perform complex tasks such as running simulations, or to make complex decisions, limits their utility. We present a domain-agnostic, model-independent workflow for an agentic framework that can act as a scientific assistant while running entirely in the cloud.
arXiv Detail & Related papers (2026-01-18T22:37:09Z) - Deploy-Master: Automating the Deployment of 50,000+ Agent-Ready Scientific Tools in One Day [37.83274797886782]
Deploy-Master is a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. In a single day, we performed 52,550 build attempts and constructed reproducible environments for 50,112 scientific tools. We report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale.
arXiv Detail & Related papers (2026-01-07T02:00:13Z) - Bohrium + SciMaster: Building the Infrastructure and Ecosystem for Agentic Science at Scale [82.20980951765891]
We argue that scaling agentic science requires an infrastructure-and-ecosystem approach, instantiated as Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets that turns diverse scientific data, software, compute, and laboratory systems into agent-ready capabilities. SciMaster orchestrates these capabilities into long-horizon scientific workflows, on which scientific agents can be composed and executed.
arXiv Detail & Related papers (2025-12-23T16:04:41Z) - UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action [77.63125913907771]
We present UltraCUA, a foundation model that bridges the gap between GUI primitives and high-level programmatic tool calls. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents.
arXiv Detail & Related papers (2025-10-20T17:48:26Z) - SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration [39.43814195462455]
SciToolAgent automates hundreds of scientific tools across biology, chemistry, and materials science. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage.
arXiv Detail & Related papers (2025-07-27T13:55:35Z) - ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z) - SciAgent: Tool-augmented Language Models for Scientific Reasoning [129.51442677710452]
We introduce a new task setting named tool-augmented scientific reasoning.
This setting supplements Large Language Models with scalable toolsets.
We construct a tool-augmented training corpus named MathFunc which encompasses over 30,000 samples and roughly 6,000 tools.
Building on MathFunc, we develop SciAgent to retrieve, understand and, if necessary, use tools for scientific problem solving.
arXiv Detail & Related papers (2024-02-18T04:19:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.