PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature
- URL: http://arxiv.org/abs/2510.10909v3
- Date: Mon, 10 Nov 2025 03:21:21 GMT
- Title: PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature
- Authors: Daoyu Wang, Mingyue Cheng, Shuo Yu, Zirui Liu, Ze Guo, Qi Liu,
- Abstract summary: We propose PaperArena, an evaluation benchmark for large language model (LLM) based agents to address real-world research questions. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding and reasoning on the web-scale scientific literature is a crucial touchstone for large language model (LLM) based agents designed to support complex knowledge-intensive tasks. However, existing works are mainly restricted to tool-free tasks within isolated papers, largely due to the lack of a benchmark for cross-paper reasoning and multi-tool orchestration in real research scenarios. In this work, we propose PaperArena, an evaluation benchmark for agents to address real-world research questions that typically require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should integrate diverse formats across multiple papers through reasoning and interacting with appropriate tools, thereby producing a well-grounded answer. To support standardized evaluation, we provide a modular and extensible platform for agent execution, offering tools such as multimodal parsing, context retrieval, and programmatic computation. Experimental results reveal that even the most advanced LLM powering a well-established agent system achieves merely 38.78% average accuracy. On the hard subset, accuracy drops to only 18.47%, highlighting great potential for improvement. We also present several empirical findings, including that all agents tested exhibit inefficient tool usage, often invoking more tools than necessary to solve a task. We invite the community to adopt PaperArena to develop and evaluate more capable agents for scientific discovery. Our code and data are available at https://github.com/Melmaphother/PaperArena.
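The abstract describes a modular, extensible platform in which an agent invokes tools such as retrieval and computation by name. A minimal sketch of that kind of tool registry is shown below; the class, tool names, and signatures are illustrative assumptions, not PaperArena's actual API.

```python
# Hypothetical sketch of a modular tool platform: an agent selects a tool by
# name and the registry dispatches the call uniformly. Names are invented.
from typing import Callable, Dict


class ToolRegistry:
    """Maps tool names to callables so an agent loop can invoke them uniformly."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, query: str) -> str:
        if name not in self._tools:
            return f"error: unknown tool '{name}'"
        return self._tools[name](query)


registry = ToolRegistry()
registry.register("retrieve", lambda q: f"passages matching '{q}'")
registry.register("parse_figure", lambda ref: f"caption and data for {ref}")

print(registry.invoke("retrieve", "cross-paper evidence"))
```

Because every tool shares one string-in/string-out interface, new tools (e.g. programmatic computation) can be added without changing the agent loop, which matches the "extensible platform" framing in the abstract.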
Related papers
- MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use [12.220519951554133]
MCPAgentBench is a benchmark based on real-world MCP definitions to evaluate the tool-use capabilities of agents. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors. Experiments conducted on a range of recent mainstream Large Language Models reveal significant performance differences in handling complex, multi-step tool invocations.
arXiv Detail & Related papers (2025-12-31T02:09:48Z) - ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? [29.17900668495058]
We introduce ReplicationBench, an evaluation framework for frontier AI agents. It tests whether agents can replicate entire research papers drawn from the astrophysics literature. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks.
arXiv Detail & Related papers (2025-10-28T16:21:19Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - InfoAgent: Advancing Autonomous Information-Seeking Agents [143.15973604285304]
We introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS.
arXiv Detail & Related papers (2025-09-29T17:59:57Z) - GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments [56.007498767771075]
GSM-Agent is a novel benchmark for evaluating agentic reasoning in complex environments. We analyze agentic reasoning patterns by clustering the environment's document embeddings into nodes and mapping each tool call to its nearest node. We propose a tool-augmented test-time scaling method that improves LLMs' agentic reasoning performance by adding tools that encourage models to revisit.
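The summary's analysis step (assign each tool call to the nearest embedding-cluster node) can be sketched as below, assuming cosine similarity over precomputed centroids; the toy 2-D vectors are stand-ins for real document embeddings.

```python
# Illustrative nearest-node assignment: cluster centroids play the role of
# "nodes", and a tool call's embedding is mapped to the most similar one.
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def nearest_node(call_emb: List[float], nodes: List[List[float]]) -> int:
    """Index of the node centroid most similar to the tool-call embedding."""
    return max(range(len(nodes)), key=lambda i: cosine(call_emb, nodes[i]))


nodes = [[1.0, 0.0], [0.0, 1.0]]        # two document-cluster centroids
print(nearest_node([0.9, 0.1], nodes))  # → 0
```

Tracing the sequence of node indices visited by an agent's tool calls yields the reasoning-pattern trajectories the summary refers to.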
arXiv Detail & Related papers (2025-09-26T07:24:37Z) - Accelerating Discovery: Rapid Literature Screening with LLMs [1.2586771241101986]
Researchers must review and filter a large number of unstructured sources, which frequently contain sparse information. We developed a Large Language Model (LLM) assistant to support the search and filtering of documents.
arXiv Detail & Related papers (2025-09-16T14:01:44Z) - SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z) - RExBench: Can coding agents autonomously implement AI research extensions? [14.147417159347448]
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. We argue that research extension and its implementation is a critical capability for such systems. We introduce RExBench to support the evaluation of this capability.
arXiv Detail & Related papers (2025-06-27T19:41:41Z) - Deep Research Agents: A Systematic Examination And Roadmap [109.53237992384872]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
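The ReAct-style interaction loop this summary mentions alternates a model-produced thought/action step with a tool observation until a final answer is emitted. A minimal sketch, with an invented stand-in policy and tool in place of a real LLM and remote-sensing toolkit:

```python
# Minimal ReAct-style loop: alternate (thought, action) steps with tool
# observations until the policy emits a "finish" action. All names invented.
def fake_policy(history):
    """Stand-in for an LLM call: returns a (thought, action, argument) tuple."""
    if not history:
        return ("need imagery statistics", "count_objects", "ships")
    return ("enough evidence gathered", "finish", history[-1][1])


def react_loop(policy, tools, max_steps=5):
    history = []  # list of (action, observation) pairs fed back to the policy
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        if action == "finish":
            return arg
        observation = tools[action](arg)
        history.append((action, observation))
    return None  # step budget exhausted


tools = {"count_objects": lambda target: f"found 7 {target}"}
print(react_loop(fake_policy, tools))  # → found 7 ships
```

The step budget (`max_steps`) is one place where the inefficient tool usage reported by PaperArena would surface: an agent that invokes more tools than necessary burns through its budget before finishing.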
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - LLM Agents Making Agent Tools [2.5529148902034637]
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks. But these tools must be implemented in advance by human developers. We propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools.
arXiv Detail & Related papers (2025-02-17T11:44:11Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.