Related papers: ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

URL: http://arxiv.org/abs/2507.16280v1
Date: Tue, 22 Jul 2025 06:51:26 GMT
Title: ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
Authors: Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu
Abstract summary: We introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of deep AI research systems. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios. OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions.
Score: 22.615102398311432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
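The factual assessment described in the abstract can be illustrated with a minimal sketch. It assumes that faithfulness is the fraction of cited claims whose citation actually supports them, and that groundedness is the fraction of all claims that carry a citation. These definitions are one reading of "citation accuracy" and "coverage", not the paper's confirmed formulas. The `Claim` type and its `supported` flag (standing in for a judge's verdict) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One factual claim extracted from a generated report (hypothetical type)."""
    text: str
    citation: Optional[str] = None  # reference id or URL; None if the claim is uncited
    supported: bool = False         # assumed judge verdict: does the cited source back the claim?


def faithfulness(claims: List[Claim]) -> float:
    """Assumed definition: share of cited claims that their citation supports."""
    cited = [c for c in claims if c.citation is not None]
    if not cited:
        return 0.0
    return sum(c.supported for c in cited) / len(cited)


def groundedness(claims: List[Claim]) -> float:
    """Assumed definition: share of all claims that carry any citation."""
    if not claims:
        return 0.0
    return sum(c.citation is not None for c in claims) / len(claims)
```

Under these assumptions, a report with four claims, three of them cited and two of those supported, would score 2/3 on faithfulness and 0.75 on groundedness.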

Related papers

AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5452803680643]
We present a comprehensive survey on AI for Research (AI4Research). We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. We identify key research gaps and highlight promising future directions.
arXiv Detail & Related papers (2025-07-02T17:19:20Z)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications [3.002468101812191]
We analyze more than 80 commercial and non-commercial implementations that have emerged since 2023. We propose a novel hierarchical taxonomy that categorizes systems according to four fundamental technical dimensions. Our analysis reveals both the significant capabilities of current implementations and the technical and ethical challenges they present.
arXiv Detail & Related papers (2025-06-14T18:19:05Z)
The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
AI-Researcher: Autonomous Scientific Innovation [13.58669328864436]
We introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation.
arXiv Detail & Related papers (2025-05-24T13:54:38Z)
Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions [0.0]
Agentic AI systems are capable of reasoning, planning, and autonomous decision-making. They are transforming how scientists perform literature review, generate hypotheses, conduct experiments, and analyze results.
arXiv Detail & Related papers (2025-03-12T01:00:05Z)
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems [40.10425916520717]
In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. This paper presents a systematic review of the progress in this domain. We organize relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication.
arXiv Detail & Related papers (2025-03-03T11:27:13Z)
Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation [58.064940977804596]
A plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. Ethical concerns regarding shortcomings of these tools and potential for misuse take a particularly prominent place in our discussion.
arXiv Detail & Related papers (2025-02-07T18:26:45Z)
A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research. Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z)
SurveyAgent: A Conversational System for Personalized and Efficient Research Survey [50.04283471107001]
This paper introduces SurveyAgent, a novel conversational system designed to provide personalized and efficient research survey assistance to researchers. SurveyAgent integrates three key modules: Knowledge Management for organizing papers, Recommendation for discovering relevant literature, and Query Answering for engaging with content on a deeper level. Our evaluation demonstrates SurveyAgent's effectiveness in streamlining research activities, showcasing its capability to facilitate how researchers interact with scientific literature.
arXiv Detail & Related papers (2024-04-09T15:01:51Z)
