Related papers: SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

URL: http://arxiv.org/abs/2507.05241v2
Date: Tue, 08 Jul 2025 15:54:19 GMT
Title: SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
Authors: Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
Abstract summary: We introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers. X-Masters sets a new state-of-the-art record on Humanity's Last Exam with a score of 32.1%.
Score: 51.112225746095746
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.

Related papers

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training [67.895981259683]
General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence. Current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools. We present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework.
arXiv Detail & Related papers (2025-08-01T08:11:31Z)
Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce [45.348336032930845]
We introduce a novel framework to assess which occupational tasks workers want AI agents to automate or augment. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires. We construct the WORKBank database to capture preferences from 1,500 domain workers and capability assessments from AI experts.
arXiv Detail & Related papers (2025-06-06T23:05:52Z)
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search [16.93028430619359]
The AI Scientist-v2 is an end-to-end agentic system capable of producing the first entirely AI-generated peer-review-accepted workshop paper. It iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. One manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review.
arXiv Detail & Related papers (2025-04-10T18:44:41Z)
"Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently. We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z)
Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model.
arXiv Detail & Related papers (2024-02-28T16:09:56Z)
Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI). We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z)
OpenAGI: When LLM Meets Domain Experts [51.86179657467822]
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents. We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.