Related papers: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

URL: http://arxiv.org/abs/2506.21506v2
Date: Thu, 03 Jul 2025 15:47:40 GMT
Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Authors: Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Abstract summary: Agentic search systems autonomously browse the web, synthesize information, and return comprehensive citation-backed answers. Mind2Web 2 is a benchmark of 130 realistic, high-quality, and long-horizon tasks constructed with over 1000 hours of human labor. Our method constructs task-specific judge agents based on a tree-structured design to automatically assess both answer correctness and source attribution.
Score: 34.672897171399775
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
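
The abstract describes a tree-structured rubric whose judge agents assess both answer correctness and source attribution. The following is a minimal sketch of how such a rubric tree could be scored, not the authors' implementation: the `RubricNode` class, the `critical` flag, the mean-of-children aggregation, and the example node names are all assumptions made for illustration. Leaf verdicts are hard-coded here, whereas in the paper they would come from judge agents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (all names are illustrative)."""
    name: str
    score: Optional[float] = None   # leaf verdict in [0, 1], supplied by a judge
    critical: bool = False          # assumed rule: a failed critical child zeroes its parent
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Return a leaf's verdict, or the mean of the children's aggregated scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no verdict")
        return node.score
    scored = [(child, aggregate(child)) for child in node.children]
    # Assumed gating rule: any critical child that scores 0 fails the whole subtree.
    if any(child.critical and value == 0.0 for child, value in scored):
        return 0.0
    return sum(value for _, value in scored) / len(scored)


# Hypothetical rubric: one correctness branch and one attribution check.
rubric = RubricNode("task", children=[
    RubricNode("answer_correct", children=[
        RubricNode("item_1_matches", score=1.0),
        RubricNode("item_2_matches", score=0.0),
    ]),
    RubricNode("sources_support_answer", score=1.0, critical=True),
])

print(aggregate(rubric))  # 0.75
```

In this sketch the correctness branch averages to 0.5 and the attribution check contributes 1.0, giving 0.75 overall. Setting the attribution leaf to 0.0 would drive the whole task to 0.0 under the assumed gating rule.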

Related papers

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
Deep Research Agents: A Systematic Examination And Roadmap [79.04813794804377]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)
AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search [58.98450205734779]
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. Existing agent search methods suffer from three major limitations. We introduce a comprehensive framework to address these challenges.
arXiv Detail & Related papers (2025-06-06T12:07:23Z)
WebDancer: Towards Autonomous Information Seeking Agency [67.08393201285499]
Recent progress in agentic systems underscores the potential for autonomous multi-step research. We present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. We instantiate this framework in WebDancer, a web agent based on ReAct.
arXiv Detail & Related papers (2025-05-28T17:57:07Z)
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z)
Level-Navi Agent: A Framework and benchmark for Chinese Web Search Agents [9.003325286793288]
Large language models (LLMs), adopted to understand human language, drive the development of artificial intelligence (AI) web search agents. We propose a general-purpose and training-free web search agent by level-aware navigation, Level-Navi Agent, accompanied by a well-annotated dataset (Web24) and a suitable evaluation metric.
arXiv Detail & Related papers (2024-12-20T08:03:12Z)
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher [20.729251584466983]
We introduce MindSearch to mimic the human minds in web information seeking and integration. The framework can be instantiated by a simple yet effective LLM-based multi-agent framework. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth.
arXiv Detail & Related papers (2024-07-29T17:12:40Z)
Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.