DR-Arena: an Automated Evaluation Framework for Deep Research Agents
- URL: http://arxiv.org/abs/2601.10504v1
- Date: Thu, 15 Jan 2026 15:28:21 GMT
- Title: DR-Arena: an Automated Evaluation Framework for Deep Research Agents
- Authors: Yiwen Gao, Ruochen Zhao, Yang Deng, Wenxuan Zhang
- Abstract summary: Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis. Current benchmarks predominantly rely on static datasets, which suffer from limited task generality, temporal misalignment, and data contamination. We introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation.
- Score: 35.99095633093855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.
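The Spearman correlation quoted in the abstract measures how closely two rankings of the same agents agree. The following is a minimal sketch of that computation, not the authors' evaluation code. The score lists are hypothetical placeholders rather than values from the paper, and the helper names `ranks` and `spearman` are illustrative. It uses the closed-form formula for rankings without ties, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

```python
def ranks(scores):
    """Return 1-based ranks by descending score (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length score lists without ties."""
    n = len(a)
    rank_a, rank_b = ranks(a), ranks(b)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for six agents (NOT taken from the paper).
framework_scores = [0.91, 0.84, 0.77, 0.70, 0.62, 0.55]
# Hypothetical reference leaderboard scores; the 3rd and 4th agents swap places.
reference_scores = [1250, 1230, 1190, 1205, 1150, 1120]

rho = spearman(framework_scores, reference_scores)
print(round(rho, 3))
```

With six items and no ties, a single swap of two adjacent positions gives rho = 1 - 2/35, roughly 0.943, which illustrates how few ranking disagreements a value near 0.94 allows at this sample size.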
Related papers
- OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions [66.84396313837765]
We introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We provide a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. We also introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons.
arXiv Detail & Related papers (2026-02-05T16:31:43Z)
- IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning [54.21689544323704]
Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge. Unlike real-time conversational assistants, DR is computationally expensive and time-consuming. We propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research.
arXiv Detail & Related papers (2026-02-03T12:43:09Z)
- Step-DeepResearch Technical Report [90.50586290399683]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing. To bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios.
arXiv Detail & Related papers (2025-12-23T16:32:27Z)
- A Hierarchical Tree-based approach for creating Configurable and Static Deep Research Agent (Static-DRA) [0.0]
This paper introduces the Static Deep Research Agent (Static-DRA), a novel solution built upon a hierarchical Tree-based static workflow. The core contribution is the integration of two user-tunable parameters, Depth and Breadth, which provide granular control over the research intensity. The agent's architecture, comprising Supervisor, Independent, and Worker agents, facilitates effective multi-hop information retrieval.
arXiv Detail & Related papers (2025-12-03T15:37:13Z)
- Hierarchical Deep Research with Local-Web RAG: Toward Automated System-Level Materials Discovery [16.491889842339617]
We present a long-horizon, hierarchical deep research (DR) agent designed for complex materials and device discovery problems. Our framework instantiates a locally deployable DR instance that integrates local retrieval-augmented generation with large language model reasoners. We systematically evaluate across 27 nanomaterials/device topics using a large language model (LLM)-as-judge with five web-enabled state-of-the-art models as jurors.
arXiv Detail & Related papers (2025-11-23T05:57:42Z)
- ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [11.666923792025313]
Deep Research (DR) is an emerging agent application that leverages large language models to address open-ended queries. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration.
arXiv Detail & Related papers (2025-11-10T23:07:14Z)
- GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians [32.33432636089606]
Current benchmarks for AI clinician systems fail to capture the depth, robustness, and safety required for real-world clinical practice. We introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. We develop a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end.
arXiv Detail & Related papers (2025-10-15T16:40:28Z)
- DODO: Causal Structure Learning with Budgeted Interventions [1.0323063834827415]
We introduce DODO, an algorithm defining how an Agent can autonomously learn the causal structure of its environment. Results show better performance for DODO, compared to observational approaches, in all but the most limited resource conditions.
arXiv Detail & Related papers (2025-10-09T13:32:33Z)
- Dynamic Data Pruning for Automatic Speech Recognition [58.95758272440217]
We introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers fine-grained pruning granularities specifically tailored for speech-related datasets.
Our experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
arXiv Detail & Related papers (2024-06-26T14:17:36Z)
- How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval [80.54532535622988]
We show that a generalizable dense retriever can be trained to achieve high accuracy in both supervised and zero-shot retrieval.
DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.
arXiv Detail & Related papers (2023-02-15T03:53:26Z)
- Centralizing State-Values in Dueling Networks for Multi-Robot Reinforcement Learning Mapless Navigation [87.85646257351212]
We study the problem of multi-robot mapless navigation in the popular Centralized Training and Decentralized Execution (CTDE) paradigm.
This problem is challenging when each robot considers its path without explicitly sharing observations with other robots.
We propose a novel architecture for CTDE that uses a centralized state-value network to compute a joint state-value.
arXiv Detail & Related papers (2021-12-16T16:47:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.