Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
- URL: http://arxiv.org/abs/2507.05495v1
- Date: Mon, 07 Jul 2025 21:35:09 GMT
- Title: Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
- Authors: Prahaladh Chandrahasan, Jiahe Jin, Zhihan Zhang, Tevin Wang, Andy Tang, Lucy Mo, Morteza Ziyadi, Leonardo F. R. Ribeiro, Zimeng Qiu, Markus Dreyer, Akari Asai, Chenyan Xiong
- Abstract summary: We introduce Deep Research Comparator, a platform that offers a holistic framework for evaluating deep research agents. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform's utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at https://www.youtube.com/watch?v=g4d2dnbdseg.
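The abstract mentions ranking calculation from side-by-side annotator preferences but does not specify the method. Leaderboards built from pairwise comparisons are commonly fit with a Bradley-Terry model; the following is a hypothetical minimal sketch under that assumption, with illustrative agent names and vote counts (none of this data comes from the paper).

```python
# Hypothetical sketch: derive an agent ranking from pairwise human
# preferences with a Bradley-Terry model, fit by the standard
# minorization-maximization update. Agent names and counts are made up.

from collections import defaultdict

def bradley_terry(wins, iters=200):
    """wins[(a, b)] = number of times annotators preferred agent a over b."""
    agents = sorted({x for pair in wins for x in pair})
    strength = {a: 1.0 for a in agents}
    for _ in range(iters):
        new = {}
        for a in agents:
            # Total wins for agent a.
            num = sum(w for (i, j), w in wins.items() if i == a)
            # Comparison counts weighted by current strengths.
            den = 0.0
            for (i, j), w in wins.items():
                if a in (i, j):
                    den += w / (strength[i] + strength[j])
            new[a] = num / den if den else strength[a]
        # Normalize so strengths sum to the number of agents.
        total = sum(new.values())
        strength = {a: s * len(agents) / total for a, s in new.items()}
    return strength

# Illustrative pairwise preference counts from annotators.
votes = defaultdict(int)
votes[("agent_A", "agent_B")] = 7   # A preferred over B 7 times
votes[("agent_B", "agent_A")] = 3
votes[("agent_A", "agent_C")] = 6
votes[("agent_C", "agent_A")] = 4
votes[("agent_B", "agent_C")] = 5
votes[("agent_C", "agent_B")] = 5

scores = bradley_terry(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # agent_A should rank first under these counts
```

An Elo-style online update would be an alternative design; Bradley-Terry is shown here because it is the usual batch choice for chatbot-arena-style pairwise preference data.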
Related papers
- Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue [61.0689885044492]
We explore the challenges of realizing dialogue agents that can effectively assist meta-reviewers. We first address the issue of data scarcity for training dialogue agents. We utilize this data to train dialogue agents tailored for meta-reviewing.
arXiv Detail & Related papers (2025-08-07T11:27:43Z)
- Characterizing Deep Research: A Benchmark and Formal Definition [24.523394260858822]
We propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process.
arXiv Detail & Related papers (2025-08-06T08:09:28Z)
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
- Deep Research Agents: A Systematic Examination And Roadmap [79.04813794804377]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z)
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)
- DeepShop: A Benchmark for Deep Research Shopping Agents [70.03744154560717]
DeepShop is a benchmark designed to evaluate web agents in complex and realistic online shopping environments. We generate diverse queries across five popular online shopping domains. We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects.
arXiv Detail & Related papers (2025-06-03T13:08:17Z)
- DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research [25.368303145176554]
DeepResearchGym is an open-source sandbox that combines a search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is freely available for research use.
arXiv Detail & Related papers (2025-05-25T18:16:13Z)
- Decomposed Opinion Summarization with Verified Aspect-Aware Modules [82.38097397662436]
We propose a domain-agnostic modular approach guided by review aspects. We conduct experiments across datasets representing scientific research, business, and product domains.
arXiv Detail & Related papers (2025-01-27T09:29:55Z)