Towards Personalized Deep Research: Benchmarks and Evaluations
- URL: http://arxiv.org/abs/2509.25106v1
- Date: Mon, 29 Sep 2025 17:39:17 GMT
- Title: Towards Personalized Deep Research: Benchmarks and Evaluations
- Authors: Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
- Abstract summary: We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
- Score: 56.581105664044436
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on closed-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
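The abstract names the three PQR dimensions but does not spell out how each score is computed or how they are combined. Purely as an illustrative sketch (not the paper's actual protocol), the Python snippet below scores a single report along (P), (Q), and (R) with a generic LLM-as-judge callable and averages the results; the `judge` interface, the 0-1 scale, and the equal weights are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class PQRScore:
    personalization: float  # (P) alignment with the user's profile and context, assumed 0-1
    quality: float          # (Q) content quality of the generated report, assumed 0-1
    reliability: float      # (R) factual reliability of the report's claims, assumed 0-1

def score_report(report: str, profile: dict, judge) -> PQRScore:
    """Score one generated report against one user profile.

    `judge(dimension, report, profile)` is a hypothetical callable, e.g. an
    LLM-as-judge prompt, assumed to return a float in [0, 1].
    """
    return PQRScore(
        personalization=judge("personalization_alignment", report, profile),
        quality=judge("content_quality", report, profile),
        reliability=judge("factual_reliability", report, profile),
    )

def aggregate(score: PQRScore, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Equal-weight average of P, Q, R; the paper's actual aggregation may differ."""
    wp, wq, wr = weights
    return (
        wp * score.personalization
        + wq * score.quality
        + wr * score.reliability
    )
```

In practice, each call to `judge` would carry a dimension-specific rubric, and per-dimension scores would likely be reported separately rather than collapsed into a single aggregate.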
Related papers
- Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z)
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation [56.886936435727854]
DeepResearchEval is an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.
arXiv Detail & Related papers (2026-01-14T18:38:31Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities. We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z)
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports [24.09178055088843]
Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness.
arXiv Detail & Related papers (2025-10-02T16:40:02Z)
- DRBench: A Realistic Benchmark for Enterprise Deep Research [81.49694432639406]
DRBench is a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance.
arXiv Detail & Related papers (2025-09-30T18:47:20Z)
- Benchmarking Computer Science Survey Generation [18.844790013427282]
SurGE (Survey Generation Evaluation) is a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality.
arXiv Detail & Related papers (2025-08-21T15:45:10Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry [22.615102398311432]
We introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of deep AI research systems. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios. OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions.
arXiv Detail & Related papers (2025-07-22T06:51:26Z)
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)
- Personalized Generation In Large Model Era: A Survey [90.7579254803302]
In the era of large models, content generation is gradually shifting to Personalized Generation (PGen). This paper presents the first comprehensive survey on PGen, investigating existing research in this rapidly growing field. By bridging PGen research across multiple modalities, this survey serves as a valuable resource for fostering knowledge sharing and interdisciplinary collaboration.
arXiv Detail & Related papers (2025-03-04T13:34:19Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Deep Learning for Person Re-identification: A Survey and Outlook [233.36948173686602]
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras.
By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings.
arXiv Detail & Related papers (2020-01-13T12:49:22Z)