Related papers: Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

URL: http://arxiv.org/abs/2602.03619v1
Date: Tue, 03 Feb 2026 15:09:56 GMT
Title: Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
Authors: Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou,
Abstract summary: We propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation.<n>We first construct a dataset of DeepResearch-style annotated queries with human preferences over paired reports, and train rubric generators via reinforcement learning.<n>We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies.
Score: 80.12435680651488
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.

Related papers

SimGR: Escaping the Pitfalls of Generative Decoding in LLM-based Recommendation [68.00727783181289]
A core objective in recommender systems is to accurately model the distribution of user preferences over items to enable personalized recommendations.<n>We observe that existing methods inevitably introduce systematic bias when estimating item-level preference distributions.<n>We propose textbfSimply textbfGenerative textbfRecommendation (textbfSimGR), a framework that directly models item-level preference distributions in a shared latent space.
arXiv Detail & Related papers (2026-02-08T07:26:52Z)
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research [85.51475655916026]
AgentCPM-Report is a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process.<n>Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines.<n>Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems.
arXiv Detail & Related papers (2026-02-06T09:45:04Z)
MSRS: Evaluating Multi-Source Retrieval-Augmented Generation [51.717139132190574]
Many real-world applications demand the ability to integrate and summarize information scattered across multiple sources.<n>We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources.
arXiv Detail & Related papers (2025-08-28T14:59:55Z)
Test-time Corpus Feedback: From Retrieval to RAG [21.517949407443453]
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks.<n>Most RAG pipelines treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction.<n>Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback.
arXiv Detail & Related papers (2025-08-21T10:57:38Z)
Retrieval-Augmented Recommendation Explanation Generation with Hierarchical Aggregation [5.656477996187559]
Explainable Recommender System (ExRec) provides transparency to the recommendation process, increasing users' trust and boosting the operation of online services.<n>Existing LLM-based ExRec models suffer from profile deviation and high retrieval overhead, hindering their deployment.<n>We propose Retrieval-Augmented Recommendation Explanation Generation with Hierarchical Aggregation (REXHA)
arXiv Detail & Related papers (2025-07-12T08:15:05Z)
Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG)<n>RAG requires source-aware, multi-hop reasoning over diverse, sparsed, but related sources.<n>We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation allows to enhance Large Language Models with external knowledge. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation [35.981443744108255]
We propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.
arXiv Detail & Related papers (2024-06-18T12:52:51Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation [44.93777271276723]
We propose two approaches that enable annotation-free and scalable training by creating pseudo querydocument pairs. The query extraction method involves selecting salient spans from the original document to generate pseudo queries. The transferred query generation method utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries.
arXiv Detail & Related papers (2022-12-17T10:43:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.