Related papers: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

URL: http://arxiv.org/abs/2509.04499v1
Date: Tue, 02 Sep 2025 00:32:38 GMT
Title: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu,
Abstract summary: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices.<n>We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
Score: 50.97612134791782
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40--80% across systems.

Related papers

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents [76.29382561831105]
Deep Research agents generate explicit natural language reasoning before each search call.<n> Reasoning-Aware Retrieval embeds the agent's reasoning trace alongside its query.<n>DR- Synth generates Deep Research retriever training data from standard QA datasets.<n>AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch.
arXiv Detail & Related papers (2026-03-04T18:47:26Z)
How Far Are We from Genuinely Useful Deep Research Agents? [48.596990593729]
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis.<n>Current benchmarks for report synthesis suffer from task complexity and subjective metrics.<n>We present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks.
arXiv Detail & Related papers (2025-12-01T17:58:59Z)
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia.<n>DeepEval is a comprehensive suite covering both content- and report-level quality.<n>Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration.<n> evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities.<n>We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z)
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.58638285105971]
This paper tackles textbfopen-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports.<n>We introduce textbfWebWeaver, a novel dual-agent framework that emulates the human research process.<n>Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym.
arXiv Detail & Related papers (2025-09-16T17:57:21Z)
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks [14.371010711040304]
ReportBench is a benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs)<n>Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports.
arXiv Detail & Related papers (2025-08-14T03:33:43Z)
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus.<n>This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks.<n> evaluating Deep Research Agents is inherently complex and labor-intensive.<n>We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)
WebThinker: Empowering Large Reasoning Models with Deep Research Capability [60.81964498221952]
WebThinker is a deep research agent that empowers large reasoning models to autonomously search the web, navigate web pages, and draft research reports during the reasoning process.<n>It also employs an textbfAutonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time.<n>Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems.
arXiv Detail & Related papers (2025-04-30T16:25:25Z)
The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations [40.498553309980764]
We study the interplay between verifiability and utility of information-sharing tools. We find that users prefer search engines over large language models for high-stakes queries.
arXiv Detail & Related papers (2024-11-26T12:34:52Z)
Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions [89.35345649303451]
Generative search engines have the potential to transform how people seek information online. But generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system.
arXiv Detail & Related papers (2024-02-25T11:22:19Z)
Evaluating Verifiability in Generative Search Engines [70.59477647085387]
Generative search engines directly generate responses to user queries, along with in-line citations. We conduct human evaluation to audit four popular generative search engines. We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations.
arXiv Detail & Related papers (2023-04-19T17:56:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.